Here in New York, at the AWS Summit, we are excited to announce that Alluxio 2.0 is here. It is our biggest release since the launch of Alluxio.
A couple of months ago we released the 2.0 Preview, which included some of these capabilities. The 2.0 release adds even more and continues to build on our data orchestration approach for the cloud.
We firmly believe that, just as compute and containers need orchestration by something like Kubernetes, data also needs orchestration: it is increasingly siloed, and it forms the working set for compute workloads. We call this data orchestration. Bringing data closer to compute to accelerate jobs, making data accessible through different APIs, and abstracting data applications from where the data is stored are the core concepts we have built on. The exhaustive list is in our release notes; some of the latest areas of focus include:
Breakthrough Data Orchestration Innovation for Multi-cloud
● Policy-driven Data Management
o Alluxio 2.0 includes a new capability that lets data engineers automate data movement across storage systems on an ongoing basis, driven by pre-defined policies. As data is created and then ages from hot to warm to cold, Alluxio can automatically tier it across any number of storage systems, on-premises and across clouds.
o Data platform teams can now reduce storage costs by automatically keeping only the most important data in expensive storage systems and moving the rest to cheaper storage alternatives.
● Improved Administration of Data Access Policies
o In addition to fine-grained policies at the file level, users can now configure policies at any directory or folder level to streamline data access and improve workload performance. These policies define how individual datasets behave for core functions such as writing data or syncing data with the storage systems under Alluxio.
● Efficient Cross-cloud Data Movement via Data Service
o The new data service enables highly efficient data movement, including across cloud stores such as AWS S3 and Google Cloud Storage (GCS), making expensive operations on object storage seamless to the compute framework.
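To make the policy-driven tiering idea above more concrete, here is a minimal Python sketch. It is not Alluxio's API: the `TierPolicy` class, the `plan_moves` helper, the tier names, the paths, and the age thresholds are all hypothetical, and the sketch assumes that "hot", "warm", and "cold" are decided purely by time since last access.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical policy: data that has not been accessed recently
    is assigned to a cheaper tier."""
    warm_after: timedelta  # not accessed for at least this long -> "warm"
    cold_after: timedelta  # not accessed for at least this long -> "cold"

    def tier_for(self, last_access: datetime, now: datetime) -> str:
        age = now - last_access
        if age >= self.cold_after:
            return "cold"
        if age >= self.warm_after:
            return "warm"
        return "hot"


def plan_moves(files, current_tiers, policy, now):
    """Return {path: (from_tier, to_tier)} for files whose tier should change."""
    moves = {}
    for path, last_access in files.items():
        target = policy.tier_for(last_access, now)
        if current_tiers.get(path) != target:
            moves[path] = (current_tiers.get(path), target)
    return moves


# Illustrative data only: the reference date and paths are arbitrary.
now = datetime(2019, 7, 1)
policy = TierPolicy(warm_after=timedelta(days=7), cold_after=timedelta(days=90))
files = {
    "/sales/today.parquet": now - timedelta(hours=2),
    "/sales/last_month.parquet": now - timedelta(days=30),
    "/sales/2017.parquet": now - timedelta(days=700),
}
current = {path: "hot" for path in files}
moves = plan_moves(files, current, policy, now)
for path, (src, dst) in moves.items():
    print(f"{path}: {src} -> {dst}")
```

In this sketch the policy only produces a plan; a real system would hand that plan to whatever service actually copies the data between storage systems, and would re-evaluate it on a schedule.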
Compute Optimized Data Access for Cloud Analytics
● Compute-focused Cluster Partitioning
o Users can now partition a single Alluxio cluster along any dimension, so that the datasets for one framework or workload aren't contaminated by those of another. The most common use is partitioning the cluster by framework (Spark, Presto, and so on). Partitioning also reduces data transfer costs by constraining data to stay within a specific zone or region.
● Integration with External Data Sources Over REST
o Users can now bring in data from web-based data sources and aggregate it in Alluxio for analytics. Any web location with files can simply be pointed to Alluxio and pulled in as needed, based on the query or model run.
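As a rough illustration of partitioning "along any dimension", the Python sketch below groups a set of workers by an arbitrary label such as framework or zone. The worker records, label names, and the `partition_workers` helper are hypothetical and only show the idea; they do not reflect how Alluxio configures or implements partitions.

```python
from collections import defaultdict


def partition_workers(workers, key):
    """Group worker hosts by an arbitrary label, e.g. "framework" or "zone"."""
    partitions = defaultdict(list)
    for worker in workers:
        partitions[worker[key]].append(worker["host"])
    return dict(partitions)


# Hypothetical cluster description, for illustration only.
workers = [
    {"host": "worker-1", "framework": "spark", "zone": "zone-a"},
    {"host": "worker-2", "framework": "spark", "zone": "zone-b"},
    {"host": "worker-3", "framework": "presto", "zone": "zone-a"},
]

by_framework = partition_workers(workers, "framework")
by_zone = partition_workers(workers, "zone")
print(by_framework)
print(by_zone)
```

Routing a Spark job only to the hosts under the "spark" label keeps its working set separate from Presto's, and grouping by zone instead keeps data within one zone, which is the transfer-cost case described above.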
Amazon AWS EMR Support
● AWS Elastic MapReduce (EMR) Service Integration
o As users move to cloud services to deploy analytical and AI workloads, services like AWS EMR are increasingly used. Alluxio can now be bootstrapped into an AWS EMR cluster, making it available as a data layer within EMR for the Spark, Presto, and Hive frameworks. Users now have a high-performance option for caching data from S3 or other remote storage while also reducing the number of data copies maintained in EMR.
Architectural Foundations Using Open Source
Many core foundational elements have been re-architected on top of leading open source technologies, with hyperscale in mind.
- RocksDB is now used to tier the metadata of the files and objects that Alluxio manages, enabling hyperscale deployments.
- gRPC, Google's highly efficient RPC framework, is now the core transport protocol for communication within the cluster as well as between the Alluxio client and master, making communications more efficient.
We hope you are as excited as we are! Give it a try now. Download Alluxio 2.0!