Announcing Alluxio 2.0

Alluxio 2.0 adds major capabilities to simplify and accelerate multi-cloud data analytics and AI workloads. It is Alluxio’s largest open source release to date, with the most new features and improvements since the creation of the project.

Dive into all of the Alluxio 2.0 features

Data Orchestration for Multi-Cloud

Automate data movement across storage systems based on predefined policies, on an ongoing basis. As data is created and ages from hot to warm to cold, Alluxio can automatically tier it across any number of storage systems, both on-premises and in any cloud.

With this, data platform teams can reduce storage costs by keeping only the most important data in expensive storage systems and moving the rest to cheaper storage alternatives.

In addition to fine-grained policies at the file level, users can now configure policies at the directory level to streamline both access to data and performance of workloads. These include defining behaviors for individual datasets for core functions such as writing data or syncing data with the storage systems under Alluxio.

The new data service enables highly efficient data movement, including across cloud stores like AWS S3 and Google GCS, making expensive object storage operations seamless to the compute framework.

Support for Hyper-Scale Data Workloads

2.0 introduces a new option for tiered metadata management to support single-cluster deployments with more than a billion files. RocksDB is now the default metadata store, providing off-heap storage: metadata for hot data continues to be stored on-heap in the master process memory, while the rest is managed by Alluxio outside the process memory. The alluxio.master.metastore property can be configured to switch back to heap-only storage.
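As a sketch, the metadata store can be selected in alluxio-site.properties (the directory path below is illustrative):

```properties
# Use RocksDB (the 2.0 default) for tiered, off-heap metadata storage
alluxio.master.metastore=ROCKS
# Local directory where RocksDB keeps its files (illustrative path)
alluxio.master.metastore.dir=/opt/alluxio/metastore

# Or revert to storing all metadata on the master heap
# alluxio.master.metastore=HEAP
```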

2.0 introduces the Alluxio Job Service, a distributed, clustered service that now powers data operations such as replication, persistence, cross-storage moves, and distributed loads, enabling high performance at massive scale. Take a look at all the file system APIs Alluxio supports.
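As one sketch of the job service in action, a dataset can be loaded into Alluxio storage in parallel from the CLI; the path and replication count below are illustrative, and exact flags may vary by version:

```shell
# Load a dataset into Alluxio storage via the distributed job service
# (path and flag values are illustrative)
./bin/alluxio fs distributedLoad --replication 2 /data/sales
```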

A new feature allows configuring a range for the number of copies of data stored in Alluxio that are automatically managed: alluxio.user.file.replication.min and alluxio.user.file.replication.max specify the range. A full list of all the user configurations can be found here.
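For example, a replication range can be set in alluxio-site.properties (the values below are illustrative):

```properties
# Keep between 1 and 3 copies of each file in Alluxio storage
alluxio.user.file.replication.min=1
alluxio.user.file.replication.max=3
```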

A new fault-tolerance and high-availability mode for file and object metadata, called the embedded journal, uses the Raft consensus algorithm and is independent of any external storage system. This is particularly helpful when abstracting object storage. Learn about configuring the embedded journal here.
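A minimal configuration sketch for a three-master quorum, assuming the property names documented for Alluxio 2.0 (hostnames and ports are illustrative):

```properties
# Use the Raft-based embedded journal instead of a UFS-backed journal
alluxio.master.journal.type=EMBEDDED
# Members of the Raft quorum (illustrative hostnames and ports)
alluxio.master.embedded.journal.addresses=master1:19200,master2:19200,master3:19200
```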

Better Storage Abstraction for Completely Independent and Elastic Compute

Explosive growth of data has led enterprises to accumulate many data silos, including multiple Hadoop clusters running different versions. Unified access across these clusters is currently very difficult. With Alluxio 2.0, users can connect multiple HDFS clusters of any version to Alluxio and unify data access across them. Find the list of supported HDFS versions here.
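As a sketch, two HDFS clusters of different versions can be mounted under one Alluxio namespace; the hostnames, mount paths, and version strings below are illustrative:

```shell
# Mount two HDFS clusters of different versions into one namespace
# (hostnames, paths, and version strings are illustrative)
./bin/alluxio fs mount --option alluxio.underfs.version=2.7 /hdfs-legacy hdfs://nn-legacy:8020/
./bin/alluxio fs mount --option alluxio.underfs.version=3.1 /hdfs-new hdfs://nn-new:8020/
```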

A new capability integrates with HDFS iNotify to pick up any data and metadata changes to files stored in Hadoop, so that applications accessing data via Alluxio proactively receive the latest updates.
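As a sketch of enabling this active synchronization on a mounted HDFS path (the path is illustrative, and the exact command may vary by version):

```shell
# Begin actively syncing metadata changes from HDFS for this subtree
# (the path is illustrative)
./bin/alluxio fs startSync /hdfs-mount/dataset
```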

Compute Optimized Data Access for Cloud Analytics

Users can now partition a single Alluxio cluster along any dimension, so that the datasets for each framework or workload aren’t contaminated by the others. The most common usage is partitioning the cluster by framework, e.g. Spark, Presto, etc. In addition, this reduces data transfer costs by constraining data to stay within a specific zone or region.

Users can now bring in data even from web-based data sources to aggregate in Alluxio for analytics. Any web location with files can simply be pointed at Alluxio and pulled in as needed, based on the query or model being run.
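As a sketch, a web location can be mounted into the Alluxio namespace like any other storage system; the URL and mount point below are illustrative:

```shell
# Mount an HTTP location into the Alluxio namespace
# (the URL and mount point are illustrative)
./bin/alluxio fs mount /web-data http://example.com/datasets/
```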

Amazon AWS Support

As users move to cloud services to deploy analytical and AI workloads, services like AWS EMR are increasingly used. Alluxio can now be seamlessly bootstrapped into an AWS EMR cluster, making it available as a data layer within EMR for the Spark, Presto, and Hive frameworks. Users now have a high-performance layer to cache data from S3 or other remote stores while also reducing the data copies maintained in EMR.
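A minimal sketch of bootstrapping Alluxio at cluster creation with the standard AWS CLI; the bootstrap script location, release label, and cluster sizing are assumptions for illustration:

```shell
# Create an EMR cluster with Alluxio installed via a bootstrap action
# (the script path, release label, and sizing are illustrative)
aws emr create-cluster \
  --release-label emr-5.25.0 \
  --applications Name=Spark Name=Presto Name=Hive \
  --instance-count 3 \
  --instance-type r4.xlarge \
  --bootstrap-actions Path=s3://<your-bucket>/alluxio-emr.sh
```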

Architectural Foundations Using Open Source

RocksDB is now used to tier the metadata of the files and objects that Alluxio manages, enabling hyper-scale deployments.

gRPC, Google’s highly efficient RPC framework, is now the core transport protocol used for communication within the cluster, as well as between the Alluxio client and master, making communications more efficient.