Alluxio 2.0 Preview - Enabling hyper-scale cloud data workloads

We are excited to announce the availability of the Alluxio 2.0 Preview Release, the largest open source release, with the most new features and improvements, since the creation of the project. It is now available for download, and release notes are available here.

The ideation & design phase

When the core project team started to think about the next big Alluxio release many months ago, there were a few overarching goals that they wanted to achieve. While Alluxio already enabled data locality and data accessibility for many big data workloads in the cloud, there was still innovation needed in key areas.

Design a step-function change in scale – Alluxio is the data orchestration layer between compute and storage, making data mobile and more accessible across many different storage systems: HDFS, object stores and network-attached storage. Over time, the amount of metadata that Alluxio needs to manage could easily surpass that of the largest Hadoop deployments. Metadata management has long been known as a weak spot for Hadoop; Alluxio should turn it into a strength.

Power more data-driven workloads – Alluxio was created with a focus on Hadoop-based compute workloads. Over the years, however, the number and variety of data-intensive compute workloads have exploded, and the data orchestration and engineering needed to run these workloads on existing or new storage systems has been non-trivial. In particular, a lot of data engineering, including manual data movement, is needed before machine learning and deep learning training. Alluxio should greatly simplify this by giving data scientists a familiar, native API while reducing the data engineering required.

Make separation of storage & compute easier – Data silos across the enterprise are only increasing: data is spread across multiple Hadoop clusters and, increasingly, across many different object stores, some on premises and some in the public cloud. This has made it harder to disaggregate compute from data, because data locality and access are severely affected when processing moves to a different place than where the data is stored. Alluxio should continue to enable separation of compute and storage by abstracting storage while making data more accessible.

With these lofty goals in mind, the engineering and product teams designed, implemented, tested and stress tested some more, turning Alluxio 2.0 into reality.

The advancements & features

Alluxio 2.0 includes many enhancements to support the design goals of the project. All of them are open source and will be included in the Community Edition!

Support for hyper-scale data workloads

Support for more than 1 billion files – 2.0 introduces a new option for tiered metadata management to support single-cluster deployments with more than a billion files. Metadata for hot data continues to be stored on heap, in the process memory, while the rest is managed by Alluxio off heap, outside the process memory. Off-heap storage uses RocksDB and is now the default; alluxio.master.metastore can be configured to keep all metadata on heap instead.
Highly distributed data services – 2.0 introduces the Alluxio Job Service, a distributed, clustered service that now runs data operations such as replication, persistence, cross-storage move and distributed load, enabling high performance at massive scale. Take a look at all the file system APIs Alluxio supports.
Adaptive replication for increased data locality – A new feature lets you configure a range for the number of copies of data stored in Alluxio; the copies are managed automatically within that range. Use alluxio.user.file.replication.min and alluxio.user.file.replication.max to specify the range. A full list of user configuration properties can be found here.
High availability with embedded journal – The embedded journal is a new fault tolerance and high availability mode for file and object metadata. It uses the Raft consensus algorithm and is independent of any external storage system, which is particularly helpful when abstracting object storage. Learn about configuring the embedded journal here.
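
To make the tiered metadata idea from the first item above more concrete, here is a minimal Python sketch. It is a toy model, not Alluxio's implementation: the class name TieredMetadataStore, the hot_capacity parameter and the least-recently-used eviction policy are all assumptions made for illustration, and a plain dict stands in for the off-heap RocksDB store.

```python
from collections import OrderedDict


class TieredMetadataStore:
    """Toy two-tier metadata store (illustrative only, not Alluxio code).

    Hot entries live in a bounded in-memory LRU cache. Entries evicted from
    the cache are moved to a backing store; a plain dict stands in for the
    off-heap store here.
    """

    def __init__(self, hot_capacity):
        if hot_capacity < 1:
            raise ValueError("hot_capacity must be at least 1")
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # most recently used entries, "on heap"
        self.cold = {}            # stand-in for the off-heap store

    def put(self, path, meta):
        """Insert or update metadata for a path in the hot tier."""
        self.cold.pop(path, None)   # drop any stale copy in the cold tier
        self.hot[path] = meta
        self.hot.move_to_end(path)  # mark as most recently used
        self._evict_if_needed()

    def get(self, path):
        """Return metadata for a path, or None if the path is unknown."""
        if path in self.hot:
            self.hot.move_to_end(path)
            return self.hot[path]
        if path in self.cold:
            # Promote the entry to the hot tier on access.
            meta = self.cold.pop(path)
            self.hot[path] = meta
            self._evict_if_needed()
            return meta
        return None

    def _evict_if_needed(self):
        # Move least recently used entries to the cold tier until we fit.
        while len(self.hot) > self.hot_capacity:
            old_path, old_meta = self.hot.popitem(last=False)
            self.cold[old_path] = old_meta
```

With hot_capacity set to 2, putting three paths leaves the two most recent in the hot tier and moves the oldest to the cold tier; reading the oldest path again promotes it back. The point of the split is that the hot tier bounds memory use while the cold tier can grow with the number of files.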

Enabling machine learning and deep learning workloads on any storage

Machine learning and deep learning frameworks need to extract data from Hadoop and object stores, which is typically a very manual and time-consuming process.

Alluxio POSIX API – Alluxio’s FUSE feature enables a POSIX-compatible API, so that frameworks such as TensorFlow and Caffe, as well as other Python-based models, can directly access data from any storage system via Alluxio using traditional file system access. Learn more about the POSIX API.
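
In practice, "traditional file system access" means that ordinary file I/O works against the mounted path. The sketch below uses only the Python standard library; the helper names and the mount point /mnt/alluxio-fuse in the usage note are hypothetical, and the code assumes a FUSE mount is already in place.

```python
import os


def list_training_files(mount_point, dataset_dir, suffix=".csv"):
    """Return sorted paths of files under mount_point/dataset_dir with suffix.

    mount_point is wherever the FUSE file system is mounted; nothing here is
    specific to Alluxio, which is the point of a POSIX-compatible API.
    """
    root = os.path.join(mount_point, dataset_dir)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def read_first_line(path):
    """Read the first line of a text file, e.g. a CSV header."""
    with open(path, "r", encoding="utf-8") as f:
        return f.readline().rstrip("\n")
```

For example, list_training_files("/mnt/alluxio-fuse", "datasets/train") would walk the mounted directory exactly as it would a local one, and the resulting paths can be handed to any loader that accepts file names.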

Better storage abstraction for completely independent and elastic compute

Support for HDFS clusters across different versions – Explosive growth of data has led enterprises to have many data silos, including multiple Hadoop clusters running many different versions. Unified access across these clusters is currently very difficult. With Alluxio 2.0, users can connect multiple HDFS clusters of different versions to Alluxio and unify data access across them. Find the list of supported HDFS versions here.
Active sync with Hadoop – A new capability integrates with HDFS inotify to pick up data and metadata changes made to files stored in Hadoop, so that applications accessing data via Alluxio proactively receive the latest updates.
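
The general idea behind event-driven sync can be sketched as follows. This is a simplified model under stated assumptions, not the Alluxio or HDFS inotify API: the function apply_events, the event kinds ("create", "modify", "rename", "delete") and the tuple layout are all hypothetical and chosen only to show how a cached metadata view can follow changes instead of rescanning the underlying storage.

```python
def apply_events(cache, events):
    """Apply change events, in order, to a cached metadata view.

    cache:  dict mapping path -> metadata dict
    events: iterable of (kind, path, payload) tuples, where kind is one of
            "create", "modify", "rename" or "delete" (hypothetical names)
    """
    for kind, path, payload in events:
        if kind == "create":
            cache[path] = dict(payload)
        elif kind == "modify":
            # Only refresh entries we already track; unknown paths are
            # picked up later, when they are first accessed.
            if path in cache:
                cache[path].update(payload)
        elif kind == "rename":
            if path in cache:
                cache[payload["dst"]] = cache.pop(path)
        elif kind == "delete":
            cache.pop(path, None)
        else:
            raise ValueError("unknown event kind: " + repr(kind))
    return cache
```

Because each event touches only the paths it names, the cost of staying current scales with the number of changes rather than with the size of the namespace.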

The feedback

This is where you come in. With the preview now available, I sincerely hope you give Alluxio 2.0 a try and share your experiences with us. We want to hear what you are excited about, what you think could work better and what you feel we should focus on next. I personally look forward to hearing your stories. Reach out to us on Slack or by email at info@alluxio.com.