We are extremely excited to announce the release of Alluxio 2.3.0!
Alluxio 2.3.0 focuses on streamlining the user experience in hybrid cloud deployments, where Alluxio is deployed alongside compute in the cloud to access data on-premises. Features such as environment validation tools and concurrent metadata synchronization make these deployments easier to operate and keep in sync with remote data. Integrations with AWS EMR, Google Dataproc, K8s, and AWS Glue make Alluxio easy to use in a variety of cloud environments. In this article, we share some of the highlights of the release. For more, please visit our release notes page.
Downloads can be found here. Join thousands of members in our Slack channel to ask any questions and provide your feedback! Thank you to everyone who contributed to this release!
Significant Adoption of Hybrid Cloud
The trend of moving to the cloud is undeniably shaping the industry. Data analytics and machine learning workloads are no exception, but we have seen many Alluxio users prefer a hybrid cloud approach over a lift and shift. Alluxio’s ability to provide zero-copy bursting of compute to the cloud has proven invaluable for organizations beginning to leverage the cloud.
Alluxio 2.3 addresses several key usability challenges and further improves the system’s effectiveness in hybrid deployments.
One Command Deployment on AWS EMR and Google Dataproc
Try out our example hybrid cloud deployment on AWS EMR or Google Dataproc.
Deploying Alluxio for the first time should be easy, and being able to repeatably create custom deployments with Alluxio in the stack is key for deployments in the cloud. Cloud resources are often elastic or ephemeral, as opposed to the long-term maintenance model commonly used in on-premises deployments.
Alluxio artifacts have been published for integration with Terraform scripts. Experienced users can use the provided assets (see one of the tutorials above for details) as a basis for building their own Terraform deployments. Note that this is currently only available in the Enterprise Edition.
Environment Validation Tools
After deployment, connecting the in-cloud Alluxio cluster to remote data is the biggest hurdle for new Alluxio users. We’ve created a guided experience to help users through this first step.
The Alluxio Enterprise Edition has a remote connectivity page in the UI which troubleshoots and validates the entire mounting process.
Both the Community and Enterprise Editions include three new validation tools to help users troubleshoot issues in their deployments. These tools are all subcommands of the bin/alluxio command line (example invocations below):
- runHdfsMountTests: checks that the configuration is sufficient to mount the target HDFS path to Alluxio.
- runUfsIOTest: measures the read/write I/O throughput from the Alluxio cluster to the target HDFS.
- runHmsTests: validates that the given configuration is sufficient to run Hive Metastore operations.
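For example, a first check of an HDFS mount and its I/O throughput might look like the sketch below. The hdfs://namenode:9000/data path is a placeholder, and the --path flag follows the documented usage of these commands; consult each command's usage output or the docs for the full option list.

```
# Validate that the cluster configuration can mount the target HDFS path
$ ./bin/alluxio runHdfsMountTests --path hdfs://namenode:9000/data

# Measure read/write throughput between the Alluxio cluster and the target HDFS
$ ./bin/alluxio runUfsIOTest --path hdfs://namenode:9000/data

# Validate that the configuration can reach and operate against the Hive Metastore
$ ./bin/alluxio runHmsTests
```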
Concurrent Metadata Synchronization
For long-running production hybrid cloud deployments, users found it critical for the files and directories virtualized in Alluxio to be synchronized with the on-premises data in near real time. Previously, this was not feasible for namespaces with a large number of files.
In Alluxio 2.3 the new concurrent metadata synchronization algorithm provides an order of magnitude or more performance improvement, especially for large namespaces with concurrent operations.
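As a minimal sketch, one way to keep mounted paths synchronized on an ongoing basis is to set a metadata sync interval in alluxio-site.properties. The property name comes from the Alluxio configuration reference; the 30sec value below is illustrative and should be tuned to the workload.

```
# conf/alluxio-site.properties
# Re-check the under store for metadata changes at most once per interval on access.
# -1 disables syncing on access, 0 syncs on every access.
alluxio.user.file.metadata.sync.interval=30sec
```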
Alluxio Structured Data Services
Alluxio is most commonly used in OLAP big data workloads with frameworks like Presto and SparkSQL. Alluxio Structured Data Services (SDS) is the subsystem in Alluxio that enables integration with those frameworks at the structured data level, as opposed to raw files and directories. Read more about SDS here.
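For instance, attaching an existing Hive database to the Alluxio Catalog Service makes its tables visible to Presto or SparkSQL through Alluxio. A minimal sketch, where the metastore host and database name are placeholders:

```
# Attach the Hive database "default" from the given metastore to the Alluxio catalog
$ ./bin/alluxio table attachdb hive thrift://metastore-host:9083 default

# List the databases now known to the Alluxio Catalog Service
$ ./bin/alluxio table ls
```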
Alluxio 2.3 further improves the range of compatibility for SDS, especially in cloud environments.
Glue UDB Support
The Alluxio Catalog Service now supports connecting to AWS Glue as its metadata service. This enables Alluxio Structured Data Services for table metadata stored in AWS Glue, in addition to the existing support for the Hive Metastore.
ORC File Support
ORC is now a supported input type (in addition to CSV and Parquet) for transformations with the Alluxio Catalog Service.
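As an illustrative sketch, once a database containing ORC-backed tables is attached, the default transformation can be run on a table with the command below. The database and table names are placeholders; see the Catalog Service documentation for the transformation definition syntax and for checking transformation status.

```
# Run the default transformation on an ORC-backed table in an attached database
$ ./bin/alluxio table transform sales_db orders
```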
More Info
You can find more information in the 2.3.0 official release notes.
Want to hear from the core developers? Join us for a webinar on the 2.3 release!
Have questions? Come join the Community Slack Channel.
Zac, Calvin, Bin, Adit, and Alluxio Product Team