Introducing Alluxio 2.3

July 1, 2020

We are extremely excited to announce the release of Alluxio 2.3.0!

Alluxio 2.3.0 focuses on streamlining the user experience in hybrid cloud deployments, where Alluxio is deployed with compute in the cloud to access data on-premises. Features such as environment validation tools and concurrent metadata synchronization make these deployments easier to set up and keep in sync with the on-premises data. Integrations with AWS EMR, Google Dataproc, Kubernetes, and AWS Glue make Alluxio easy to use in a variety of cloud environments. In this article, we will share some of the highlights of the release. For more, please visit our release notes page.

Downloads can be found here. Join thousands of members in our Slack channel to ask any questions and provide your feedback! Thank you to everyone who contributed to this release!

Significant Adoption of Hybrid Cloud

The move to the cloud is undeniably shaping the industry. Data analytics and machine learning workloads are no exception, but we have seen many Alluxio users prefer a hybrid cloud approach over a lift-and-shift migration. Alluxio’s ability to enable zero-copy bursting of compute to the cloud has proved invaluable in helping organizations begin leveraging the cloud.

Alluxio 2.3 addresses several key usability challenges and further improves the system’s effectiveness in hybrid deployments.

One Command Deployment on AWS EMR and Google Dataproc

Try out our example hybrid cloud deployment on AWS EMR or Google Dataproc.

Deploying Alluxio for the first time should be easy, and being able to repeatably create custom deployments with Alluxio in the stack is key in the cloud. Cloud resources are often elastic or ephemeral, in contrast to the long-term maintenance model commonly used in on-premises deployments.

Alluxio artifacts have been published for integration with Terraform scripts. Experienced users can use the assets provided (see one of the above tutorials for details) as a basis for building their own Terraform deployments. Note that this is currently only available in the Enterprise Edition.

Environment Validation Tools

After deployment, the hurdle of connecting on-cloud Alluxio to remote data is the biggest challenge for new Alluxio users. We’ve created a guided experience to help users during this first step after deployment.

The Alluxio Enterprise Edition has a remote connectivity page in the UI which troubleshoots and validates the entire mounting process.

Both the Community and Enterprise Editions have three new validation tools to help users troubleshoot issues in their deployments. These tools are all subcommands of the bin/alluxio command line:

runHdfsMountTests checks the configuration for mounting the target HDFS path to Alluxio.

runUfsIOTest measures the read/write IO throughput from the Alluxio cluster to the target HDFS.

runHmsTests validates that the given configuration is sufficient to run Hive Metastore operations.

Concurrent Metadata Synchronization

For long-running production hybrid cloud deployments, users found it critical for the files and directories virtualized in Alluxio to be synchronized with the on-premises data in near real time. Previously, this was not feasible for namespaces with a large number of files.

In Alluxio 2.3, the new concurrent metadata synchronization algorithm improves performance by an order of magnitude or more, especially for large namespaces with concurrent operations.
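To make the general idea concrete, here is a minimal Python sketch of synchronizing cached metadata against an under store by diffing directories concurrently. It is not Alluxio's actual algorithm, which this post does not describe. The in-memory dictionaries, the `diff_directory` and `sync_metadata` helpers, and the version tags are all hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def diff_directory(path, ufs, cache):
    """Compare one directory's listing in the under store against the cache.

    Both stores are hypothetical dicts: directory path -> {file name: version tag}.
    """
    remote = ufs.get(path, {})
    local = cache.get(path, {})
    # Files that are new or whose version tag differs from the cached one.
    changed = {name: tag for name, tag in remote.items() if local.get(name) != tag}
    # Files that are cached but no longer exist in the under store.
    removed = [name for name in local if name not in remote]
    return path, changed, removed


def sync_metadata(ufs, cache, workers=4):
    """Diff all directories concurrently, then apply the updates to the cache.

    Returns the number of file entries that were added, updated, or removed.
    """
    paths = sorted(set(ufs) | set(cache))
    # Read-only diff phase: each directory is compared on a worker thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: diff_directory(p, ufs, cache), paths))

    # Apply phase: single-threaded, so the cache is never mutated concurrently.
    updates = 0
    for path, changed, removed in results:
        entry = cache.setdefault(path, {})
        entry.update(changed)
        for name in removed:
            del entry[name]
        updates += len(changed) + len(removed)
        # Drop directories that no longer exist in the under store.
        if not entry and path not in ufs:
            del cache[path]
    return updates


if __name__ == "__main__":
    ufs = {"/data": {"a.parquet": 1, "b.parquet": 2}, "/logs": {"x.log": 5}}
    cache = {"/data": {"a.parquet": 1, "b.parquet": 1, "old.parquet": 3}}
    print(sync_metadata(ufs, cache), cache == ufs)
```

In this sketch the expensive part (listing and comparing directories) runs in parallel, while the cache is updated in a single pass afterwards. A real system would also have to handle operations that arrive while a sync is in flight, which this example leaves out.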

Alluxio Structured Data Services

Alluxio is most commonly used in OLAP big data workloads with frameworks like Presto and SparkSQL. Alluxio Structured Data Services (SDS) is the subsystem in Alluxio that enables integration with those frameworks at the structured data level, as opposed to raw files and directories. Read more about SDS here.

Alluxio 2.3 further improves the range of compatibility for SDS, especially in cloud environments.

Glue UDB Support

The Alluxio Catalog Service now supports connecting to AWS Glue as the metadata service. This enables Alluxio Structured Data Services for table metadata stored in AWS Glue, in addition to the existing support for the Hive Metastore.

ORC File Support

ORC is now a supported input type (in addition to CSV and Parquet) for transformations with the Alluxio Catalog Service.

More Info

You can find more information in the 2.3.0 official release notes.

Want to hear from the core developers? Join us for a webinar on the 2.3 release!

Have questions? Come join the Community Slack Channel.

Zac, Calvin, Bin, Adit, and Alluxio Product Team
