What's New in Alluxio 2.5

April 15, 2021

Adit Madan

We are thrilled to announce the release of Alluxio 2.5!

Alluxio 2.5 focuses on improving interface support to broaden the set of data-driven applications that can benefit from data orchestration. The POSIX and S3 client interfaces have improved greatly in performance and functionality, driven by widespread usage and demand from AI/ML workloads and system administration needs. Alluxio is rapidly evolving to meet the needs of enterprises that are deploying it as a key component of their AI/ML stacks.

Downloads can be found here. Join thousands of members in our Slack channel to ask any questions and provide your feedback! Thank you to everyone who contributed to this release!

Data Orchestration for AI/ML Workloads

Alluxio’s data orchestration capabilities are immensely valuable for improving the performance and data pipelining of AI/ML workloads. For example, Alibaba saw an improvement of over 40% in training time, along with lower cost, by deploying Alluxio (article).

AI/ML workloads typically run on high-spec machines with expensive GPUs, and pairing these GPUs with the appropriate I/O is critical for training efficiency and cost effectiveness. The cost of hardware combined with long training times makes acceleration a key goal for our users. By deploying Alluxio on these machines, users benefit from both distributed, high-performance storage and data management functionality. Specifically, our users see the Alluxio layer as necessary to feed growing GPU I/O demand, which is outpacing the growth of object storage and network I/O. Finally, we observed that our users were able to run Alluxio using only the underutilized resources on the GPU nodes, such as memory, disk, and CPU, resulting in no additional cost or deployment overhead.

While Alluxio fits well into the AI/ML architecture, we still needed to overcome the challenges of API compatibility. Applications like TensorFlow and PyTorch most commonly use a POSIX API rather than the HDFS-compatible API used by analytics workloads, so the Alluxio FUSE layer was a natural fit. To further improve the performance and capabilities of the interface, we implemented our own JNI FUSE layer, which replaces the legacy JNR-FUSE-based integration. JNI FUSE already solves compatibility issues and provides better latency and throughput in highly concurrent workloads, and we expect to further enhance its capabilities in upcoming Alluxio releases.
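To illustrate what POSIX access means for a training job, here is a minimal sketch of reading a dataset through a FUSE mount with nothing but standard Python file I/O. It is not taken from the Alluxio codebase: the function names, the `.bin` suffix, and the mount path in the usage note are all hypothetical, and the code works against any directory, which is exactly the point of a POSIX interface.

```python
import os
from pathlib import Path


def iter_training_files(root, suffix=".bin"):
    """Walk root top-down with ordinary POSIX calls and yield matching files.

    Files are yielded in sorted order within each directory.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield Path(dirpath) / name


def read_in_chunks(path, chunk_size=1 << 20):
    """Read one file sequentially in fixed-size chunks; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def total_bytes(root, suffix=".bin"):
    """Sum the sizes of all matching files by actually reading them."""
    return sum(read_in_chunks(p) for p in iter_training_files(root, suffix))
```

Assuming the Alluxio namespace is mounted at a path such as `/mnt/alluxio-fuse` (a placeholder, not a default taken from this post), calling `total_bytes("/mnt/alluxio-fuse/train")` would exercise the same read path a data loader uses, with no Alluxio-specific client code in the application.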

For further reading, check out this presentation by Microsoft and consider joining our special interest group, which meets weekly to discuss ongoing development.

Cloud Native Integrations

A large portion of Alluxio users deploy in the cloud, so Alluxio is committed to integrating closely with the cloud ecosystem. Alluxio 2.5 introduces improvements for all three major public cloud providers (AWS, GCP, and Azure) as well as Kubernetes, the de facto standard in container orchestration.

The latest connectors to cloud storage enable users to benefit from the recommended security models in the cloud such as AWS’s Security Token Service (STS) and GCP’s service account keys. We have also introduced native support for Azure Data Lake Storage Gen 2, which is the recommended service for building big data applications on Azure. ADLS Gen 2 provides file-level semantics and optimizations as well as security.

For further reading, check out the docs for AWS, Azure, and GCP.

More Info

Want to hear from the core developers? Join us for the live webinar on the 2.5 release!

You can find more information in the 2.5.0 official release notes. Have questions? Come join the Community Slack Channel.
