What’s New In Alluxio Enterprise AI 3.2: GPU Acceleration, Python Filesystem API, Write Checkpointing and More!

Performance, cache operability, and cost efficiency are key considerations for AI platform teams supporting large-scale model training and distribution. In 2023, we launched Alluxio Enterprise AI to manage AI training and model distribution I/O across diverse environments, whether a single storage system serving diverse compute clusters or a more complex multi-cloud, multi-data-center environment.

Today, we are excited to announce the release of Alluxio Enterprise AI 3.2! Alluxio 3.2 incorporates extensive user feedback from iterations with leading companies building their own AI platforms. Highlights include significant performance enhancements with new checkpoint write support, expanded cache management options, and support for the FSSpec interface for integration with the Python ecosystem. Additionally, these advancements enable organizations to adopt Alluxio on their existing data lake as an alternative to investing in HPC storage infrastructure.

New Features And Enhancements In Version 3.2

Leverage GPUs Anywhere with 97%+ GPU Utilization

In this version, we’ve significantly improved performance, allowing users to read training datasets and checkpoints at up to 10GB/s throughput and 200K IOPS with a single Alluxio FUSE client. This is more than sufficient for most AI model training scenarios, including advanced setups with 8 A100 GPUs on a single node, eliminating concerns about I/O bottlenecks. We achieved 97%+ GPU utilization across 20 GPUs when running the MLPerf benchmark, including both 3D-Unet and BERT workloads across various NVIDIA A100 GPU configurations, showcasing scalability and efficiency in handling data-intensive workloads.

These performance advancements, combined with Alluxio’s unified namespace feature, which simplifies data access across various storage systems, allow organizations to scale their AI platforms without being constrained by data locality or GPU availability. With Alluxio, organizations can build and run AI training and serving workloads wherever GPUs are available, bridging the gap in adopting hybrid and multi-cloud infrastructures.

We’ve put together a simple tool to find your GPU utilization rate in a few clicks: https://www.alluxio.io/gpu-test-tool/

New Checkpoint Write Support

We’ve implemented enhancements for handling large checkpoints. Users can now quickly write checkpoints to local disk, and Alluxio subsequently uploads them to the cold persistent layer. This eliminates the need to wait for checkpoints to be written back to a slow persistent layer, thereby preventing GPU idle time. This is particularly beneficial for large language models and recommendation systems.
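The pattern from the training job's perspective can be sketched as a fast local write; the mount path below is an illustrative assumption, and the asynchronous upload to the persistent store is handled by Alluxio, not by this code. A real PyTorch job would typically wrap torch.save in the same way:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, checkpoint_dir, step):
    """Write a checkpoint to a local path. When checkpoint_dir is an
    Alluxio FUSE mount (e.g. /mnt/alluxio/checkpoints -- an assumed path),
    the write completes at local-disk speed and Alluxio uploads the file
    to the cold persistent layer afterwards."""
    path = os.path.join(checkpoint_dir, f"ckpt_{step:06d}.pkl")
    # Write to a temp file first, then atomically rename, so a partially
    # written checkpoint is never visible under the final name.
    fd, tmp = tempfile.mkstemp(dir=checkpoint_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)
    return path
```

Training resumes as soon as the local write returns, rather than blocking on the remote store.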

Effective Cache Administration 

Cache management: Having precise control over cache utilization is critical to maximizing efficiency across varying workloads. The following commands and configurations provide the flexibility needed to adapt to different scenarios:

Cache preloading: Prepopulate the cache to avoid cold reads before starting a workload

Cache eviction: Passive eviction policies and manually triggered commands to make space for new data to be cached

Cache filtering: Set rules based on file paths to determine if data should be permanently cached, never cached, or cached with an expiry time. 

Manage cache via REST API: 

Our enriched management REST API now allows for easy lifecycle management of cache space. In this release, you can integrate Alluxio with your control plane via REST API to issue commands to preload data, free data from cache, or set eviction configurations.
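The operations above can be driven from a control plane over HTTP. As an illustration only (the endpoint routes, port, and parameters below are assumptions, not the documented API; consult the Alluxio REST API reference for the actual routes), a thin client might look like:

```python
import json
from urllib.request import Request

class CacheAdminClient:
    """Minimal sketch of a REST client for cache lifecycle operations.
    All routes and payloads here are illustrative assumptions."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _request(self, method, route, payload=None):
        # Build a JSON request; the caller would send it with urlopen().
        data = json.dumps(payload).encode() if payload is not None else None
        return Request(self.base_url + route, data=data, method=method,
                       headers={"Content-Type": "application/json"})

    def preload(self, path):
        # Prepopulate the cache to avoid cold reads before a workload starts.
        return self._request("POST", "/cache/load", {"path": path})

    def free(self, path):
        # Manually evict cached data to make space for new data.
        return self._request("POST", "/cache/free", {"path": path})

    def set_filter(self, path, policy):
        # Path-based rule: e.g. pin permanently, never cache, or expire (TTL).
        return self._request("PUT", "/cache/filter",
                             {"path": path, "policy": policy})
```

Wiring such a client into an orchestration pipeline lets the control plane warm the cache before a training run and free it afterwards.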

Kubernetes management enhancement: Support for rolling upgrades and autoscaling of the Alluxio cluster minimizes downtime for workflows while updating the cluster. If a client is temporarily unable to communicate with the cluster during the update process, it falls back to the UFS (Under File System) to retrieve data directly, preventing application failures due to I/O errors.

Introducing Alluxio FSSpec Python Filesystem Interface

The Alluxio FSSpec Python API (alluxiofs), an implementation of Filesystem Spec (FSSpec), allows applications to interact with various storage backends through a unified Python filesystem interface. Python applications can easily adopt Alluxio Enterprise AI with this new API, simplifying integration and enhancing compatibility. The interface lets popular Python-based compute frameworks, such as Ray, effortlessly integrate Alluxio to access both local and remote storage systems. This is particularly beneficial for data-intensive applications and AI training workloads where large datasets need quick and repeated access. The addition of the FSSpec interface extends Alluxio’s integration with the Python ecosystem.
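The value of the FSSpec contract is that any filesystem exposing methods like ls, open, and cat plugs into Python data tooling unchanged. The stdlib-only stand-in below illustrates that interface shape; real code would instead obtain an Alluxio-backed filesystem via fsspec and the alluxiofs package (the constructor arguments in the comment are assumptions, so check the alluxiofs documentation):

```python
import io

# In real use (assumed invocation, verify against the alluxiofs docs):
#   import fsspec
#   fs = fsspec.filesystem("alluxiofs", target_protocol="s3")
# The stand-in below only illustrates the FSSpec-style surface that
# frameworks such as Ray program against.
class InMemoryFS:
    def __init__(self):
        self._files = {}

    def pipe(self, path, data):
        # Write raw bytes to a path (mirrors fsspec's pipe()).
        self._files[path] = data

    def open(self, path, mode="rb"):
        # Return a file-like object for reading.
        if "r" in mode:
            return io.BytesIO(self._files[path])
        raise NotImplementedError("sketch supports reads only")

    def cat(self, path):
        # Fetch the full contents of a path as bytes.
        return self._files[path]

    def ls(self, prefix):
        # List paths under a prefix.
        return sorted(p for p in self._files if p.startswith(prefix))
```

Because the interface is uniform, swapping this stand-in for an Alluxio-backed filesystem requires no changes in the consuming code.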

Leverage Existing Data Lake Over Investing In HPC Storage Infrastructure

In this release, we tested Alluxio against market alternatives for HPC storage infrastructure using the MLPerf performance benchmarks, and the results show that Alluxio provides comparable end-to-end performance. With infrastructure costs in mind, platform teams can leverage Alluxio with existing data lake resources rather than investing in additional HPC storage infrastructure.

Watch this video to see what’s new in Alluxio Enterprise AI with a live demo:

Try Alluxio Enterprise AI Today

With the Alluxio Enterprise AI 3.2 release, we have significantly improved performance, cost-efficiency, ease of use, and cache management capabilities.

We invite you to learn more and try it today: