New features drastically improve I/O efficiency for data loading and preprocessing stages of an AI/ML training pipeline to reduce end-to-end training time and costs
SAN MATEO, CA – November 16, 2021 – Alluxio, the developer of open source data orchestration software for large-scale workloads, today announced the immediate availability of version 2.7 of its Data Orchestration Platform. The new release delivers 5x better I/O efficiency for machine learning (ML) training at significantly lower cost by parallelizing the data loading, data preprocessing, and training pipelines. Alluxio 2.7 also provides enhanced performance insights and support for open table formats such as Apache Hudi and Iceberg, making it easier to scale access to data lakes for faster Presto- and Spark-based analytics.
Also today, in a separate announcement, Alluxio announced $50M in Series C financing: https://bit.ly/3mOuiYj
“Alluxio 2.7 further strengthens Alluxio’s position as a key component for AI, machine learning, and deep learning in the cloud,” said Haoyuan Li, Founder and CEO, Alluxio. “In an age of growing datasets and increased computing power from CPUs and GPUs, machine learning and deep learning have become popular techniques for AI. The rise of these techniques advances the state of the art for AI, but also exposes challenges in accessing data and storage systems.”
“We deployed Alluxio in a cluster of 1000 nodes to accelerate the data preprocessing of model training on our game AI platform. Alluxio has proven to be stable, scalable and manageable,” said Peng Chen, Engineer Manager in the big data team at Tencent. “As more and more big data and AI applications are containerized, Alluxio is becoming the top choice for large organizations as an intermediate layer to accelerate data analytics and model training.”
“Data teams with large-scale analytics and AI/ML computing frameworks are under increasing pressure to make a growing number of data sources more easily accessible, while also maintaining performance levels as data locality, network IO, and rising costs come into play,” said Mike Leone, Analyst, ESG. “Organizations want to use more affordable and scalable storage options like cloud object stores, but they want peace of mind knowing they don’t have to make costly application changes or experience new performance issues. Alluxio is helping organizations address these challenges by abstracting away storage details while bringing data closer to compute, especially in hybrid cloud and multi-cloud environments.”
Alluxio 2.7 Community and Enterprise Editions feature new capabilities, including:
Alluxio and NVIDIA’s DALI for ML
NVIDIA’s Data Loading Library (DALI) is a commonly used Python library that supports CPU and GPU execution for data loading and preprocessing to accelerate deep learning. With release 2.7, the Alluxio platform has been optimized to work with DALI for Python-based ML applications that include a data loading and preprocessing step as a precursor to model training and inference. By accelerating the I/O-heavy stages and allowing them to run in parallel with the compute-intensive training that follows, end-to-end training on the Alluxio data platform achieves significant performance gains over traditional solutions. Unlike other solutions that are suited only to smaller data sets, the Alluxio solution scales out.
Data Loading at Scale
At the heart of Alluxio’s value proposition are data management capabilities that complement caching and the unification of disparate data sources. As the use of Alluxio has grown for compute and storage spanning multiple geographical locations, the software continues to evolve to keep scaling, now using a new technique for batching data management jobs. Batching jobs such as data loading, which are performed by an embedded execution engine, reduces the resource requirements of the management controller and lowers the cost of provisioned infrastructure.
Ease of Use on Kubernetes
Alluxio now supports a native Container Storage Interface (CSI) driver for Kubernetes, as well as a Kubernetes operator for ML, making it easier than ever to operate ML pipelines on the Alluxio platform in containerized environments. The Alluxio volume type is now natively available in Kubernetes environments. Agility and ease of use are a constant focus in this release.
Insight Driven Dynamic Cache Sizing for Presto
An intelligent new capability, called Shadow Cache, makes striking the balance between high performance and cost easy by dynamically delivering insights to measure the impact of cache size on response times. For multi-tenant Presto environments at scale, this new feature significantly reduces the management overhead with self-managing capabilities.
“Data platform teams utilize Alluxio to streamline data pre-processing and loading phases in a world where storage is separated from ML computation,” said Adit Madan, Senior Product Manager, Alluxio. “This simplicity enables maximum utilization of GPUs with frameworks such as Spark ML, TensorFlow and PyTorch. The Alluxio solution is available on multiple cloud platforms such as AWS, GCP, and Azure Cloud, and now also on Kubernetes in private data centers or public clouds.”
Availability
Free downloads of Alluxio 2.7 open source Community Edition and of Alluxio Enterprise Edition are generally available here: https://www.alluxio.io/download/
Resources
- To learn more about the Alluxio 2.7 release, read the product blog: https://www.alluxio.io/blog/whats-new-in-alluxio-2-7/
- Overview of Alluxio Use Cases can be found here
- Watch talks from Facebook, Uber, and other users from the recent Alluxio Day
- For general information about Alluxio, visit https://www.alluxio.io.
Tweet this: @Alluxio boosts AI/ML support for its hybrid and multi-cloud Data Orchestration Platform #analytics #AI #DataOrchestration https://bit.ly/3GSwYft
About Alluxio
Alluxio, a leading provider of the high performance data platform for analytics and AI,
accelerates time-to-value of data and AI initiatives and maximizes infrastructure ROI. Uniquely
positioned at the intersection of compute and storage systems, Alluxio has a universal view of
workloads on the data platform across stages of a data pipeline. This enables Alluxio to provide
high performance data access regardless of where the data resides, simplify data engineering,
optimize GPU utilization, and reduce cloud and storage costs. With Alluxio, organizations can
achieve orders-of-magnitude faster model training and serving without the need for specialized storage,
and build AI infrastructure on existing data lakes. Backed by leading investors, Alluxio powers
technology, internet, financial services, and telecom companies, including 9 out of the top 10
internet companies globally. To learn more, visit www.alluxio.io.
Media Contact:
Beth Winkowski
Winkowski Public Relations, LLC for Alluxio
978-649-7189
beth@alluxio.com