This whitepaper explains how to speed up end-to-end distributed training in the cloud by using Alluxio to accelerate data access. With Alluxio, loading data from cloud storage, training, and caching data are carried out transparently and in a distributed way as part of the training process. The whitepaper also demonstrates how to set up and benchmark the end-to-end performance of the training process, and compares the results with other popular approaches.
Many companies, including Facebook, TikTok, Electronic Arts, Walmart, Tencent, and Comcast, have used Alluxio to enhance their existing Presto platforms, and they have gained significant benefits from integrating Alluxio into their Presto stack.
Alluxio started as a virtual distributed file system, a research project at the AMPLab at UC Berkeley. Its creators foresaw the need for agility when accessing large data stores that are separated from compute engines such as Hadoop or Spark.
Several years and over a thousand contributors later, Alluxio has grown into the industry's leading data orchestration platform for analytics and AI/ML. As with any new technology, however, the best way to use it depends on your data environment, computational workloads, challenges, and goals.
Applications such as TensorFlow and PyTorch can access data through the Alluxio FUSE service without any code changes, using the standard Unix/Linux POSIX API just as they would access a local file system. This article describes the design and implementation of the Alluxio FUSE service, its current status, and future plans.
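To illustrate what accessing data "without any code changes" looks like in practice, here is a minimal sketch that reads every file under a directory using only Python's standard file API. It assumes the FUSE service has already been mounted; the mount point `/mnt/alluxio-fuse/training-data` and the helper names `iter_samples` and `total_bytes` are hypothetical and chosen only for illustration. The actual mount path depends on how the FUSE service is deployed.

```python
import os

# Hypothetical mount point: the real path depends on how the Alluxio FUSE
# service is configured on the training host.
DATA_ROOT = "/mnt/alluxio-fuse/training-data"


def iter_samples(root: str):
    """Yield (path, contents) for every regular file under root.

    Uses only os.walk and open(), i.e. ordinary POSIX-style file access;
    nothing here is specific to Alluxio.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                yield path, f.read()


def total_bytes(root: str) -> int:
    """Sum the sizes of all files under root, read through the normal file API."""
    return sum(len(data) for _path, data in iter_samples(root))


if __name__ == "__main__":
    # Only runs against the hypothetical mount if it actually exists.
    if os.path.isdir(DATA_ROOT):
        print(total_bytes(DATA_ROOT))
```

Because the code only calls `os.walk` and `open`, the same functions work unchanged against a local directory or a FUSE-mounted one; no Alluxio client library is imported.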
This whitepaper details how to evaluate Alluxio's data orchestration platform as a distributed cache for Apache Spark in a public cloud or on-premises. We discuss best practices and present benchmark results from industry-standard benchmark suites, such as TPC-DS and HiBench, running against cloud storage.
This article presents the collaborative work of Alibaba, Alluxio, and Nanjing University on the challenges of artificial intelligence and deep learning model training in the cloud. We adopted a hybrid solution with a data orchestration layer that connects private data centers to cloud platforms in a containerized environment. We analyze various performance bottlenecks and describe detailed optimizations for each component of the architecture.
This article describes how Alluxio accelerates the training of deep learning models in a hybrid cloud environment with Intel's Analytics Zoo open source platform, powered by oneAPI. It discusses the new architecture and workflow, as well as Alluxio's performance benefits and benchmark results.