White Papers

Optimizing I/O for AI Workloads in Geo-Distributed GPU Clusters

Are you struggling with slow data access, managing AI infrastructure at scale, low GPU utilization, or budget constraints in Geo-distributed GPU clusters?

In this insightful white paper, we’ll discuss common causes of slow AI workloads and low GPU utilization, how to diagnose the root cause, and offer solutions to the most common root cause of underutilized GPUs. You'll learn:

Challenges introduced by multi-GPU cluster architecture and the metrics they impact
Diagnosing and common causes of low GPU utilization
How to optimize data loading in order to address the I/O bottlenecks
How Alluxio Distributed Cache solves data loading performance bottlenecks and enables full utilization of GPU resources
A case study of a global e-commerce giant, showcasing how Alluxio accelerates slow and unstable AI/ML training workloads with 20% improvment in GPU utilization and 50% cloud cost reduction

‍

Alluxio Datasheet

AI Platform and Data Infrastructure teams rely on Alluxio Data Acceleration Platform to boost the performance of data-intensive AI workloads, empower ML engineers to build models faster, and lower infrastructure costs.

With high-performance distributed cache architecture as its core, the Alluxio Data Acceleration Platform decouples storage capacity from storage performance, enabling you to more efficiently and cost-effectively grow storage capacity without worrying about performance.

Data Acceleration
Simplicity at Scale
Architected for AI Workload Portability
Lower Infrastructure Costs

In this datasheet, you will learn how Alluxio helps eliminate data loading bottlenecks and maximize GPU utilization for your AI workloads.

‍

Choosing the Right Architecture for Enterprise AI Workloads in Production

AI and machine learning workloads depend on accessing massive datasets to drive model development. However, when project teams attempt to transition pilots to production-level deployments, most discover their existing data architectures struggle to meet the performance demands.

This whitepaper discusses critical architectural considerations for optimizing data access and movement in enterprise-grade AI infrastructure. Discover:

Common data access bottlenecks that throttle AI project productivity as workloads scale
Why common approaches like faster storage and NAS/NFS fall short
How Alluxio serves as a performant and scalable data access layer purpose-built for ML workloads
Reference architecture on AWS and benchmarks test results

‍

Evaluating Apache Spark and Alluxio for Data Analytics

This whitepaper details how to evaluate Alluxio’s data orchestration platform as a distributed cache for Apache Spark in a public cloud or on-premises. We discuss best practices and benchmarking results with a combination of standard industry benchmarking suites, such as TPC-DS and HiBench, on cloud storage. This guide serves as a reference for reproducing similar experiments in your own environment as part of a Proof of Concept (PoC) to evaluate the use of Alluxio with Apache Spark.

Using Alluxio to Optimize and Improve Performance of Kubernetes-Based Deep Learning in the Cloud

This article presents the collaborative work of Alibaba, Alluxio, and Nanjing University in tackling the problem of Artificial Intelligence and Deep Learning model training in the cloud. We adopted a hybrid solution with a data orchestration layer that connects private data centers to cloud platforms in a containerized environment. Various performance bottlenecks are analyzed with detailed optimizations of each component in the architecture. Our goal for this article is to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment in order to advance Deep Learning model training in the cloud.

Alluxio Accelerates Deep Learning in Hybrid Cloud using Intel’s Analytics Zoo open source platform powered by oneAPI

This article describes how Alluxio can accelerate the training of deep learning models in a hybrid cloud environment when using Intel’s Analytics Zoo open source platform, powered by oneAPI. Details on the new architecture and workflow, as well as Alluxio’s performance benefits and benchmarks results will be discussed.

Get Insights Faster with Alluxio and Intel

In today’s data centers, bounded storage and compute resources on Hadoop* or Spark* nodes create challenges around data capacity, data silos, costs, performance, and efficiency. Alluxio’s data orchestration platform, combined with 2nd Gen Intel® Xeon® Scalable processors and Intel® OptaneTM persistent memory, simplifies data management and processing and significantly accelerates performance for today’s big data and AI/ML workloads.

Learn more about Alluxio and Intel’s joint solution, which allows companies to unify on-premises and cloud data silos into a single, cloud-based data layer, increasing data accessibility and elasticity while virtually eliminating the need for copies—for less complexity, lower costs, and greater speed and agility.

“Zero-Copy” Hybrid Cloud for Data Analytics – Strategy, Architecture and Benchmark Report

This whitepaper details how to leverage a public cloud, such as Amazon AWS, Google GCP, or Microsoft Azure to scale analytic workloads directly on data on-premises without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Presto and Hive with Alluxio in the public cloud using on-prem HDFS. We will also show how to set up and execute performance benchmarks in two geographically dispersed Amazon EMR clusters along with a summary of our findings.

‍

Accelerating analytics & AI in Kubernetes with Alluxio Open Source Data Orchestration

Kubernetes is widely used across enterprises to orchestrate computation. While Kubernetes helps improve flexibility and portability for computation in public/hybrid cloud environments across infrastructure providers, running data-intensive workloads on it can be challenging.

For data-driven workloads in disaggregated stacks, there’s no native data access layer within a Kubernetes cluster. For query engines and machine learning frameworks that are deployed within a Kubernetes cluster, any critical data sitting outside the cluster breaks locality. Alluxio can help.

‍

Why Data Orchestration?

Large-scale analytics and AI/ML applications require efficient data access, with data increasingly distributed across multiple data stores in private data centers and clouds. Data platform teams also need the flexibility to introduce new data sources and move to new storage options with minimal changes or downtime for their applications. This paper delves further into what’s driving the need for–and what problems are solved with—an Alluxio data orchestration layer as part of a modern data platform.

Accelerating analytics on AWS EMR & AWS S3 with Alluxio in a disaggregated data stack

The AWS EMR service has made it easy for enterprises to bring up a full-featured analytical stack in the cloud that elastically scales based on demand.

The EMR service along with S3 provides a robust yet flexible platform in the cloud with the click of a few buttons, compared to the highly complex and rigid deployment approach required for on-premise Hadoop Data platforms. However, because data on AWS is typically stored in S3, an object store, you lose some of the key benefits of compute frameworks like Apache Spark and Presto that were designed for distributed file systems like HDFS.

In this white paper, we’ll share some of the challenges that arise because of the impedance mismatch between HDFS and S3, the expectations of analytics workloads of the object store, and how Alluxio with EMR addresses them.

Designing for the cloud: What today’s Data Engineer should be considering when building their stack

Cloud has changed the dynamics of data engineering in many ways, from changing expectations of on-demand platform services to the popularity of the object store to the emergence of a flexible, separated data stack. And as a data engineer venturing into this cloudy world, the understanding of specific architectural approaches coupled with knowledge in some data stacks has proven useful.

Instead of being purely focused on data infrastructure, today’s data engineer is now a full stack engineer. Compute, containers, storage, data movement, performance, network – skills are increasingly needed across the broader stack.

This white paper attempts to discuss some design principles as well as high priority elements of the stack that a data engineer should think about.

“Zero-Copy” Hybrid Bursting with no App Changes

This whitepaper details how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Starburst Presto, Spark, and Hive with Alluxio in the public cloud using on-prem HDFS.

The paper also includes a real world case study on a leading hedge fund based in New York City, who deployed large clusters of Google Compute Engine VMs with Spark and Alluxio using on-prem HDFS as the underlying storage tier.

‍

Achieving 10x acceleration of Spark and Hive Jobs on AWS S3 with Alluxio Tiered Storage

The data engineering team at Bazaarvoice, a software-as-a-service digital marketing company based in Austin, Texas, must handle data at massive Internet-scale to serve its customers. Facing challenges with scaling their storage capacity up and provisioning hardware, they turned to Alluxio’s tiered storage system and saw 10x acceleration of their Spark and Hive jobs running on AWS S3.

In this whitepaper you’ll learn:

How to build a big data analytics platform on AWS that includes technologies like Hive, Spark, Kafka, Storm, Cassandra, and more
How to setup a Hive metastore using a storage tier for hot tables
How to leverage tiered storage for maximized read performance

‍

Effective caching of Spark Resilient Distributed Datasets (RDDs) with Alluxio

Organizations like Baidu and Barclays have deployed Alluxio with Spark in their architecture, and have achieved impressive benefits and gains. Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their blog post, they described how Alluxio improved their system architecture, and mentioned that some existing Spark jobs would slow down or would never finish because they would run out of memory. Aer using Alluxio, those jobs were able to finish, because the data could be stored in Alluxio, instead of within Spark. In this blog, we investigate how Alluxio can make Spark more effective, and discuss various ways to use Alluxio with Spark. Alluxio helps Spark perform faster, and enables multiple Spark jobs to share the same, memory-speed data. We conducted a few simple and controlled experiments with Spark and Alluxio. For these experiments we used Spark version 2.0.0, and Alluxio 1.2.0.

In this article, we show by saving RDDs in Alluxio, Alluxio can keep larger data sets in-memory for faster Spark applications, as well as enable sharing of RDDs across separate Spark applications.