Alluxio AI Infra Day 2024

AI Infra Day | The AI Infra in the Generative AI Era

AI Infra Day | Accelerate Your Model Training and Serving with Distributed Caching

AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale

AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta

AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Update

AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kubernetes


White Paper
Optimizing I/O for AI Workloads in Geo-Distributed GPU Clusters
Building reliable, high-performance AI/ML infrastructure can be challenging, especially with a constrained budget in a multi-GPU world: infrastructure teams have to leverage GPUs wherever they are available. This requires moving data across regions and clouds, which makes remote data access slow, complex, and expensive. This white paper introduces the common causes of slow AI workloads and low GPU utilization, explains how to diagnose the root cause, and offers solutions to the most common cause of underutilized GPUs.
GPU Acceleration
Cloud Cost Savings
Hybrid Multi-Cloud
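As a rough illustration of the kind of diagnosis this white paper covers, the sketch below separates the time a training step spends waiting on data from the time it spends on GPU compute; the model, loader, and device here are placeholders for the example, not details from the paper.

```python
# Minimal sketch (not from the white paper): measure how long each training
# step waits on the DataLoader versus how long it spends on GPU compute.
import time
import torch
import torch.nn.functional as F

def profile_epoch(model, loader, device="cuda"):
    data_time = compute_time = 0.0
    end = time.perf_counter()
    for batch, labels in loader:
        data_time += time.perf_counter() - end      # time spent waiting on I/O
        start = time.perf_counter()
        batch, labels = batch.to(device), labels.to(device)
        loss = F.cross_entropy(model(batch), labels)
        loss.backward()
        torch.cuda.synchronize()                    # include GPU kernel time
        compute_time += time.perf_counter() - start
        end = time.perf_counter()
    # If data wait dominates, the GPUs are being starved by slow remote reads.
    print(f"data wait: {data_time:.1f}s  compute: {compute_time:.1f}s")
```

If the data-wait share dominates, remote I/O rather than compute is the bottleneck, which is the situation the paper's caching-based remedies target.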


White Paper
Meet in the Middle for a 1,000x Performance Boost Querying Parquet Files on Petabyte-Scale Data Lakes
This white paper describes how to leverage Alluxio as a high-performance caching and acceleration layer atop hyperscale data lakes for querying Parquet files. Without specialized hardware, changes to data formats or object addressing schemes, or migrating data out of the data lake, Alluxio delivers sub-millisecond Time-to-First-Byte (TTFB) performance comparable to AWS S3 Express One Zone. Furthermore, Alluxio’s throughput scales linearly with cluster size: a modest 50-node deployment can achieve one million queries per second, surpassing the single-account throughput of S3 Express by 50× without latency degradation.
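As a hedged illustration of the access pattern described above, the sketch below reads a Parquet file from a data lake path exposed through an Alluxio POSIX (FUSE) mount; the mount point and dataset path are assumptions made for the example, not details from the paper.

```python
# Minimal sketch: query a cached Parquet file through an Alluxio FUSE mount.
# The mount point (/mnt/alluxio) and file path are hypothetical.
import pyarrow.parquet as pq

table = pq.read_table(
    "/mnt/alluxio/datalake/events/part-0000.parquet",
    columns=["user_id", "event_type"],  # column pruning keeps the read small
)
print(table.num_rows)
```

Because the application sees an ordinary file path, no data format or object addressing changes are needed, which matches the paper's premise.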


Blog
How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training
In a recent Alluxio-hosted virtual tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, presented "How Coupang Leverages Distributed Cache to Accelerate ML Model Training." This blog post summarizes key insights from the presentation on Coupang's approach to distributed caching and how it has transformed their multi-region, hybrid-cloud machine learning platform.
GPU Acceleration
Hybrid Multi-Cloud
Model Training Acceleration
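For a concrete picture of the pattern discussed in this post, here is a minimal, hypothetical sketch of a PyTorch dataset reading training samples through a cache-backed mount point; the paths and file format are assumptions for illustration, not Coupang's actual setup.

```python
# Minimal sketch (hypothetical paths): training data is read through a
# cache-backed POSIX mount instead of directly from remote object storage,
# so repeat epochs are served from the distributed cache.
import glob
import torch
from torch.utils.data import Dataset, DataLoader

class CachedTensorDataset(Dataset):
    def __init__(self, root="/mnt/cache/training/samples"):
        self.paths = sorted(glob.glob(f"{root}/*.pt"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = torch.load(self.paths[idx])  # served from cache on re-reads
        return sample["features"], sample["label"]

loader = DataLoader(CachedTensorDataset(), batch_size=64, num_workers=8)
```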



On Demand Videos
Tech Talk: How Coupang Leverages Distributed Cache to Accelerate ML Model Training
In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, shares best practices for leveraging distributed caching to power search and recommendation model training infrastructure.
Model Training Acceleration


Blog
Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale
Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared how they evolved their data platform and analytics architecture to power analytics through a generative AI interface. In their Medium post, "Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale," Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) while scaling an analytics solution that collects and reports on terabytes of telemetry data per day as part of the Uptycs Cloud-Native Application Protection Platform (CNAPP).
Large Scale Analytics Acceleration

Blog
AI/ML Infra Meetup at Uber Seattle: Tackling Scalability Challenges of AI Platforms
Co-hosted by Alluxio and the Uber AI team on March 6, 2025, at Uber's Seattle office and via Zoom, the AI/ML Infra Meetup is a community event for developers focused on building AI, ML, and data infrastructure at scale. Speakers from Uber, Snap, and Alluxio delivered talks sharing insights and real-world examples on LLM training, fine-tuning, deployment, designing scalable architectures, GPU optimization, and building recommendation systems.

Case Study
RedNote Accelerates Model Training & Distribution with Alluxio
By leveraging Alluxio Distributed Cache, RedNote eliminated the storage bottlenecks that caused model training times to exceed SLAs, accelerated cross-cloud model distribution, and lowered model distribution costs.
GPU Acceleration
Model Training Acceleration
Model Distribution
Cloud Cost Savings


Case Study
Search and Recommendation AI Model Training Acceleration for Top 10 Global E-commerce Giant
A publicly traded, top 10 global e-commerce company leverages Alluxio Enterprise AI to accelerate training of its search and recommendation AI models and cut AWS S3 API and egress charges by over 50%.
Model Training Acceleration
Cloud Cost Savings

