As the AI landscape rapidly evolves, advances in generative AI technologies such as ChatGPT are driving the need for robust data infrastructure tailored for large language model (LLM) training and inference in the cloud. To effectively leverage these breakthroughs, organizations must ensure low latency, high concurrency, and scalability in production environments.
In this Alluxio-hosted webinar, Shouwei presented the design and implementation of a distributed caching system that addresses the I/O challenges of LLM training and inference. He explored the unique data access patterns of these workloads and offered practical best practices for optimizing the data pipeline through distributed caching in the cloud. The session featured insights from real-world examples at Microsoft, Tencent, and Zhihu, as well as from the open-source community. Watch the recording for a deeper understanding of how to build scalable, efficient, and robust data infrastructure for LLM training and inference.
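As a rough illustration of the data-pipeline pattern discussed in the webinar, below is a minimal sketch of a training data loader that reads shards through an Alluxio FUSE mount instead of fetching each object from S3. The mount point, shard layout, and file extension are hypothetical placeholders, not a prescribed setup.

```python
# Minimal sketch: a PyTorch dataset that reads pre-tokenized shards through an
# Alluxio FUSE mount of the training bucket instead of pulling each object
# from S3. The mount point and ".bin" shard layout are hypothetical.
import os
from torch.utils.data import Dataset, DataLoader

ALLUXIO_MOUNT = "/mnt/alluxio/training-data"  # hypothetical FUSE mount of the S3 bucket

class CachedShardDataset(Dataset):
    """Reads tokenized shards from the cache mount as if they were local files."""

    def __init__(self, root: str):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(".bin")
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> bytes:
        # Hot reads are served from the distributed cache rather than S3.
        with open(self.paths[idx], "rb") as f:
            return f.read()

if __name__ == "__main__":
    loader = DataLoader(CachedShardDataset(ALLUXIO_MOUNT), batch_size=8, num_workers=4)
    for shard_batch in loader:
        pass  # decode and feed the batch to the training loop
```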
Video:
Presentation slides:
Videos
Nilesh Agarwal, Co-founder & CTO at Inferless, shares insights on accelerating LLM inference in the cloud using Alluxio, tackling key bottlenecks like slow model weight loading from S3 and lengthy container startup times. Inferless uses Alluxio as a three-tier cache system that cuts model load times by 10x.
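To make the pattern concrete, here is a minimal sketch of cold versus warm model-weight loading, assuming the weights are mirrored behind a hypothetical Alluxio FUSE mount; the bucket, key, and paths are placeholders rather than Inferless's actual configuration.

```python
# Minimal sketch: cold start pulls weights from S3, warm start reads them from
# a hypothetical Alluxio FUSE mount backed by the distributed cache.
# Bucket, key, and mount path are placeholders, not Inferless's actual layout.
import os
import boto3

S3_BUCKET = "my-model-bucket"
S3_KEY = "llama-7b/model.safetensors"
CACHE_PATH = "/mnt/alluxio/models/llama-7b/model.safetensors"  # hypothetical mount

def load_weights() -> bytes:
    if os.path.exists(CACHE_PATH):
        # Warm path: served from the cluster cache / local NVMe, no S3 round trip.
        with open(CACHE_PATH, "rb") as f:
            return f.read()
    # Cold path: fall back to a direct S3 download.
    local_copy = "/tmp/model.safetensors"
    boto3.client("s3").download_file(S3_BUCKET, S3_KEY, local_copy)
    with open(local_copy, "rb") as f:
        return f.read()
```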

In this talk, Jingwen Ouyang, Senior Product Manager at Alluxio, will share how Alluxio makes it easy to share and manage data from any storage to any compute engine in any environment, with high performance and low cost, for your model training, model inference, and model distribution workloads.

Storing data as Parquet files on cloud object storage, such as AWS S3, has become prevalent not only for large-scale data lakes but also as lightweight feature stores for training and inference, or as document stores for Retrieval-Augmented Generation (RAG). However, querying petabyte-to-exabyte-scale data lakes directly from S3 remains notoriously slow, with latencies typically ranging from hundreds of milliseconds to several seconds.
In this webinar, David Zhu, Software Engineering Manager at Alluxio, will present the results of a joint collaboration between Alluxio and a leading SaaS and data infrastructure enterprise that explored leveraging Alluxio as a high-performance caching and acceleration layer atop AWS S3 for ultra-fast querying of Parquet files at PB scale.
David will share:
- How Alluxio delivers sub-millisecond Time-to-First-Byte (TTFB) for Parquet queries, comparable to S3 Express One Zone, without requiring specialized hardware, data format changes, or data migration from your existing data lake.
- The architecture that enables Alluxio’s throughput to scale linearly with cluster size, achieving one million queries per second on a modest 50-node deployment, surpassing S3 Express single-account throughput by 50x without latency degradation.
- Specifics on how Alluxio offloads partial Parquet read operations and reduces overhead, enabling direct, ultra-low-latency point queries in hundreds of microseconds and achieving a 1,000x performance gain over traditional S3 querying methods.
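As a rough sketch of the access pattern described above, the snippet below issues a column- and row-group-pruned point query against a Parquet file through a hypothetical Alluxio mount, contrasted with reading the same file directly from S3; all paths, column names, and filter values are illustrative only.

```python
# Minimal sketch: a Parquet point query served through an Alluxio cache mount
# versus a direct S3 read. Paths, bucket names, columns, and filter values are
# placeholders, not the deployment described in the webinar.
import pyarrow.dataset as ds

# Direct S3 read: every request pays the S3 round trip.
direct = ds.dataset("s3://my-datalake/features/user_features.parquet", format="parquet")

# Cached read: the same data exposed through a hypothetical Alluxio FUSE mount.
cached = ds.dataset("/mnt/alluxio/my-datalake/features/user_features.parquet", format="parquet")

# Point query: fetch one key, touching only the needed columns and row groups;
# hot footers and pages are served from the cache tier instead of S3.
table = cached.to_table(
    columns=["user_id", "embedding"],
    filter=ds.field("user_id") == 12345,
)
print(table.num_rows)
```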
Speaker: David Zhu
David Zhu is a Software Engineering Manager at Alluxio, where he focuses on metadata management and end-to-end performance benchmarking and optimization. Prior to that, David completed his Ph.D. at UC Berkeley, focusing on distributed data management systems and operating systems for the data center. He also holds a Bachelor of Software Engineering from the University of Waterloo.