In the rapidly evolving landscape of AI and machine learning, infrastructure teams face critical challenges in managing large-scale data. Performance bottlenecks, cost inefficiencies, and management complexity all weigh on AI platform teams supporting large-scale model training and serving.
In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.
What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- How to leverage GPUs anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
Speaker:
Bin Fan is VP of Technology and Founding Engineer at Alluxio. Before joining Alluxio, he worked at Google building next-generation storage infrastructure. Bin received his PhD in computer science from Carnegie Mellon University, where his research focused on the design and implementation of distributed systems.