This hands-on session discusses best practices for using PyTorch and Alluxio during model training on AWS. Shawn and Lu provide a step-by-step demonstration of how to use Alluxio on EKS as a distributed cache to accelerate computer vision model training jobs that read datasets from S3. This architecture raises GPU utilization from 30% to 90%+, achieves ~5x faster training, and lowers cloud storage costs.
Shawn Sun is a Software Engineer at Alluxio. He is an open-source contributor to Alluxio and a PMC member of Fluid. He is currently working on the containerization of Alluxio, including its integration with Docker, Kubernetes, and CSI. Before joining Alluxio, he received his Master’s degree in Computer Science from Duke University.
Lu Qiu is a machine learning engineer at Alluxio and a PMC maintainer of the open-source project Alluxio. Lu develops big data solutions for AI/ML training. Previously, Lu was responsible for core Alluxio components including leader election, journal management, and metrics management. Lu received an M.S. degree in Data Science from George Washington University.