AI Infra Day Sessions Recap

November 16, 2023

Alluxio, the data platform company for all data-driven workloads, hosted the community event “AI Infra Day” on October 25, 2023. This virtual event brought together technology leaders working on AI infrastructure at Uber, Meta, and Intel to delve into the details of building scalable, performant, and cost-effective AI platforms.

Bin Fan, Alluxio's Chief Architect and VP of Open Source, kicked off the event with welcome remarks shedding light on the pivotal trends shaping the AI infrastructure landscape in the era of generative AI. The key takeaway from his keynote was the importance of AI and machine learning workloads and how quickly they drive infrastructure innovation. "As an engineer working on AI infrastructure, you need to catch up very closely with the hardware trend because, there's a saying that, whenever the hardware capacity or performance has a 10x, you will need to totally re-architect your software or your service," said Bin, "and there is emerging hardware technology that is improving at a constant speed."

Then we delved into a diverse range of topics, from model lifecycle management to PyTorch APIs and more. Whether you couldn’t join us live or you just want to rewatch your favorite session, we’ve compiled all of the videos and presentations from AI Infra Day in one place. Drill into the topics most relevant to you, from generative AI to model fine-tuning to Alluxio’s distributed caching features and more.

Model Lifecycle Management Quality Assurance at Uber Scale

Machine learning models power Uber’s everyday business. However, developing and deploying a model is not a one-time event but a continuous process that requires careful planning, execution, and monitoring. In this session, Sally (Mihyong) Lee, Senior Staff Engineer & TLM @ Uber, highlights Uber’s practices across the machine learning lifecycle to ensure high model quality.

Watch On-demand

Accelerate Your Model Training and Serving with Distributed Caching

In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the data access patterns and requirements of the ML pipeline and offers best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.

Watch On-demand

Composable PyTorch Distributed with PT2

In this talk, Wanchao Liang, Software Engineer on the PyTorch team at Meta, explores the technological advancements of PyTorch Distributed and dives into the details of how composing different PyTorch-native distributed training APIs makes multi-dimensional parallelism possible for training Large Language Models.

Watch On-demand

Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kubernetes

This hands-on session discusses best practices for using PyTorch and Alluxio during model training on AWS. Alluxio’s Shawn Sun (Software Engineer) and Lu Qiu (Machine Learning Engineer) provide a step-by-step demonstration of how to use Alluxio on EKS as a distributed cache to accelerate computer vision model training jobs that read datasets from S3. They also present a benchmark comparing data loading duration for Alluxio Fuse, S3FS Fuse, and S3 Boto3, in which Alluxio Fuse is 5 times faster than S3FS Fuse and more than 10 times faster than S3 Boto3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves ~5x faster training, and lowers cloud storage costs.
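To make the idea of a data loading benchmark concrete, here is a minimal sketch of how one might time sequential reads from a mounted dataset directory. It is not the benchmark from the session: the function name `time_data_loading` and the chunk size are assumptions made for illustration, and it uses only the Python standard library rather than PyTorch or Boto3.

```python
import time
from pathlib import Path


def time_data_loading(root: str, chunk_size: int = 1 << 20) -> dict:
    """Read every regular file under `root` once and report basic throughput.

    Hypothetical helper for illustration only; `root` could be any directory,
    for example a FUSE mount point backed by a cache or by object storage.
    """
    total_bytes = 0
    file_count = 0
    start = time.perf_counter()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Read in fixed-size chunks so large files are not held in memory.
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                total_bytes += len(chunk)
        file_count += 1
    elapsed = time.perf_counter() - start
    return {
        "files": file_count,
        "bytes": total_bytes,
        "seconds": elapsed,
        "mb_per_s": (total_bytes / 1e6) / elapsed if elapsed > 0 else float("inf"),
    }
```

Calling this once per mount point (the paths would depend on your own setup) and comparing the `seconds` values gives a rough data loading comparison. Repeated runs on the same machine may be served from the operating system page cache, so a fair comparison would need to account for that.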

Watch On-demand

The Generative AI Market, Intel AI Strategy and Product Update

ChatGPT and other massive models represent an amazing step forward in AI, yet they do not solve real-world business problems. In this session, Jordan Plawner, Global Director of Artificial Intelligence Product Management and Strategy at Intel, surveys how the AI ecosystem has worked non-stop over the last year to take these all-purpose, multi-task models and optimize them so organizations can use them to address domain-specific problems. He explains these new AI-for-the-real-world techniques and methods, such as fine-tuning, and how they can be applied to deliver highly performant results with state-of-the-art accuracy while remaining economical to build and deploy everywhere to enhance products and services.

Watch On-demand
