AI INFRA DAY 2023

The Community Event for Developers Building AI Infrastructure at Scale

SPEAKERS

Sally (Mihyoung) Lee @Uber

Senior Staff Engineer, TLM

Wanchao Liang @Meta

Software Engineer

Jordan Plawner @Intel

Global Director of Artificial Intelligence Product Management and Strategy

Adit Madan @Alluxio

Director of Product Management

Bin Fan @Alluxio

Chief Architect & VP of Open Source

Shawn Sun @Alluxio

Software Engineer

Tarik Bennett @Alluxio

Senior Solutions Engineer

Lu Qiu @Alluxio

Machine Learning Engineer

SCHEDULE-AT-A-GLANCE

More program details coming soon.

Times are listed in Pacific Daylight Time (PDT)

As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.
Speakers:
Bin Fan is Chief Architect and VP of Open Source at Alluxio. Prior to joining Alluxio as a founding engineer, he worked at Google building next-generation storage infrastructure. Bin received his PhD in computer science from Carnegie Mellon University, where he worked on the design and implementation of distributed systems.

Machine learning models power Uber’s everyday business. However, developing and deploying a model is not a one-time event but a continuous process that requires careful planning, execution, and monitoring. This session will highlight Uber’s practices across the machine learning lifecycle that ensure high model quality.
Speakers:
Sally (Mihyoung) Lee is a Senior Staff Engineer, TLM on the Uber AI Platform team. She is experienced in distributed systems, machine learning, and large-scale data analytics. At present, Sally is involved in various initiatives across the Uber AI Platform. She previously held roles at Yahoo. She earned her BS in Computer Engineering from Chungnam National University.

In this session, Adit will present an overview of using distributed caching to accelerate model training and serving. He will explore the requirements of data access patterns in the ML pipeline and offer practical best practices for using distributed caching in the cloud. The session will feature insights from real-world examples, such as AliPay, Zhihu, and more.
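For context on the access pattern the session addresses: training re-reads the same dataset from cloud storage epoch after epoch, which is exactly what a distributed cache short-circuits. The sketch below is only a single-node toy illustration of that idea, not material from the session; Alluxio provides this caching as a shared, cluster-wide layer, and the bucket and key names here are hypothetical.

```python
# Toy single-node illustration of caching repeated S3 reads; a distributed
# cache (e.g., Alluxio) provides the same effect as a shared, cluster-wide
# layer. Bucket and key names below are hypothetical.
import os
import boto3
from torch.utils.data import Dataset

class CachedS3Dataset(Dataset):
    def __init__(self, bucket, keys, cache_dir="/tmp/train-cache"):
        self.bucket, self.keys, self.cache_dir = bucket, keys, cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.s3 = boto3.client("s3")

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        key = self.keys[idx]
        local_path = os.path.join(self.cache_dir, key.replace("/", "_"))
        if not os.path.exists(local_path):   # cold read: fetch from S3 once
            self.s3.download_file(self.bucket, key, local_path)
        with open(local_path, "rb") as f:     # warm reads: served from cache
            return f.read()

# Hypothetical names; a real pipeline would decode and transform the bytes.
dataset = CachedS3Dataset("my-training-bucket", ["data/sample-000.jpg"])
```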
Speakers:
Adit Madan is the Director of Product Management at Alluxio. He has extensive experience in distributed systems, storage systems, and large-scale data analytics, and holds an MS from Carnegie Mellon University and a BS from the Indian Institute of Technology – Delhi. Adit is also a core maintainer and Project Management Committee (PMC) member of the Alluxio Open Source project.

Explore the technology advancements of PyTorch Distributed, and dive into the details of how multi-dimensional parallelism for training Large Language Models is made possible by composing different PyTorch-native distributed training APIs.
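As a rough flavor of what composing these APIs looks like, here is a minimal DTensor sketch, not code from the talk: a tensor is laid out over a 2-D device mesh so a data-parallel dimension and a tensor-parallel dimension can be composed. The module path reflects the experimental API around PyTorch 2.x and may differ between releases.

```python
# Minimal DTensor sketch (assumed experimental API, circa PyTorch 2.x):
# run with `torchrun --nproc_per_node=4 dtensor_sketch.py`.
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Replicate, Shard, distribute_tensor

dist.init_process_group("gloo")  # "nccl" on GPU clusters

# 2x2 device mesh: one dimension for data parallelism, one for tensor parallelism.
mesh = DeviceMesh("cpu", torch.arange(4).reshape(2, 2))

weight = torch.randn(8, 8)
# Replicate across the data-parallel dimension, shard rows across the
# tensor-parallel dimension.
dweight = distribute_tensor(weight, mesh, placements=[Replicate(), Shard(0)])
print(dweight.placements, dweight.to_local().shape)  # each rank holds a 4x8 shard

dist.destroy_process_group()
```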
Speakers:
Wanchao Liang is a Software Engineer at Meta on the PyTorch team, tech lead of PyTorch Distributed training, and author of DTensor, a fundamental distributed abstraction for performing distributed computation. He previously worked on the TorchScript compiler and ONNX.

This hands-on session will discuss best practices for using PyTorch and Alluxio during model training on AWS. Shawn and Lu will provide a step-by-step demonstration of how to use Alluxio on EKS as a distributed cache to accelerate computer vision model training jobs that read datasets from S3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves ~5x faster training, and lowers cloud storage costs.
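As a sketch of the pattern described (not the session's exact code): the S3 dataset is exposed inside the training pod through an Alluxio FUSE mount, so the PyTorch job reads it like a local folder while hot data is served from cache. The mount path, model, and hyperparameters below are placeholders.

```python
# Hypothetical training job reading an S3-backed dataset through an Alluxio
# FUSE mount on EKS; path, model, and hyperparameters are placeholders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

DATA_DIR = "/mnt/alluxio-fuse/imagenet/train"  # assumed FUSE mount of the S3 dataset

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
# Enough workers to keep the GPU fed; cached reads make this far cheaper
# than fetching every sample from S3.
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=8, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=None).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```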
Speakers:
Shawn Sun is a Software Engineer at Alluxio. He is an open-source contributor to Alluxio and a PMC member of Fluid. He is currently working on the containerization of Alluxio, including its integration with Docker, Kubernetes, and CSI. Before joining Alluxio, he received his Master’s degree in Computer Science from Duke University.

Lu Qiu is a Machine Learning Engineer at Alluxio and a PMC maintainer of the Alluxio open source project. Lu develops big data solutions for AI/ML training. Before that, Lu was responsible for core Alluxio components including leader election, journal management, and metrics management. Lu received an M.S. degree in Data Science from George Washington University.

ChatGPT and other massive models represent an amazing step forward in AI, yet on their own they do not solve real-world business problems. We will survey how the AI ecosystem has worked non-stop over the last year to take these all-purpose, multi-task models and optimize them so they can be used by organizations to address domain-specific problems. We will explain these new AI-for-the-real-world techniques and methods, such as fine-tuning, and how they can be applied to deliver results that are highly performant with state-of-the-art accuracy while also being economical to build and deploy everywhere to enhance products and services.
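For a concrete sense of what fine-tuning means in practice, here is a generic sketch of adapting a pretrained model to a domain-specific classification task with Hugging Face Transformers. It is not Intel's recipe; the base model, dataset, and hyperparameters are placeholders chosen only for illustration.

```python
# Generic fine-tuning sketch (not Intel's method); model, dataset, and
# hyperparameters are placeholders chosen only for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"   # placeholder general-purpose base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

raw = load_dataset("imdb")               # placeholder domain-specific dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=0).select(range(2000)),
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()   # fine-tuned weights are written to ./finetuned-model
```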
Speakers:
Jordan Plawner is Intel’s Global Director of Artificial Intelligence Product Management and Strategy. He is responsible for the development of Intel’s AI Developer and Product Platform. He works closely with customers and ecosystem partners to understand their AI workload and infrastructure requirements, ensuring customers can easily integrate Intel AI and accelerate their time to solution. Jordan is also a member of Intel’s AI leadership team, setting business and product strategy, advancing collaboration across business units, and communicating strategy and the customer’s AI journey to management, customers, and conference audiences. Previously, he was responsible for workload acceleration strategy and IP planning for the Intel Xeon product line and assisted in the development of Intel’s Cloud Service Provider business. Over his 25 years at Intel, Jordan has worked on data center technologies as well as server, Ethernet networking, and data storage products, and has been responsible for operationalizing complex product roadmap strategies and expanding Intel product offerings into new markets.