AI/ML Infra Meetup at Uber

The Community Event For Developers Building AI/ML/Data Infrastructure At Scale

Thursday, May 23, 2024 | Uber’s Sunnyvale Office & Virtual

Join leading AI/ML infrastructure experts at the AI/ML Infra Meetup hosted by Alluxio and Uber. This is a premier opportunity to engage and discuss the latest in ML pipelines, AI/ML infrastructure, LLMs, RAG, GPUs, PyTorch, HuggingFace, and more.

This meetup will take place in person at Uber’s Sunnyvale office and will be live-streamed. Experts from Uber, NVIDIA, Alluxio, and UChicago will give talks and share insights and real-world examples on optimizing data pipelines, accelerating model training and serving, designing scalable architectures, and more.

Immerse yourself in learning, networking, and conversation, and enjoy the mix-and-mingle happy hour at the end. Dinner and drinks are on us!

SPEAKERS

Qiushen Wang

@Uber

Sr Staff Software Engineer

Xiande Cao

@NVIDIA

Sr Deep Learning Software Engineer Manager

Junchen Jiang

@University of Chicago

Assistant Professor of Computer Science

Lu Qiu

@Alluxio

Tech Lead

Siyuan Sheng

@Alluxio

Sr Software Engineer

Tarik Bennett

@Alluxio

Sr Solutions Engineer

SCHEDULE-AT-A-GLANCE

Times are listed in Pacific Daylight Time (PDT). The agenda is subject to change.

4:00pm – 5:00pm Registration & Networking

Uber has numerous deep learning models, most of which are highly complex, with many layers and a vast number of features. Understanding how these models work is challenging, and experimenting with various training algorithms and feature sets demands significant resources. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also helps the operations and legal teams explain the reasons behind specific prediction outcomes.

In this talk, Eric Wang will discuss the methods Uber used for explaining deep learning models and how the team integrated these methods into the Uber AI Michelangelo ecosystem to support offline explainability.
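
As a flavor of what such explainability methods look like in practice, here is a minimal sketch using integrated gradients via the Captum library on a toy model; the model and features are placeholders, not Uber’s Michelangelo internals:

```python
# Sketch only: attribute a toy model's prediction to its input features
# with integrated gradients (Captum). Model and inputs are hypothetical.
import torch
from torch import nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).eval()
x = torch.randn(1, 8)           # one example with 8 features
baseline = torch.zeros_like(x)  # "absence of signal" reference point

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    x,
    baselines=baseline,
    target=1,                   # explain the score for class 1
    return_convergence_delta=True,
)
print(attributions)             # per-feature contribution to the class-1 score
```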

Speakers:
Eric (Qiushen) Wang is a software engineer at Uber’s Michelangelo team since 2020, focused on maintaining high ML quality across all models and pipelines. Prior to this, he contributed to Uber’s Marketplace Fares team from 2018 to 2020, developing fare systems for various services. Before that, he resided in Australia, and built a strong foundation in software engineering, working with notable companies including eBay, Qantas, and Equifax.

Speed and efficiency are two key requirements for the infrastructure underlying machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volumes grow and as large model files are increasingly used for serving. For instance, data loading can constitute nearly 80% of total model training time, resulting in less than 30% GPU utilization. Loading large model files for deployment to production can also be slow because of slow network or storage read operations. These challenges are prevalent when using popular frameworks like PyTorch, Ray, or HuggingFace paired with cloud object storage solutions like S3 or GCS, or when downloading models from the HuggingFace model hub.

In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:

  • The data loading challenges hindering GPU utilization
  • The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
  • Real-world examples of boosting model performance and GPU utilization through optimized data access
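
To make the data-loading bottleneck concrete, here is a minimal sketch that times how much of each training step is spent waiting on the DataLoader versus computing; the in-memory dataset and tiny model are placeholders, not the speakers’ ResNet50/BERT benchmark:

```python
# Sketch only: measure the data-wait vs. compute split of a training loop.
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder in-memory dataset; a real pipeline would read from S3/GCS,
# which is where the data-loading fraction typically grows.
data = TensorDataset(torch.randn(2048, 3, 32, 32), torch.randint(0, 10, (2048,)))
loader = DataLoader(data, batch_size=64)  # raise num_workers for real workloads

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

data_time = compute_time = 0.0
t0 = time.perf_counter()
for x, y in loader:
    t1 = time.perf_counter()
    data_time += t1 - t0              # time blocked waiting on data
    x, y = x.to(device), y.to(device)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if device == "cuda":
        torch.cuda.synchronize()      # make GPU work visible to the wall clock
    t0 = time.perf_counter()
    compute_time += t0 - t1           # time spent in the training step itself

total = data_time + compute_time
print(f"data loading: {100 * data_time / total:.1f}% of step time")
```
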
Speakers:
Lu Qiu is a Data & AI Platform Tech Lead at Alluxio and a PMC maintainer of the open source project Alluxio. Lu develops big data solutions for AI/ML training. Before that, Lu was responsible for core Alluxio components, including leader election, journal management, and metrics management. Lu received an M.S. in Data Science from George Washington University.

Siyuan Sheng is a senior software engineer at Alluxio. Previously, he worked as a software engineer on Rubrik’s Appflows team. Siyuan received his M.S. in Computer Science from CMU. He also loves snowboarding in his spare time.

Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can significantly reduce prefill delay while maintaining the same generation quality.
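
As a rough illustration of the idea, the sketch below uses HuggingFace transformers’ past_key_values to prefill a shared text chunk once, persist its KV cache, and reuse it for a later query; the model, file name, and text are placeholders, and the talk’s actual cache-loading optimizations are not shown:

```python
# Sketch only: reuse a precomputed KV cache for a shared prefix (e.g., a
# retrieved document chunk in RAG) so the chunk is prefilled exactly once.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# A text chunk that many requests share, e.g. a retrieved document.
chunk = "Shared context that many queries reuse. " * 40
chunk_ids = tok(chunk, return_tensors="pt").input_ids

# Prefill the chunk once and persist its KV cache; the file stands in for
# the cheaper-but-slower storage tier discussed in the talk.
with torch.no_grad():
    kv = model(chunk_ids, use_cache=True).past_key_values
torch.save(kv, "chunk_kv.pt")  # serialization details vary across versions

# Later request: load the cache and prefill only the new query tokens.
kv = torch.load("chunk_kv.pt", weights_only=False)
query_ids = tok(" Question: what does the context say?", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(query_ids, past_key_values=kv, use_cache=True)
# out.logits covers only the query tokens; the chunk was never re-prefilled.
```
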
Speakers:
Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He received his Ph.D. from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. His research interests are networked systems and their intersections with machine learning. He has received a Google Faculty Research Award, an NSF CAREER Award, and a CMU Computer Science Doctoral Dissertation Award. https://people.cs.uchicago.edu/~junchenj/

From Caffe to MXNet to PyTorch and beyond, Xiande Cao, Senior Deep Learning Software Engineer Manager at NVIDIA, will share his perspective on the evolution of deep learning frameworks.
Speakers:
Dr. Xiande (Triston) Cao is a Senior Deep Learning Software Engineer Manager at NVIDIA. He collaborates with the open-source community on deep learning and graph neural networks, leveraging the NVIDIA software stack, GPUs, and AI systems to enhance the capabilities of AI. He received his PhD in Electrical Engineering from the University of Kentucky.

6:20pm – 7:30pm Happy Hour | Food and drinks are on us!