Nilesh Agarwal, Co-founder & CTO at Inferless, shares insights on accelerating LLM inference in the cloud using Alluxio, tackling key bottlenecks such as slow model weight loading from S3 and lengthy container startup times. Inferless uses Alluxio as a three-tier cache that cuts model load times by 10x.
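The mechanics behind that kind of speedup are easy to sketch. The snippet below is a minimal illustration, not Inferless's actual implementation: model weights are served from a local cache mount (memory/NVMe) and only pulled from S3 on a cold miss. The mount path, bucket, and key are hypothetical.

```python
from pathlib import Path

import boto3
import torch

# Hypothetical locations: a POSIX mount backed by the cache tiers and the
# S3 bucket of record that the cache falls back to on a miss.
CACHE_ROOT = Path("/mnt/model-cache")
BUCKET, KEY = "example-model-bucket", "llama-7b/pytorch_model.bin"

def load_weights(bucket: str, key: str) -> dict:
    cached = CACHE_ROOT / key
    if not cached.exists():                       # cold start: pull once from S3
        cached.parent.mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, str(cached))
    # Warm starts read from local memory/NVMe instead of the network,
    # which is where the order-of-magnitude load-time win comes from.
    return torch.load(cached, map_location="cpu")

state_dict = load_weights(BUCKET, KEY)
```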
In this talk, Jingwen Ouyang, Senior Product Manager at Alluxio, will share how Alluxio makes it easy to share and manage data from any storage to any compute engine in any environment, with high performance and low cost, for your model training, model inference, and model distribution workloads.
In this talk, Xu Ning from Snap provides a comprehensive overview of the unique challenges in building and scaling recommendation systems compared to LLM applications.
Join Chongxiao Cao from Uber's Michelangelo training team as he walks you through Uber's approach to optimizing LLM training and fine-tuning workflows.
In this talk, Bin Fan shares his insights on data access challenges in ML applications, with particular emphasis on how Alluxio's distributed caching helps bridge the gap between storage and compute in preprocessing, pretraining and inference.
Ready to optimize your AI infra strategy? Watch this on-demand video, where Bin Fan, VP of Technology at Alluxio, guides you through balancing cost & performance for GPU/CPU workloads.
LLM inference can be extremely resource-intensive, particularly with long contexts. In this on-demand video, Junchen Jiang, Assistant Professor at the University of Chicago, presents a 10x solution for long-context inference: an easy-to-deploy stack over multiple vLLM engines with a tailored KV-cache backend.
You won't want to miss this talk presented by Robert Nishihara, Co-Founder of Anyscale, which is packed with insights on using Ray to conquer the last-mile challenges in AI deployment.
In this talk, Zhe Zhang (NVIDIA, ex-Anyscale) introduced Ray and its applications in the LLM and multi-modal AI era. He shared his perspective on ML infrastructure, noting that it presents more unstructured challenges, and recommended using Ray and Alluxio as solutions for increasingly data-intensive multi-modal AI workloads.
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) are becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
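As a rough illustration of the data path in such a setup, the sketch below feeds a PyTorch DataLoader from a POSIX mount backed by the caching layer; the mount path is hypothetical and the experimental configuration from the talk is not reproduced here.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

# Hypothetical FUSE mount exposed by the Kubernetes-native caching layer inside
# the training pod; reads hit local NVMe first and fall through to S3 on a miss.
CACHE_MOUNT = Path("/mnt/alluxio-fuse/imagenet")

class CachedImageDataset(Dataset):
    def __init__(self, root: Path):
        self.files = sorted(root.glob("**/*.JPEG"))
        self.tf = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = Image.open(self.files[idx]).convert("RGB")
        return self.tf(img)

loader = DataLoader(
    CachedImageDataset(CACHE_MOUNT),
    batch_size=256,
    num_workers=16,        # parallel readers to keep the GPUs fed
    pin_memory=True,       # faster host-to-GPU copies
    prefetch_factor=4,
)
```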
In this talk, Sandeep Manchem discussed big data and AI, covering typical platform architectures and data challenges. We had engaging discussions about ensuring data safety and compliance in big data and AI applications.
TorchTitan is a proof-of-concept for large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
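For readers unfamiliar with the building blocks TorchTitan composes, here is a minimal sketch of native PyTorch distributed training with FSDP. It is not TorchTitan code, and the toy model stands in for a real Llama-style architecture.

```python
# Launch with, e.g.: torchrun --nproc_per_node=8 train_sketch.py
import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for us.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stack standing in for a Llama model; a real setup would wrap each
    # decoder layer via an auto-wrap policy and add activation checkpointing.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state

    optim = torch.optim.AdamW(model.parameters(), lr=3e-4)
    x = torch.randn(2, 4096, device="cuda")
    loss = model(x).float().pow(2).mean()
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```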
OpenAI’s Developer Experience Engineer, Ankit Khare, provides practical insights for AI enthusiasts on effectively customizing and leveraging LLMs in various applications through preference tuning and fine-tuning.
Speed and efficiency are two core requirements for the infrastructure underlying machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volumes grow and as large model files become more common for serving. For instance, data loading can account for nearly 80% of total model training time, leaving GPU utilization below 30%. Loading large model files for deployment to production can also be slow because of network or storage read bottlenecks. These challenges are common when pairing popular frameworks like PyTorch, Ray, or HuggingFace with cloud object storage solutions like S3 or GCS, or when downloading models from the HuggingFace model hub.
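One way to check whether a pipeline is in this regime is to time how long each step waits on the DataLoader versus how long the GPU spends computing. The sketch below assumes a standard PyTorch loop yielding (inputs, targets) batches.

```python
import time

import torch
import torch.nn.functional as F

def profile_epoch(loader, model, optimizer, device="cuda"):
    """Split wall-clock time per epoch into data-loading wait vs. GPU compute.

    If data_time dominates (e.g., ~80% of the total), the pipeline is
    I/O-bound and faster storage or caching, not more GPUs, is the fix.
    """
    data_time = compute_time = 0.0
    end = time.perf_counter()
    for inputs, targets in loader:
        start = time.perf_counter()
        data_time += start - end              # time blocked on I/O + preprocessing

        inputs, targets = inputs.to(device), targets.to(device)
        loss = F.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()              # make GPU work visible to the timer

        end = time.perf_counter()
        compute_time += end - start

    total = data_time + compute_time
    print(f"data: {data_time / total:.0%}, compute: {compute_time / total:.0%}")
```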
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
The data loading challenges hindering GPU utilization
The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT (a minimal Ray Data sketch follows this list)
Real-world examples of boosting model performance and GPU utilization through optimized data access
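As a rough sketch of that reference architecture (not the exact benchmark setup), the snippet below uses Ray Data to stream training images from an S3 prefix into a Torch loop; the bucket path is hypothetical, and a cache layer such as Alluxio can transparently front the S3 location.

```python
import ray
import torch

# Hypothetical bucket/prefix holding the training images.
DATA_URI = "s3://example-training-bucket/imagenet/train"

ray.init()

# Ray Data reads and decodes image files from S3 in parallel across the cluster.
ds = ray.data.read_images(DATA_URI, size=(224, 224), mode="RGB")

# Stream batches into the training process as Torch tensors.
for batch in ds.iter_torch_batches(batch_size=256, dtypes=torch.float32):
    images = batch["image"]            # shape (256, 224, 224, 3)
    # ... forward/backward pass with a ResNet50 would go here ...
    break
```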
From Caffe to MXNet, to PyTorch, and more, Xiande Cao, Senior Deep Learning Software Engineer Manager, will share his perspective on the evolution of deep learning frameworks.
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV caches of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly reduce prefill delay while maintaining the same generation quality.
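To make the idea concrete, here is a minimal sketch of prefix KV-cache reuse using Hugging Face Transformers. It is not the speakers' system (which targets off-GPU cache storage and optimized loading): the KV cache of a reused document chunk is computed once and passed back in for later queries, so prefill only covers the new question tokens. The model ID and document are placeholders.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prefill the reused chunk (e.g., a retrieved document) once and keep its KV cache.
# A real system would persist this cache on cheaper, slower storage and optimize
# how it is loaded back, as described in the talk.
chunk = "<contents of a frequently retrieved document>"
chunk_inputs = tok(chunk, return_tensors="pt").to(model.device)
with torch.no_grad():
    chunk_cache = model(**chunk_inputs, use_cache=True).past_key_values

def answer(question: str) -> str:
    # Feed the full prompt but reuse the chunk's KV cache; the chunk must
    # tokenize as an exact prefix of the combined prompt for this to apply.
    inputs = tok(chunk + question, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        past_key_values=copy.deepcopy(chunk_cache),  # generate mutates the cache
        max_new_tokens=64,
    )
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```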
Uber has numerous deep learning models, most of which are highly complex with many layers and a vast number of features. Understanding how these models work is challenging and demands significant resources to experiment with various training algorithms and feature sets. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also assists the operations and legal teams in explaining the reasons behind specific prediction outcomes.
In this talk, Eric Wang will discuss the methods Uber used for explaining deep learning models and how we integrated these methods into the Uber AI Michelangelo ecosystem to support offline explanations.
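Uber's in-house integration is not public, but one common technique for this kind of deep-model attribution is integrated gradients. The sketch below uses the Captum library on a toy tabular model purely as an illustration; the model and data are placeholders.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# A toy tabular model standing in for a production deep model.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

ig = IntegratedGradients(model)
features = torch.randn(8, 16)            # a batch of 8 feature vectors
baseline = torch.zeros_like(features)    # "absence of signal" reference point

# Per-feature attribution scores: how much each input feature moved the
# prediction away from the baseline prediction.
attributions, delta = ig.attribute(
    features, baselines=baseline, return_convergence_delta=True
)
print(attributions.shape)   # (8, 16): one score per example per feature
```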