On-Demand Videos
video
AI/ML Infra Meetup | Open Source Michelangelo: Uber's Predictive to Generative end to end ML Lifecycle management platform

In this talk, Eric Wang, Senior Staff Software Engineer, introduces Uber’s open-source generative end-to-end ML lifecycle management platform: Michelangelo.
video
AI/ML Infra Meetup | Unlock the Future of Generative AI: TorchTitan's Latest Breakthroughs

In this talk, Jiani Wang, Software Engineer on Meta's PyTorch team, gives an overview of TorchTitan and dives into its latest advancements.
video
AI/ML Infra Meetup | Bringing Data to GPUs Anywhere + Get Low-Latency on Object Store with Alluxio

In this talk, Bin Fan, VP of Technology at Alluxio, explores how to enable efficient data access across distributed GPU infrastructure, achieving low-latency performance for feature stores and RAG workloads.
video
Running Presto with Alluxio on Amazon EMR
ALLUXIO COMMUNITY OFFICE HOUR
Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.
In this office hour, you will learn about:
- How to set up Alluxio with the EMR stack so that Presto jobs can seamlessly read from and write to S3
- How the performance of Presto on EMR compares with Presto and Alluxio on EMR
- Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more
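The caching behavior described in this abstract can be illustrated with a toy read-through cache. This is only a minimal sketch of the general idea, not Alluxio's implementation: the class `ReadThroughCache`, the function `fake_s3_get`, and the object key are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache in front of a slow object store (illustrative only)."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        # fetch_remote is a hypothetical stand-in for a remote read such as an S3 GET.
        self._fetch_remote = fetch_remote
        self._cache: Dict[str, bytes] = {}
        self.remote_reads = 0  # counts how often the remote store was actually hit

    def read(self, key: str) -> bytes:
        # On a miss, fetch from the remote store once and keep the bytes locally.
        if key not in self._cache:
            self.remote_reads += 1
            self._cache[key] = self._fetch_remote(key)
        # On a hit, serve from the local copy without touching the network.
        return self._cache[key]


def fake_s3_get(key: str) -> bytes:
    # Hypothetical remote fetch; a real one would go over the network.
    return f"contents of {key}".encode()


cache = ReadThroughCache(fake_s3_get)
first = cache.read("s3://bucket/table/part-0.parquet")   # miss: goes to the remote store
second = cache.read("s3://bucket/table/part-0.parquet")  # hit: served from the cache
```

Repeated reads of the same object reach the remote store only once, which is where the reduced network traffic and more predictable latency come from.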
Large Scale Analytics Acceleration
video
Tech Talk: Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
With the advent of public clouds and data increasingly siloed across many locations, both on premises and in the public cloud, enterprises are looking for more flexible and higher-performance approaches to analyzing their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly concurrent, low-latency analytics platform. This stack provides a strong solution for running fast SQL across multiple storage systems, including HDFS, S3, and others, in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
Large Scale Analytics Acceleration
video
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, NFS, DC/OS, & Alluxio
feat. Apple Case Study using TensorFlow, NFS, DC/OS, and Alluxio
ALLUXIO ONLINE MEETUP
Data scientists and platform engineers often face a challenge when the input data for machine learning jobs is stored in remote storage like NFS or cloud storage like S3. Accessing the data directly is slow, unstable, and expensive; manually duplicating data to the training clusters introduces large overhead, complicates data curation, and often requires engineers to build ETL pipelines.
This talk will guide the audience on how Alluxio can greatly simplify the data preparation phase when working with remote and possibly multiple data sources. We will share the lessons and benchmarks from Bill Zhao, an engineering lead at Apple, from building a machine learning platform using TensorFlow, NFS, DC/OS, and Alluxio.
In this online meetup, you will learn about:
- When Alluxio can help a machine learning platform;
- How to set up and create a POSIX endpoint for the Alluxio service to unify file system data access to S3, HDFS, and Azure Blob Storage;
- How to run TensorFlow to train models on remote input data as if accessing a local file system.
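The point of a POSIX endpoint is that training code can read remote data with ordinary file APIs. The following is a minimal sketch under stated assumptions: the helper `iter_training_files`, the `.tfrecord` suffix, and the idea of a mount such as `/mnt/alluxio-fuse` are hypothetical and not taken from the talk. The demo uses a temporary directory as a stand-in for the mount so that it runs anywhere.

```python
import os
import tempfile
from typing import Iterator, List, Tuple


def iter_training_files(mount_root: str, suffix: str = ".tfrecord") -> Iterator[Tuple[str, bytes]]:
    """Yield (path, contents) for every matching file under a POSIX mount point."""
    # Plain os.walk/open calls: nothing here knows whether the bytes live on
    # local disk, NFS, or a remote store exposed through a POSIX mount.
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    yield path, f.read()


def demo() -> List[Tuple[str, int]]:
    # A temporary directory stands in for a mount such as /mnt/alluxio-fuse (hypothetical).
    with tempfile.TemporaryDirectory() as mount_root:
        for i in range(2):
            with open(os.path.join(mount_root, f"part-{i}.tfrecord"), "wb") as f:
                f.write(b"x" * (i + 1))
        with open(os.path.join(mount_root, "README.txt"), "w") as f:
            f.write("not training data")
        # Return (file name, size in bytes) for each training file found.
        return [(os.path.basename(p), len(data)) for p, data in iter_training_files(mount_root)]


result = demo()
```

In a real setup only the `mount_root` argument would change; the reading logic stays identical to local file access.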
Model Training Acceleration
video
Community Office Hour: Hands-on with Alluxio Structured Data Management
ALLUXIO COMMUNITY OFFICE HOUR
Users deploy Alluxio in a wide range of use cases from analytics to AI platforms, for Alluxio’s unified access to data and transparent caching for acceleration. However, many frameworks are SQL engines, like Presto, Apache Spark SQL, or Apache Hive, and consume data structured as tables of rows and columns. Since Alluxio is commonly used as a filesystem of files and directories, there is a mismatch between how Alluxio exposes data (files, directories), and how SQL engines deal with data (tables, rows, columns). This gap creates various challenges and inefficiencies.
Therefore, in the Alluxio 2.1 release, we introduce Alluxio Structured Data Management, a new set of services that enables structured data applications to interact with data more efficiently. The new services include a catalog service and a transformation service, which work together to bridge the gap between storage and SQL engines and enable physical data independence.
In this office hour, we introduce the concepts and components of Alluxio Structured Data Management, and go through a demo with Presto.
In this Office Hour we’ll go over:
- Introduction and motivation of Alluxio Structured Data Management
- Overview of the different services of Alluxio Structured Data Management in Alluxio 2.1
- A demo of using Alluxio Structured Data Management with Presto
Large Scale Analytics Acceleration
video
Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio
ALLUXIO COMMUNITY OFFICE HOUR
While adoption of the cloud and Kubernetes has made it exceptionally easy to scale compute, the increasing spread of data across different systems and clouds has created new challenges for data engineers. Effectively accessing data from AWS S3 or on-premises HDFS becomes harder, and data locality is lost. How do you move data to compute workers efficiently? How do you unify data across multiple or remote clouds? The open source project Alluxio approaches this problem in a new way. It helps elastic compute workloads, such as Apache Spark, realize the true benefits of the cloud while bringing data locality and data accessibility to workloads orchestrated by Kubernetes.
One important performance optimization in Apache Spark is to schedule tasks on nodes where a local HDFS data node serves the task's input data. However, more users are running Apache Spark natively on Kubernetes, where HDFS is not an option. This office hour describes the concepts and data flow of running the Spark/Alluxio stack on Kubernetes with enhanced data locality, even when the storage service is external or remote.
In this Office Hour we’ll go over:
- Why Spark is able to schedule tasks with locality awareness when working with Alluxio in a K8s environment using the host network
- Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using a domain socket and a host path volume
- The roadmap to improve this Spark / Alluxio stack in the context of K8s
Large Scale Analytics Acceleration
video
Tech Talk: Integrating Google Cloud Dataproc with Alluxio for faster performance in the cloud
Google Cloud Dataproc is a widely used, fully managed Spark and Hadoop service for running big data analytics and compute workloads in the cloud. Services like Dataproc reduce hardware spend, eliminate the need to overbuy capacity, and provide business agility. Yet users still face challenges with performance-sensitive workloads or workloads running on remote data.
Alluxio is an open source cloud data orchestration platform that increases performance of analytic workloads running on Dataproc by intelligently caching data and bringing back lost data locality. Alluxio also enables users to run compute workloads against on-prem storage like Hadoop HDFS without any app changes.
Chris Crosbie and Roderick Yao from the Google Dataproc team and Dipti Borkar of Alluxio demo how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. They also show how to run Dataproc Spark against a remote HDFS cluster.
Large Scale Analytics Acceleration
video
Tech Talk: The Path to Migrating off MapR
If you’re a MapR user, you might have concerns with your existing data stack. Whether it’s the complexity of Hadoop, financial instability and no future MapR product roadmap, or no flexibility when it comes to co-locating storage and compute, MapR may no longer be working for you.
Alluxio can help you migrate to a modern, disaggregated data stack using any object store, with performance similar to Hadoop plus significant cost savings.
Join us for this tech talk where we’ll discuss how to separate your compute and storage on-prem and architect a new data stack that makes your object store the core. We’ll show you how to offload your MapR/HDFS compute to any object store and how to run all of your existing jobs as-is on Alluxio + object store.
Data Migration
video
Community Office Hour: Improving Memory Utilization of Spark Jobs Using Alluxio
ALLUXIO COMMUNITY OFFICE HOUR
Apache Spark has been widely adopted for in-memory data analytics at scale. However, efficient memory utilization is a common challenge, and users often either run out of memory or experience low and unstable performance. Many Spark users may not be aware of the differences in memory utilization between caching data directly in memory in the Spark JVM and storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open up the floor for discussion.
In this Office Hour we’ll go over:
- How to run Spark shell with Alluxio such that Spark jobs
- A demo to compare the memory usage between Spark cache and using Alluxio as the external off-heap caching service
- Open Session for discussion on any topics such as running Presto on Alluxio, and more
Large Scale Analytics Acceleration
video
Tech Talk: How the Development Bank of Singapore solves on-prem compute capacity challenges with cloud bursting
The DBS team was tasked with solving their compute capacity problem. They wanted to provide faster insights and analyze data for a range of use cases but didn’t have the ability to scale compute elastically on-prem.
One use case that challenged them was customer call analysis. With the millions of customer calls they get every year, DBS manages over 50TB of customer data and audio files. This data needed to reside on-prem for compliance reasons. With on-prem compute limitations, they looked to the public cloud to analyze this data and selected “zero-copy” bursting as the best approach.
In this tech talk, we’ll discuss why DBS turned to Alluxio’s bursting approach to help solve these challenges. Vitaliy Baklikov, SVP at DBS, will discuss:
- Challenges and inefficiencies with their prior data stack
- Moving to a disaggregated data stack using Alluxio
- Bursting data without persisting in the cloud
- An overview of Alluxio’s “zero-copy” hybrid bursting solution
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
Data Platform Modernization
Data Migration
video
tf.data: TensorFlow Input Pipeline
tf.data is the recommended API for creating TensorFlow input pipelines and is relied upon by countless external and internal Google users. The API enables you to build complex input pipelines from simple, reusable pieces and makes it possible to handle large amounts of data, work with different data formats, and perform complex transformations. In this talk, I will present an overview of the project and highlight best practices for creating performant input pipelines.
Model Training Acceleration
video
Apache Iceberg – A Table Format for Huge Analytic Datasets
Apache Iceberg is a new format for tracking very large-scale tables, designed for object stores like S3. This talk covers why Netflix needed to build Iceberg, the project’s high-level design, and the details that unblock better query performance.
Large Scale Analytics Acceleration
video
Orchestrate a Data Symphony
In this keynote, Haoyuan will discuss the key challenges and trends impacting data engineering, and explore the concept of Data Orchestration.
Data Platform Modernization
Hybrid Multi-Cloud