Products
On-Demand Videos
video
AI/ML Infra Meetup | Open Source Michelangelo: Uber's Predictive to Generative end to end ML Lifecycle management platform

In this talk, Eric Wang, Senior Staff Software Engineer introduces Uber’s open-source generative end-to-end ML lifecycle management platform: Michelangelo.
video
AI/ML Infra Meetup | Unlock the Future of Generative AI: TorchTitan's Latest Breakthroughs

In this talk, Jiani Wang, Software Engineer Meta's Pytorch Team, dives into the overview and the latest advancements in TorchTitan.
video
AI/ML Infra Meetup | Bringing Data to GPUs Anywhere + Get Low-Latency on Object Store with Alluxio

In this talk, Bin Fan, VP of Technology at Alluxio, explores how to enable efficient data access across distributed GPU infrastructure, achieving low-latency performance for feature stores and RAG workloads.
.png)
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
video
Data Orchestration for AI, Big Data, and Cloud
The data ecosystem has heavily evolved over the past two decades. There’s been an explosion of data-driven frameworks, such as Presto, Hive, and Spark to run analytics and ETL queries and TensorFlow and PyTorch to train and serve models. On the data side, the approach to managing and storing data has evolved from HDFS to cheaper, more scalable and separated services typified by cloud stores like AWS S3. As a result, data engineering has become increasingly complex, inefficient, and hard, particularly in hybrid and cloud environments.
Haoyuan Li offers an overview of a data orchestration layer that provides a unified data access and caching layer for single cloud, hybrid, and multicloud deployments. It enables distributed compute engines like Presto, TensorFlow, and PyTorch to transparently access data from various storage systems (including S3, HDFS, and Azure) while actively leveraging an in-memory cache to accelerate data access.
Hybrid Multi-Cloud
Data Platform Modernization
Large Scale Analytics Acceleration
Model Training Acceleration
Cloud Cost Savings
video
Community Office Hour: Accelerating Hive with Alluxio on S3
Many organizations are leveraging Hive to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.
In this Office Hour we’ll go over:
- Bazaarvoice’s use case leveraging Apache Spark, Hive, and Alluxio on S3
- How to set up Hive with Alluxio such that Hive jobs can seamlessly read from and write to S3
- Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more
Hybrid Multi-Cloud
Large Scale Analytics Acceleration
video
Online Meetup: Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
ING Bank is a multinational financial services company headquartered in Amsterdam with over $1 trillion in assets. As a leading bank, we place a great emphasis on cybersecurity. One aspect of this is the Security incident and event management (SIEM), which is the process of identifying, monitoring, recording and analyzing security events or incidents within a real-time IT environment. SIEM requires our data platform to have high and consistent performance, so we use open source technologies Presto and Alluxio for fast SQL analytics in the cloud.
In this online presentation, we are going to present how ING is leveraging Presto (interactive query), Alluxio (data orchestration & acceleration), S3 (massive storage), and DC/OS (container orchestration) to build and operate our modern Security Analytics & Machine Learning platform. We will share the challenges we encountered and how we solved them. Today we run this platform in several different data centers, and we have reduced our 10+ minutes queries to under 10 seconds!
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
video
Tech Talk: Accelerating analytics with EMR on your S3 data lake
EMR has become a widely used service to run big data analytics in the public cloud. But issues around slow/inconsistent EMR performance due to S3 data lakes creates challenges for organizations.
Alluxio is a data orchestration layer for the cloud that increases performance of analytic workloads running on AWS EMR using S3 as the storage.
Join us for this webinar where we will show you how to set up EMR Spark and Hive with Alluxio so jobs can seamlessly read from and write to your S3 data lake. You’ll see the performance gains with Alluxio in your EMR/S3 stack.
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
video
Community Office Hour: Building a Cloud Native Stack with EMR Spark, Alluxio, and S3
Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.
In this Office Hour we go over:
- How to set up EMR Spark with Alluxio such that Spark jobs can seamlessly read from and write to S3
- Compare the performance between Spark on S3 with Spark and Alluxio on S3
- Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
video
Tech Talk: Accelerating Spark with Kubernetes
Kubernetes is widely used across enterprises to orchestrate computation. And while Kubernetes helps improve flexibility and portability for computation in public/hybrid cloud environments across infrastructure providers, running data-intensive workloads can be challenging.
When it comes to efficiently moving data closer to Spark or Presto frameworks, co-locating data with these frameworks and accessing data from multiple or remote clouds is hard to do. That’s where Alluxio, an open source data orchestration platform, can help.
Alluxio enables data locality with your Spark and Presto workloads for faster performance and better data accessibility in Kubernetes. It also provides portability across storage providers.
In this on demand tech talk we’ll give a quick overview of Alluxio and the use cases it powers for Spark/Presto in Kubernetes. We’ll show you how to set up Alluxio and Spark/Presto to run in Kubernetes as well.
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
video
Tech Talk: Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 is the most ambitious platform upgrade since the inception of Alluxio with greatly expanded capabilities to empower users to run analytics and AI workloads on private, public or hybrid cloud infrastructures leveraging valuable data wherever it might be stored.
This release, now available for download, includes many advancements that will allow users to push the limits of their data-workloads in the cloud.
In this tech talk, we will introduce the key new features and enhancements such as:
- Support for hyper-scale data workloads with tiered metadata storage, distributed cluster services, and adaptive replication for increased data locality
- Machine learning and deep learning workloads on any storage with the improved POSIX API
- Better storage abstraction with support for HDFS clusters across different versions & active sync with Hadoop
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
Data Platform Modernization
video
Bay Area Meetup: Alluxio 2.0 Deep Dive and Near Real-time Analytics with Spark
We are excited to present Alluxio 2.0 to our community. The goal of Alluxio 2.0 was to significantly enhance data accessibility with improved APIs, expand use cases supported to include active workloads as well as better metadata management and availability to support hyperscale deployments. Alluxio 2.0 Preview Release is the first major milestone on this path to Alluxio 2.0 and includes many new features.
In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:
– New off-Heap metadata storage leveraging embedded RocksDB to scale up Alluxio to handle a billion files;
– Improved Alluxio POSIX API to support legacy and machine-learning workloads;
– A fully contained, distributed embedded journal system based on RAFT consensus algorithm in high availability mode;
– A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross mount move/copy and distributed loading;
– Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
Active file system sync between Alluxio and HDFS as under storage.
Alluxio 2.0 Preview Release Deep Dive
We are excited to present Alluxio 2.0 to our community. The goal of Alluxio 2.0 was to significantly enhance data accessibility with improved APIs, expand use cases supported to include active workloads as well as better metadata management and availability to support hyperscale deployments. Alluxio 2.0 Preview Release is the first major milestone on this path to Alluxio 2.0 and includes many new features.
In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:
– New off-Heap metadata storage leveraging embedded RocksDB to scale up Alluxio to handle a billion files;
– Improved Alluxio POSIX API to support legacy and machine-learning workloads;
– A fully contained, distributed embedded journal system based on RAFT consensus algorithm in high availability mode;
– A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross mount move/copy and distributed loading;
– Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
Active file system sync between Alluxio and HDFS as under storage.
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
video
Democratizing Data Orchestration | Haoyuan (H.Y.) Li – Alluxio
TFiR – Open Source & Emerging Technologies
In this interview we spoke to Haoyuan (H.Y.) Li, Founder, Chairman and CTO of Open Source Alluxio, a company that is democratizing data in the cloud.
Hybrid Multi-Cloud
Data Platform Modernization
video
Tech Talk: Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
The ever increasing challenge to process and extract value from exploding data with AI and analytics workloads makes a memory centric architecture with disaggregated storage and compute more attractive. This decoupled architecture enables users to innovate faster and scale on-demand. Enterprises are also increasingly looking towards object stores to power their big data & machine learning workloads in a cost-effective way. However, object stores don’t provide big data compatible APIs as well as the required performance.
In this webinar, the Intel and Alluxio teams will present a proposed reference architecture using Alluxio as the in-memory accelerator for object stores to enable modern analytical workloads such as Spark, Presto, Tensorflow, and Hive. We will also present a technical overview of Alluxio.
Interested in learning more?
Large Scale Analytics Acceleration
Hybrid Multi-Cloud
Data Platform Modernization
video
Tech Talk: Accelerate Spark Workloads on S3
While running analytics workloads using EMR Spark on S3 is a common deployment today, many organizations face issues in performance and consistency. EMR can be bottlenecked when reading large amounts of data from S3, and sharing data across multiple stages of a pipeline can be difficult as S3 is eventually consistent for read-your-own-write scenarios.
A simple solution is to run Spark on Alluxio as a distributed cache for S3. Alluxio stores data in memory close to Spark, providing high performance, in addition to providing data accessibility and abstraction for deployments in both public and hybrid clouds.
In this webinar you’ll learn how to:
- Increase performance by setting up Alluxio so Spark can seamlessly read from and write to S3
- Use Alluxio as the input/output for Spark applications
- Save and load Spark RDDs and Dataframes with Alluxio
Large Scale Analytics Acceleration
video
O’Reilly AI Conference Keynote: Data Orchestration for AI, Big Data, and Cloud
Alluxio Open Source creator Haoyuan Li‘s keynote at O’Reilly Artificial Intelligence Conference discusses data revolution trend, the inevitable journey of data silos, and the missing piece of the data world – Data Orchestration System!
The data ecosystem has heavily evolved over the past two decades. There’s been an explosion of data-driven frameworks, such as Presto, Hive, and Spark to run analytics and ETL queries and TensorFlow and PyTorch to train and serve models. On the data side, the approach to managing and storing data has evolved from HDFS to cheaper, more scalable and separated services typified by cloud stores like AWS S3. As a result, data engineering has become increasingly complex, inefficient, and hard, particularly in hybrid and cloud environments.
Haoyuan Li offers an overview of a data orchestration layer that provides a unified data access and caching layer for single cloud, hybrid, and multicloud deployments. It enables distributed compute engines like Presto, TensorFlow, and PyTorch to transparently access data from various storage systems (including S3, HDFS, and Azure) while actively leveraging an in-memory cache to accelerate data access.
Hybrid Multi-Cloud
Data Platform Modernization
Data Migration
Large Scale Analytics Acceleration
Model Training Acceleration