On-Demand Videos

video

AI/ML Infra Meetup | Open Source Michelangelo: Uber's Predictive to Generative end to end ML Lifecycle management platform

In this talk, Eric Wang, Senior Staff Software Engineer introduces Uber’s open-source generative end-to-end ML lifecycle management platform: Michelangelo.

Watch now

video

AI/ML Infra Meetup | Unlock the Future of Generative AI: TorchTitan's Latest Breakthroughs

In this talk, Jiani Wang, Software Engineer Meta's Pytorch Team, dives into the overview and the latest advancements in TorchTitan.

Watch now

video

AI/ML Infra Meetup | Bringing Data to GPUs Anywhere + Get Low-Latency on Object Store with Alluxio

In this talk, Bin Fan, VP of Technology at Alluxio, explores how to enable efficient data access across distributed GPU infrastructure, achieving low-latency performance for feature stores and RAG workloads.

Watch now

video

Running Presto with Alluxio on Amazon EMR

ALLUXIO COMMUNITY OFFICE HOUR

Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.

In this office hour, you will learn about:

How to set up Alluxio with the EMR stack so that Presto jobs can seamlessly read from and write to S3
Compare the performance between Presto on EMR with Presto and Alluxio on EMR
Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more

Watch now

video

Tech Talk: Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

With the advent of the public clouds and data increasingly siloed across many locations — on premises and in the public cloud — enterprises are looking for more flexibility and higher performance approaches to analyze their structured data.

Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:

The architecture of Presto, an open source distributed SQL engine
How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted

‍

Watch now

video

Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, NFS, DC OS, & Alluxio

feat. Apple Case Study using Tensorflow, NFS, DC/OS, and Alluxio

ALLUXIO ONLINE MEETUP

Data scientists or platform engineers often face the following challenge when the input data for machine learning jobs are stored in remote storage like NFS or cloud storage like S3. Making direct data access is slow, unstable and expensive; manually duplicating data to the training clusters also introduces large overhead, complicated data curation and often requires engineers to build ETL pipelines.

This talk will guide the audience on how Alluxio can greatly simplify the data preparation phase in with remote and possibly multiple data sources. We will share the lessons and benchmark from Bill Zhao an engineer led in Apple when building a Machine Learning platform using Tensorflow, NFS, DC/OS and Alluxio.

In this online meetup, you will learn about:

When Alluxio can help for machine learning platform;
How to setup and create POSIX endpoint for Alluxio service to unify the file system data access to S3, HDFS and Azure blob storage;
How to run TensorFlow to train models backed by accessing remote input data like access local file system.

Watch now

video

Community Office Hour: Hands-on with Alluxio Structured Data Management

ALLUXIO COMMUNITY OFFICE HOUR

Users deploy Alluxio in a wide range of use cases from analytics to AI platforms, for Alluxio’s unified access to data and transparent caching for acceleration. However, many frameworks are SQL engines, like Presto, Apache Spark SQL, or Apache Hive, and consume data structured as tables of rows and columns. Since Alluxio is commonly used as a filesystem of files and directories, there is a mismatch between how Alluxio exposes data (files, directories), and how SQL engines deal with data (tables, rows, columns). This gap creates various challenges and inefficiencies.

Therefore, in the Alluxio 2.1 release, we introduce Alluxio Structured Data Management, which is a new set of services that enables structured data applications to interact with data more efficiently. The new services include the catalog service and a transformation service, which all work together to bridge the gap between storage and SQL engines and enable physical data independence.

In this office hour, we introduce the concepts and components of Alluxio Structured Data Management, and go through a demo with Presto.

In this Office Hour we’ll go over:

Introduction and motivation of Alluxio Structured Data Management
Overview of the different services of Alluxio Structured Data Management in Alluxio 2.1
A demo of using Alluxio Structured Data Management with Presto

Watch now

video

Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

ALLUXIO COMMUNITY OFFICE HOUR

While adoption of the Cloud & Kubernetes has made it exceptionally easy to scale compute, the increasing spread of data across different systems and clouds has created new challenges for data engineers. Effectively accessing data from AWS S3 or on-premises HDFS becomes harder and data locality is also lost – how do you move data to compute workers efficiently, how do you unify data across multiple or remote clouds, and many more. Open source project Alluxio approaches this problem in a new way. It helps elastic compute workloads, such as Apache Spark, realize the true benefits of the cloud while bringing data locality and data accessibility to workloads orchestrated by Kubernetes.

One important performance optimization in Apache Spark is to schedule tasks on nodes with HDFS data nodes locally serving the task input data. However, more users are running Apache Spark natively on Kubernetes where HDFS is not an option. This office hour describes the concept and dataflow with respect to using the stack of Spark/Alluxio in Kubernetes with enhanced data locality even if the storage service is outside or remote.

In this Office Hour we’ll go over:

Why Spark is able to make a locality-aware schedule when working with Alluxio in K8s environment using the host network
Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using domain socket and host path volume
The roadmap to improve this Spark / Alluxio stack in the context of K8s

Watch now

video

Tech Talk: Integrating Google Cloud Dataproc with Alluxio for faster performance in the cloud

Google Cloud Dataproc is a widely used fully managed Spark and Hadoop service to run big data analytics and compute workloads in the cloud. Services like Dataproc reduce hardware spend, eliminate the need to overbuy capacity, and provide business agility. Yet users still face challenges for performance sensitive workloads or workloads running on remote data.

Alluxio is an open source cloud data orchestration platform that increases performance of analytic workloads running on Dataproc by intelligently caching data and bringing back lost data locality. Alluxio also enables users to run compute workloads against on-prem storage like Hadoop HDFS without any app changes.

Chris Crosbie and Roderick Yao from the Google Dataproc team and Dipti Borkar of Alluxio demo how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. They also show how to run Dataproc Spark against a remote HDFS cluster.

‍

Watch now

video

Tech Talk: The Path to Migrating off MapR

If you’re a MapR user, you might have concerns with your existing data stack. Whether it’s the complexity of Hadoop, financial instability and no future MapR product roadmap, or no flexibility when it comes to co-locating storage and compute, MapR may no longer be working for you.

Alluxio can help you migrate to a modern, disaggregated data stack using any object store with the similar performance of Hadoop plus significant cost savings.

Join us for this tech talk where we’ll discuss how to separate your compute and storage on-prem and architect a new data stack that makes your object store the core. We’ll show you how to offload your MapR/HDFS compute to any object store and how to run all of your existing jobs as-is on Alluxio + object store.

‍

Watch now

video

Community Office Hour: Improving Memory Utilization of Spark Jobs Using Alluxio

ALLUXIO COMMUNITY OFFICE HOUR

Apache Spark has been widely adopted for in-memory data analytics at scale, however, efficient memory utilization is a common challenge, and users will either run out of memory or experience low and unstable performance. Many Spark users may not be aware of the differences in memory utilization between caching data directly in-memory into the Spark JVM versus storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open up for discussions

In this Office Hour we’ll go over:

How to run Spark shell with Alluxio such that Spark jobs
A demo to compare the memory usage between Spark cache and using Alluxio as the external off-heap caching service
Open Session for discussion on any topics such as running Presto on Alluxio, and more

Watch now

video

Tech Talk: How the Development Bank of Singapore solves on-prem compute capacity challenges with cloud bursting

The DBS team was tasked to solve their compute capacity problem. They wanted to provide faster insights and analyze data for a range of use cases but didn’t have the ability to scale compute elastically on-prem.

One use case that challenged them was customer call analysis. With the millions of customer calls they get every year, DBS manages over 50TB of customer data and audio files. This data needed to reside on-prem for compliance reasons. With on-prem compute limitations, they looked to the public cloud to analyze this data and selected “zero-copy” bursting as the best approach.

In this tech talk, we’ll discuss why DBS turned to Alluxio’s bursting approach to help solve these challenges. Vitaliy Baklikov, SVP at DBS, will discuss:

Challenges and inefficiencies with their prior data stack
Moving to a disaggregated data stack using Alluxio
Bursting data without persisting in the cloud
An overview of Alluxio’s “zero-copy” hybrid bursting solution

‍

Watch now

video

tf.data: TensorFlow Input Pipeline

tf.data is the recommended API for creating TensorFlow input pipelines and is relied upon by countless external and internal Google users. The API enables you to build complex input pipelines from simple, reusable pieces and makes it possible to handle large amounts of data, different data formats, and perform complex transformations. In this talk, I will present an overview of the project and highlight best practices for creating performant input pipelines.

Watch now

video

Apache Iceberg – A Table Format for Huge Analytic Datasets

Apache Iceberg is a new format for tracking very large scale tables that are designed for object stores like S3. This talk will include why Netflix needed to build Iceberg, the project’s high-level design, and will highlight the details that unblock better query performance.

Watch now

video

Orchestrate a Data Symphony

In this keynote, Haoyuan will discuss the key challenges and trends impacting data engineering, and explore the concept of Data Orchestration.

‍

Watch now