On-Demand Videos

Real-time OLAP databases are optimized for speed and often rely on tightly coupled storage-compute architectures using disks or SSDs. Decoupled architectures, which use cloud object storage, introduce an unavoidable tradeoff: cost efficiency at the expense of performance. This makes them unsuitable for databases that need to provide low-latency, real-time analytics, especially the new wave of LLM-powered dashboards, retrieval-augmented generation (RAG), and vector-embedding searches that thrive only when fresh data is milliseconds away. Can we achieve both cost efficiency and performance?
In this talk, we’ll explore the engineering challenges of extending Apache Pinot—a real-time OLAP system—onto cloud object storage while still maintaining sub-second P99 latencies.
We’ll dive into how we built an abstraction in Apache Pinot that makes it agnostic to the location of data. We’ll explain how we query data directly from cloud object storage, without first downloading entire datasets as lazy-loading approaches do, while achieving sub-second latencies. We’ll cover the data-fetch and optimization strategies we implemented, such as pipelining fetch and compute, prefetching, selective block fetches, index pinning, and more. We'll also share our latest work on integrating with open table formats like Iceberg, and how we will continue to achieve fast analytics directly on Parquet files by applying the same techniques we use for tiered storage.
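As a rough illustration of one of these strategies, here is a minimal sketch of pipelining fetch and compute, assuming hypothetical `fetch` and `compute` callables and a tunable prefetch depth; it is not Pinot's actual implementation:

```python
import concurrent.futures

def pipelined_scan(block_ids, fetch, compute, prefetch_depth=4):
    """Overlap remote block fetches with local compute: while block N is
    being processed, blocks N+1..N+prefetch_depth are already in flight."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=prefetch_depth) as pool:
        # Warm the pipeline with the first window of fetches.
        futures = [pool.submit(fetch, b) for b in block_ids[:prefetch_depth]]
        for i in range(len(block_ids)):
            block = futures[i].result()        # wait only for the head block
            nxt = i + prefetch_depth
            if nxt < len(block_ids):           # keep the window full
                futures.append(pool.submit(fetch, block_ids[nxt]))
            results.append(compute(block))     # compute overlaps later fetches
    return results
```

The point of the pattern is that object-store latency is paid once per window rather than once per block, which is what makes sub-second latencies plausible over remote storage.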

The data lake is a fantastic, low-cost place to put data at rest for offline analytics, but we've built it under the terms of a terrible bargain: all that cheap storage at scale was a great thing, but we gave up schema management and transactions along the way. Apache Iceberg has emerged as king of the Open Table Formats to fix this very problem.
Built on the foundation of Parquet files, Iceberg adds a simple yet flexible metadata layer and integration with standard data catalogs, bringing robust schema support and ACID transactions to the once-ungoverned data lake. In this talk, we'll build Iceberg up from the basics, see how the read and write paths work, and explore how it supports streaming data sources like Apache Kafka™. Then we'll see how Confluent's Tableflow brings Kafka together with open table formats like Iceberg and Delta Lake to make operational data in Kafka topics instantly visible to the data lake without the usual ETL—unifying the operational/analytical divide that has been with us for decades.
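To make the metadata layer concrete, here is a hedged sketch of the Iceberg read path using the PyIceberg client; the catalog name, table identifier, and columns are assumptions for illustration:

```python
from pyiceberg.catalog import load_catalog

# Load a catalog configured in ~/.pyiceberg.yaml; the name "default" and the
# table identifier below are assumptions for the example.
catalog = load_catalog("default")
table = catalog.load_table("analytics.page_views")

# A scan plans against the current snapshot's manifests, pruning data files by
# partition values and column statistics before reading any Parquet bytes.
arrow_table = table.scan(
    row_filter="event_date >= '2024-01-01'",
    selected_fields=("user_id", "event_date"),
).to_arrow()
print(arrow_table.num_rows)
```

Because the table resolves through catalog metadata rather than directory listings, every reader sees a consistent snapshot, which is what makes ACID semantics possible on plain Parquet files.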

Storing data as Parquet files on S3 is increasingly used not just as a data lake but also as a lightweight feature store for ML training/inference or a document store for RAG. However, querying petabyte- to exabyte-scale data lakes directly from cloud object storage remains notoriously slow (e.g., latencies ranging from hundreds of milliseconds to several seconds on AWS S3).
In this talk, we show how architecture co-design, system-level optimizations, and workload-aware engineering can deliver over 1000× performance improvements for these workloads—without changing file formats, rewriting data paths, or provisioning expensive hardware.
We introduce a high-performance, low-latency S3 proxy layer powered by Alluxio, deployed atop hyperscale data lakes. This proxy delivers sub-millisecond Time-to-First-Byte (TTFB)—on par with Amazon S3 Express—while preserving compatibility with standard S3 APIs. In real-world benchmarks, a 50-node Alluxio cluster sustains over 1 million S3 queries per second, offering 50× the throughput of S3 Express for a single account, with no compromise in latency.
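Because the proxy speaks the standard S3 API, existing clients only need a different endpoint. A minimal sketch with boto3, assuming a hypothetical proxy endpoint:

```python
import boto3

# Standard S3 client pointed at the Alluxio proxy instead of AWS; the endpoint
# host, port, and path are assumptions; use your deployment's values.
s3 = boto3.client(
    "s3",
    endpoint_url="http://alluxio-proxy.internal:39999/api/v1/s3",
    aws_access_key_id="unused",
    aws_secret_access_key="unused",
)

# Same API calls as before; hot reads are served from Alluxio's cache.
obj = s3.get_object(Bucket="datasets", Key="features/part-00000.parquet")
payload = obj["Body"].read()
print(len(payload), "bytes")
```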
Beyond accelerating byte-level access to Parquet files, we also offload partial Parquet processing from query engines into Alluxio via a pluggable interface. This eliminates the need for costly index scans and file parsing, enabling point queries with 0.3 ms latency and up to 3,000 QPS per instance (measured single-threaded)—a 100× improvement over traditional query paths.
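To illustrate what partial Parquet processing saves, here is a hedged sketch of row-group pruning with PyArrow on a hypothetical file; the offload described above moves this kind of work out of the query engine and into Alluxio:

```python
import pyarrow.parquet as pq

target_key = 42  # hypothetical point-query key

# Only the footer (file metadata) is parsed when opening the file.
pf = pq.ParquetFile("features/part-00000.parquet")

# Row-group statistics let a point query skip most of the file entirely.
# Column 0 is assumed to be the sorted "key" column in this example.
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics
    if stats is not None and stats.min <= target_key <= stats.max:
        # Fetch just this row group, and just the needed columns.
        hit = pf.read_row_group(i, columns=["key", "value"])
        break
```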
Model training requires extensive computational and GPU resources. When training models on AWS, loading data from S3 often becomes a major bottleneck, wasting valuable GPU cycles. Optimizing data loading can greatly reduce GPU idle time and increase GPU utilization.
In this webinar, Greg Palmer discusses best practices for efficient data loading during model training on AWS. He demonstrates how to use Alluxio on EKS as a distributed cache to accelerate PyTorch training jobs that read datasets from S3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves ~5x faster training, and lowers cloud storage costs (a minimal sketch of the data path follows the list below).
What you will learn:
- The challenges of feeding data-hungry GPUs in the cloud
- How to accelerate model training by optimizing data loading on AWS
- The reference architecture for running PyTorch jobs with Alluxio cache on EKS while reading data from S3, with benchmark results of training ResNet50 and BERT
- How to use TensorBoard to identify bottlenecks in GPU utilization
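As referenced above, a minimal sketch of the training data path, assuming the S3 dataset is exposed through a hypothetical Alluxio FUSE mount point:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumption: the S3 bucket is mounted locally through Alluxio FUSE, so the
# job reads plain file paths while Alluxio caches hot data close to the GPUs.
DATA_ROOT = "/mnt/alluxio-fuse/imagenet/train"  # hypothetical mount point

train_set = datasets.ImageFolder(
    DATA_ROOT,
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

# Parallel workers, pinned memory, and prefetching keep the GPU fed.
loader = DataLoader(train_set, batch_size=256, shuffle=True,
                    num_workers=8, pin_memory=True, prefetch_factor=4)

for images, labels in loader:
    if torch.cuda.is_available():
        images = images.cuda(non_blocking=True)  # overlap H2D copy with compute
    # ... forward/backward pass ...
    break
```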
As enterprises race to roll out artificial intelligence, often overlooked are the infrastructure needs to support scalable ML model development and deployment. Efforts to effectively access and utilize GPUs often lead to extensive data engineering managing data copies or specialized storage, leading to out-of-control cloud and infrastructure costs.
To address the challenges, enterprises need a new data access layer to connect compute engines to data stores wherever they reside in distributed environments.
Join this webinar with Kevin Petrie, Eckerson Group VP of Research, and Sridhar Venkatesh, Alluxio SVP of Product, to explore tools, techniques, and best practices to remove data access bottlenecks and accelerate AI/ML model training. You will learn:
- Modern requirements for AI/ML model training and data engineering
- The challenges of GPU utilization in machine learning and the need for specialized hardware
- How a new data access layer connects compute to data stores across environments
- Best practices for optimizing ML training and guiding principles for success
Organizations are retooling their enterprise data infrastructure in the race for AI/ML. However, growing datasets, extensive data engineering overhead, high GPU costs, and expensive specialized storage can make it difficult to get fast results from model development.
The data access layer is the key to accelerating your path to AI/ML. In this webinar, Roland Theron, Senior Solutions Engineer at Alluxio, discusses how the data access layer can help you:
- Build AI architecture on your existing data lake without the need for specialized hardware.
- Streamline the time-consuming process of managing data copies in data engineering.
- Speed up training workloads with high GPU utilization.
- Achieve optimal concurrency to deliver models to inference clusters for demanding applications.
Join us with David Loshin, President of Knowledge Integrity, and Sridhar Venkatesh, SVP of Product at Alluxio, to learn more about the infrastructure hurdles associated with AI/ML model training and deployment and how to overcome them. Topics include:
- The challenges of AI and model training
- GPU utilization in machine learning and the need for specialized hardware
- Managing data access and maintaining a source of truth in data lakes
- Best practices for optimizing ML training
When training models on ultra-large datasets, one of the biggest challenges is low GPU utilization. These powerful processors are often underutilized due to inefficient I/O and data access. This mismatch between computation and storage leads to wasted GPU resources, low performance, and high cloud storage costs. The rise of generative AI and GPU scarcity is only making this problem worse.
In this webinar, Tarik and Beinan discuss strategies for transforming idle GPUs into fully utilized powerhouses, focusing on cost-effective management of ultra-large datasets for AI and analytics (a small instrumentation sketch follows the list below).
What you will learn:
- The challenges of I/O stalls leading to low GPU utilization for model training
- High-performance, high-throughput data access (I/O) strategies
- The benefits of using an on-demand data access layer over your storage
- How Uber addresses managing ultra-large datasets using high-density storage and caching
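As referenced above, a small, hypothetical instrumentation sketch for spotting the I/O stalls in a PyTorch training loop; it is not code from the talk:

```python
import time
import torch

def train_with_io_timing(loader, step_fn, epochs=1):
    """Split wall time into data-wait vs. compute to quantify I/O stalls."""
    wait_s = compute_s = 0.0
    for _ in range(epochs):
        t0 = time.perf_counter()
        for batch in loader:
            t1 = time.perf_counter()
            wait_s += t1 - t0             # time blocked waiting on the next batch
            step_fn(batch)                # forward/backward/optimizer step
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # count async GPU work as compute
            t0 = time.perf_counter()
            compute_s += t0 - t1
    stalled = 100 * wait_s / (wait_s + compute_s)
    print(f"data wait: {wait_s:.1f}s, compute: {compute_s:.1f}s "
          f"({stalled:.0f}% of the loop stalled on I/O)")
```

A high stall percentage is the signature of the I/O bottleneck this webinar addresses: the GPUs are idle not because the model is small, but because the data pipeline cannot keep up.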
As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for robust data infrastructures tailored for large language model (LLM) training and inference in the cloud. To effectively leverage the breakthroughs in LLM, organizations must ensure low latency, high concurrency, and scalability in production environments.
In this Alluxio-hosted webinar, Shouwei presented the design and implementation of a distributed caching system that addresses the I/O challenges of LLM training and inference. He explored the unique requirements of data access patterns and offered practical best practices for optimizing the data pipeline through distributed caching in the cloud. The session featured insights from real-world examples, such as Microsoft, Tencent, and Zhihu, as well as from the open-source community. Watch this recording for a deeper understanding of how to build scalable, efficient, and robust data infrastructure for LLM training and inference.
Shawn Sun, a software engineer at Alluxio, shares how to get started with Alluxio on Kubernetes in April’s Product School webinar.
To simplify the DevOps of running Alluxio alongside a query engine, Alluxio provides two ways to deploy on Kubernetes: a Helm chart and a Kubernetes operator. Both significantly simplify the deployment, configuration, and lifecycle management of Alluxio resources on Kubernetes (a minimal install sketch follows below).
Through this webinar, you will learn step-by-step how to deploy and run Alluxio on Kubernetes to accelerate analytics workloads.
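As referenced above, a minimal Helm-based install sketch, wrapped in Python for consistency with the other examples here; the chart repository URL and release name follow the pattern in Alluxio's documentation, but verify them against your Alluxio version:

```python
import subprocess

def run(cmd):
    """Echo and execute a shell command, failing loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Register the Alluxio chart repository, then install a release.
run(["helm", "repo", "add", "alluxio-charts",
     "https://alluxio-charts.storage.googleapis.com/openSource/alluxio"])
run(["helm", "install", "alluxio",
     "-f", "alluxio-values.yaml",   # hypothetical site-specific values file
     "alluxio-charts/alluxio"])
```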
In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, shares expert tips for tuning Trino performance. In addition, he demonstrates how to integrate Trino with Alluxio as a caching layer using connectors for Hive, Iceberg, Hudi, or Delta Lake.
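As a hedged sketch of what this integration looks like from the client side, assuming a hypothetical Trino coordinator whose Hive connector resolves table locations through Alluxio:

```python
import trino

# Coordinator host, catalog, and schema names are assumptions; the point is
# that the SQL is unchanged, and caching happens underneath because the
# connector's table locations resolve through Alluxio.
conn = trino.dbapi.connect(host="trino-coordinator.internal", port=8080,
                           user="analyst", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM page_views WHERE event_date = DATE '2024-01-01'")
print(cur.fetchone())
```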
In February’s Product School session, Greg Palmer, Lead Solution Engineer at Alluxio, presents a live demo of Transparent URI, a key feature of Alluxio Enterprise Edition that integrates Alluxio with your existing data stack without any changes to the location metadata in the Hive Metastore. Join us to learn the configurations and other advanced settings for using Transparent URI to simplify the DevOps of an Alluxio deployment, allowing users to access their existing storage systems without changing URIs at the application level.
In November’s Product School, Adit Madan, Director of Product Management at Alluxio, highlights the new features, enhanced manageability, and improved security and performance in the Alluxio 2.9 release.
Big Data Bellevue Meetup
May 19, 2022
Today, data engineering in modern enterprises has become increasingly complex and resource-consuming, particularly because (1) the rich amount of organizational data is often distributed across data centers, cloud regions, or even cloud providers, and (2) the complexity of the big data stack has grown quickly over the past few years, with an explosion of big-data analytics and machine-learning engines (MapReduce, Hive, Spark, Presto, TensorFlow, and PyTorch, to name a few).
To address these challenges, it is critical to provide a single and logical namespace to federate different storage services, on-prem or cloud-native, to abstract away the data heterogeneity, while providing data locality to improve the computation performance. [Bin Fan] will share his observation and lessons learned in designing, architecting, and implementing such a system – Alluxio open-source project — since 2015.
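A toy sketch of the single-logical-namespace idea, with made-up mount points and storage URIs; a real system like Alluxio implements this with a mount table plus caching and metadata synchronization:

```python
# Mount points and storage URIs below are made up for the example.
MOUNT_TABLE = {
    "/warehouse": "s3://corp-datalake/warehouse",
    "/logs":      "hdfs://onprem-nn:8020/logs",
    "/models":    "gs://ml-artifacts/models",
}

def resolve(logical_path: str) -> str:
    """Translate a logical path into its physical storage URI."""
    # Longest-prefix match so nested mounts shadow their parents.
    for mount, target in sorted(MOUNT_TABLE.items(),
                                key=lambda kv: len(kv[0]), reverse=True):
        if logical_path == mount or logical_path.startswith(mount + "/"):
            return target + logical_path[len(mount):]
    raise FileNotFoundError(f"no mount covers {logical_path}")

print(resolve("/warehouse/sales/2024/part-0.parquet"))
# -> s3://corp-datalake/warehouse/sales/2024/part-0.parquet
```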
Alluxio originated at the UC Berkeley AMPLab (originally named Tachyon) and was initially proposed as a daemon service enabling Spark to share RDDs across jobs for performance and fault tolerance. Today, it has become a general-purpose, high-performance, and highly available distributed file system providing a generic data service that abstracts away complexity in data and I/O. Many companies and organizations, such as Uber, Meta, Tencent, TikTok, and Shopee, run Alluxio in production as a building block in their data platforms to create a data abstraction and access layer. We will talk about the journey of this open-source project, especially its design challenges in tiered metadata storage (based on RocksDB), the embedded replicated state machine for high availability (based on Raft), and the evolution of its RPC framework (based on gRPC).
Meetup Group
Big Data Bellevue: https://www.meetup.com/big-data-bellevue-bdb/
Big Data Bellevue & Cloudy With a Chance of Data Meetup
October 20, 2022
Distributed systems are made up of many components: authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. Coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, ensuring service availability, name resolution, and storing other system metadata. Given their central role, it is essential that these services remain available, fault-tolerant, and consistent. By providing a highly available, file-system-like abstraction as well as powerful recipes such as leader election, Apache ZooKeeper is often used to implement these services. This talk walks through a generic example of a stateful coordination service moving from ZooKeeper to Raft.
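For context, a minimal leader-election sketch using the kazoo ZooKeeper client; the ensemble addresses, election path, and node id are placeholders:

```python
from kazoo.client import KazooClient

# Ensemble addresses, the election path, and the node id are placeholders.
zk = KazooClient(hosts="zk1.internal:2181,zk2.internal:2181,zk3.internal:2181")
zk.start()

def lead():
    # Runs only while this process holds leadership; returning releases it.
    print("elected leader: serving coordination state")

# Contenders block here; ZooKeeper's ephemeral sequential znodes pick a winner.
# This is exactly the kind of recipe a Raft-based replacement re-implements on
# top of its own replicated log.
zk.Election("/service/leader", "node-1").run(lead)
```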
Meetup Groups
Big Data Bellevue: https://www.meetup.com/big-data-bellevue-bdb/
Cloudy With a Chance of Data: https://www.meetup.com/meetup-datascience/