On-Demand Videos

Real-time OLAP databases are optimized for speed and often rely on tightly coupled storage-compute architectures using disks or SSDs. Decoupled architectures, which use cloud object storage, introduce an unavoidable tradeoff: cost efficiency at the expense of performance. This makes them unsuitable for databases that need to provide low-latency, real-time analytics, especially the new wave of LLM-powered dashboards, retrieval-augmented generation (RAG), and vector-embedding searches that thrive only when fresh data is milliseconds away. Can we achieve both cost efficiency and performance?
In this talk, we’ll explore the engineering challenges of extending Apache Pinot—a real-time OLAP system—onto cloud object storage while still maintaining sub-second P99 latencies.
We’ll dive into how we built an abstraction in Apache Pinot to make it agnostic to the location of data. We’ll explain how we can query data directly from the cloud (without needing to download the entire dataset, as with lazy-loading) while achieving sub-second latencies. We’ll cover the data fetch and optimization strategies we implemented, such as pipelining fetch and compute, prefetching, selective block fetches, index pinning, and more. We'll also share our latest work on integrating with open table formats like Iceberg, and how we will continue to achieve fast analytics directly on Parquet files by applying the same techniques we use for tiered storage.
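
As a rough illustration of pipelining fetch and compute with prefetching, the sketch below keeps a few block reads in flight while earlier blocks are being processed. It is not Pinot's implementation: `fetch_block`, `pipelined_sum`, and the dictionary standing in for object storage are all hypothetical names chosen for this example. Passing only the block IDs a query needs is a simple stand-in for selective block fetches.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(store, block_id):
    """Stand-in for a ranged read against object storage (hypothetical)."""
    return store[block_id]


def pipelined_sum(store, block_ids, prefetch_depth=2):
    """Sum all values while keeping up to `prefetch_depth` fetches in flight."""
    total = 0
    with ThreadPoolExecutor(max_workers=prefetch_depth) as pool:
        pending = []
        ids = iter(block_ids)
        # Prime the pipeline with the first few fetches.
        for block_id in ids:
            pending.append(pool.submit(fetch_block, store, block_id))
            if len(pending) >= prefetch_depth:
                break
        while pending:
            block = pending.pop(0).result()  # wait for the oldest fetch
            # Issue the next fetch before computing, so I/O overlaps compute.
            next_id = next(ids, None)
            if next_id is not None:
                pending.append(pool.submit(fetch_block, store, next_id))
            total += sum(block)  # the "compute" stage
    return total
```

With a real object store, `fetch_block` would issue a ranged GET; the overlap matters most when fetch latency is comparable to per-block compute time.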

The data lake is a fantastic, low-cost place to put data at rest for offline analytics, but we've built it under the terms of a terrible bargain: all that cheap storage at scale was a great thing, but we gave up schema management and transactions along the way. Apache Iceberg has emerged as king of the Open Table Formats to fix this very problem.
Built on the foundation of Parquet files, Iceberg adds a simple yet flexible metadata layer and integration with standard data catalogs to provide robust schema support and ACID transactions to the once ungoverned data lake. In this talk, we'll build Iceberg up from the basics, see how the read and write paths work, and explore how it supports streaming data sources like Apache Kafka™. Then we'll see how Confluent's Tableflow brings Kafka together with open table formats like Iceberg and Delta Lake to make operational data in Kafka topics instantly visible to the data lake without the usual ETL, bridging the operational/analytical divide that has been with us for decades.

Storing data as Parquet files on S3 is increasingly used not just as a data lake but also as a lightweight feature store for ML training/inference or a document store for RAG. However, querying petabyte- to exabyte-scale data lakes directly from cloud object storage remains notoriously slow (e.g., latencies ranging from hundreds of milliseconds to several seconds on AWS S3).
In this talk, we show how architecture co-design, system-level optimizations, and workload-aware engineering can deliver over 1000× performance improvements for these workloads—without changing file formats, rewriting data paths, or provisioning expensive hardware.
We introduce a high-performance, low-latency S3 proxy layer powered by Alluxio, deployed atop hyperscale data lakes. This proxy delivers sub-millisecond Time-to-First-Byte (TTFB)—on par with Amazon S3 Express—while preserving compatibility with standard S3 APIs. In real-world benchmarks, a 50-node Alluxio cluster sustains over 1 million S3 queries per second, offering 50× the throughput of S3 Express for a single account, with no compromise in latency.
Beyond accelerating byte-to-byte access to Parquet files, we also offload partial Parquet processing from query engines into Alluxio via a pluggable interface. This eliminates the need for costly index scans and file parsing, enabling point queries with a latency of 0.3 microseconds and up to 3,000 QPS per instance (measured using a single thread), a 100× improvement over traditional query paths.
ALLUXIO COMMUNITY OFFICE HOUR
We are extremely excited to announce the release of Alluxio 2.4.0!
Alluxio 2.4.0 focuses on features critical to large-scale production deployments in cloud and hybrid cloud environments. Features such as highly scalable metadata journaling, aggregate cluster metrics monitoring, and automated detection of JVM pauses further improve Alluxio’s suitability for demanding workloads. DevOps tools are also key for triaging issues when they occur, so Alluxio 2.4 further improves the cluster-wide log collection framework. Finally, Alluxio is continually expanding its state-of-the-art integrations with frameworks and storage systems. Alluxio 2.4 introduces and improves integrations with Kubernetes, Azure Data Lake Storage, and Apache Ozone. Alluxio 2.4 is also the first Alluxio release to support Java 11.
In this Office Hour, we will go over:
- Expanded metadata service
- Cloud native deployment
- Simplified DevOps and system monitoring
- Support for Java 11
ALLUXIO COMMUNITY OFFICE HOUR
In this talk, we describe the architecture to migrate analytics workloads incrementally to any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) directly on on-prem data without copying the data to cloud storage.
In this Office Hour:
- We will go over an architecture for running elastic compute clusters in the cloud using on-prem HDFS.
- We will have a casual online video chat with Alluxio open source core maintainers to address any Alluxio-related questions from our community members.
Over the last few years, organizations have worked towards the separation of storage and compute for a number of benefits in the areas of cost, data duplication, and data latency. The cloud resolves most of these issues, but at the expense of needing a way to query data on remote storage. Alluxio and Presto are a powerful combination for addressing the compute problem, and they are part of the strategy used by Simbiose Ventures to create a product called StorageQuery, a platform for querying files in cloud storage with SQL.
This talk will focus on:
- How Alluxio fits into StorageQuery’s tech stack
- Advantages of using Alluxio as a cache layer and its unified filesystem
- Development of a new under file system for Backblaze B2 and fine-grained code documentation
- ShannonDB remote storage mode
ALLUXIO COMMUNITY OFFICE HOUR
Alluxio 2.3 was released at the end of June 2020. Calvin and Bin will go over the new features and integrations available and share learnings from the community. Any questions about the release and ongoing community feature development are welcome.
In this Office Hour, we will go over:
- Glue Under Database integration
- Under Filesystem mount wizard
- Tiered Storage Enhancements
- Concurrent Metadata Sync
- Delegated Journal Backups
The hybrid cloud model, where cloud resources run Spark or Presto jobs against data stored on-premises, is an appealing way to reduce resource contention in on-premises environments while also lowering overall costs. One key flaw in a hybrid model is the overhead of transferring data between the two environments. Data and metadata locality within the compute application must be achieved for analytics jobs to perform as well as they would if the entire workload ran on-premises.
In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the process of manually copying and synchronizing data from the on-premise data lake to cloud storage. This approach allows compute frameworks to decouple from on-premise data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.
We will cover:
- Typical challenges of moving data to the cloud and expanding compute capacity.
- Details about the “zero-copy” hybrid cloud solution for burst computing
- A demo of running Presto analytic queries using remote on-prem HDFS data with Alluxio deployed in AWS EMR
ALLUXIO TECH TALK
As the amount of data analyzed and stored continues to grow exponentially, fixed on-premises infrastructure like Apache Hadoop data lakes becomes costly. Add to that the need to support newer, popular frameworks on an already busy data lake, and it is not uncommon to see Hadoop-based data lakes running at beyond 100% utilization, with hybrid processing split between physical and cloud infrastructure. As a result, companies are looking to leverage the flexibility and cost savings of the cloud.
Join us for this tech talk where we will show you how Alluxio can help burst your private computing environment to Google Cloud, minimizing costs and I/O overhead. Alluxio coupled with Google’s open source data and analytics processing engine, Dataproc, enables zero-copy burst for faster query performance in the cloud so you can take advantage of resources that are not local to your data, without the need for managing the copying or syncing of that data.
We’ll also show a demo on how to get up and running with Alluxio and Dataproc, including how to:
- Set up your hybrid environment between your private datacenter and Google Cloud Platform
- Burst a Spark-based machine learning algorithm to Dataproc while accessing on-prem data
- Scale analytic workloads directly on data on-prem without copying and synchronizing the data into the cloud
ALLUXIO COMMUNITY OFFICE HOUR
Today’s conventional wisdom states that network latency between the two ends of a hybrid cloud prevents you from running analytic workloads in the cloud with the data on-prem. As a result, most companies copy their data into a cloud environment and maintain that duplicate data. All of this means that it is challenging to make on-prem HDFS data accessible while also delivering the desired application performance.
In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.
In this Office Hour, we will go over:
- A strategy to embrace the hybrid cloud, including an architecture for running ephemeral compute clusters using on-prem HDFS.
- An example of running on-demand Presto, Spark, and Hive with Alluxio in the public cloud.
- An analysis of experiments with TPC-DS to demonstrate the benefits of the given architecture.
ALLUXIO COMMUNITY OFFICE HOUR
Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed.
This talk shares our design, implementation, and optimization of the Alluxio metadata service (master node) to address these scalability challenges. In particular, we will focus on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and the evaluation and performance tuning of RPC frameworks (Thrift vs. gRPC). As a result of combining these techniques, Alluxio 2.0 is able to store at least 1 billion files with a significantly reduced memory requirement, serving 3,000 workers and 30,000 clients concurrently.
In this Office Hour, we will go over:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of Alluxio metadata service
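
To make the idea of tiered metadata storage concrete, here is a minimal sketch: a bounded in-memory tier holds recently used inodes, and colder entries spill to a slower backing store. It is an illustration only, not Alluxio's implementation. The class name `TieredMetadataStore` is hypothetical, a plain dictionary stands in for an on-disk KV store such as RocksDB, and LRU eviction is an assumed policy.

```python
from collections import OrderedDict


class TieredMetadataStore:
    """Keeps hot entries in a bounded in-memory tier and spills the rest."""

    def __init__(self, cache_capacity, backing=None):
        self.cache_capacity = cache_capacity
        # Hot tier, ordered from least to most recently used.
        self.cache = OrderedDict()
        # Stand-in for an on-disk KV store (assumption for this sketch).
        self.backing = backing if backing is not None else {}

    def put(self, path, inode):
        self.cache[path] = inode
        self.cache.move_to_end(path)
        self._evict_if_needed()

    def get(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)  # refresh recency on a hit
            return self.cache[path]
        inode = self.backing.get(path)    # miss: fall back to the slower tier
        if inode is not None:
            self.put(path, inode)         # promote it back into the hot tier
        return inode

    def _evict_if_needed(self):
        while len(self.cache) > self.cache_capacity:
            cold_path, cold_inode = self.cache.popitem(last=False)
            self.backing[cold_path] = cold_inode  # spill, never drop
```

The memory footprint is bounded by `cache_capacity` regardless of how many files the namespace holds, which is the property that lets the file count grow beyond what fits on the heap.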
ALLUXIO COMMUNITY OFFICE HOUR
Presto, an open-source distributed SQL engine, is commonly used to query an existing Hive data warehouse. Because of existing applications, tech debt, or past operational challenges, Presto may not be able to achieve its full potential and instead remains bound by past decisions. In particular, challenges include an overloaded Hive Metastore with slow and unpredictable access, unoptimized data formats and layouts such as too many small files, and a lack of influence over the existing Hive system and other Hive applications.
Ideally, Presto would access data independently of how the data was originally stored or managed. Alluxio, as a data orchestration layer, provides this physical data independence so that Presto can interact with the data more efficiently. In addition to caching for I/O acceleration, Alluxio provides a catalog service to abstract the metadata in the Hive Metastore, and transformations to expose the data in a compute-optimized way. In this talk, we describe some of the challenges of using Presto with Hive and introduce Alluxio data orchestration as a way to solve them.
In this Office Hour, we will go over:
- Typical challenges of using Presto with Hive
- Overview of the different services of Alluxio Structured Data Management in Alluxio 2.1
- A demo of using Alluxio Structured Data Management with Presto
ALLUXIO COMMUNITY OFFICE HOUR
Accessing data to run analytic workloads in Spark across data centers and/or clouds can be challenging. Additionally, network I/O can bottleneck Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer, manually copy the input data to HDFS first, and purge it afterward. But this ETL process can be both time-consuming and error-prone.
A more efficient and simpler solution is to run Spark on Alluxio as a distributed cache on top of the remote data source. While caching data transparently based on access patterns and storing the working set closer, Alluxio provides Spark jobs much higher I/O throughput with enhanced data locality. In addition, Alluxio also provides data accessibility and abstraction for deployments in hybrid and multi-cloud environments.
In this Office Hour, we will go over how to:
- Burst on-prem Spark workloads to the cloud with Alluxio so Spark can seamlessly read from and write to remote data storage
- Use Alluxio as the input/output for Spark applications
- Save and load Spark RDDs and DataFrames with Alluxio
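
As a hedged sketch of what saving and loading a DataFrame through Alluxio can look like in PySpark: the helper names `alluxio_uri` and `round_trip` and the host name `alluxio-master` are placeholders for this example, and the code assumes an existing SparkSession whose classpath includes the Alluxio client jar. Port 19998 is the default Alluxio master RPC port.

```python
def alluxio_uri(path, host="alluxio-master", port=19998):
    """Build an alluxio:// URI; the host name here is a placeholder."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def round_trip(spark, path):
    """Write a small DataFrame through Alluxio, then read it back.

    `spark` is an existing SparkSession with the Alluxio client on its
    classpath (an assumption of this sketch).
    """
    uri = alluxio_uri(path)
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.write.mode("overwrite").parquet(uri)
    return spark.read.parquet(uri)
```

RDDs follow the same pattern, for example `rdd.saveAsTextFile(uri)` to write and `spark.sparkContext.textFile(uri)` to read.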
ALLUXIO COMMUNITY OFFICE HOUR
Building distributed systems is no small feat. Software testing is just one of many critical practices that engineers who build these systems rely on to ensure the quality and usability of their software. For distributed systems, scaling out testing frameworks to ensure quality for enterprises that run our software in highly distributed environments is a complicated (and expensive!) task.
In this online meetup, you will learn about:
- How the engineers at Alluxio have approached testing at scale
- Approaches to maintaining distributed systems at scale
ALLUXIO COMMUNITY OFFICE HOUR
Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.
In this office hour, you will learn about:
- How to set up Alluxio with the EMR stack so that Presto jobs can seamlessly read from and write to S3
- A performance comparison between Presto on EMR and Presto with Alluxio on EMR
- Open Session for discussion on any topics such as solving the separation of compute and storage problem, and more