On-Demand Videos

Real-time OLAP databases are optimized for speed and often rely on tightly coupled storage-compute architectures using disks or SSDs. Decoupled architectures, which use cloud object storage, introduce an unavoidable tradeoff: cost efficiency at the expense of performance. This makes them unsuitable for databases that need to provide low-latency, real-time analytics, especially the new wave of LLM-powered dashboards, retrieval-augmented generation (RAG), and vector-embedding searches that thrive only when fresh data is milliseconds away. Can we achieve both cost efficiency and performance?
In this talk, we’ll explore the engineering challenges of extending Apache Pinot—a real-time OLAP system—onto cloud object storage while still maintaining sub-second P99 latencies.
We’ll dive into how we built an abstraction in Apache Pinot to make it agnostic to the location of data. We’ll explain how we can query data directly from the cloud (without needing to download the entire dataset, as with lazy-loading) while achieving sub-second latencies. We’ll cover the data fetch and optimization strategies we implemented, such as pipelining fetch and compute, prefetching, selective block fetches, index pinning, and more. We'll also share our latest work about integration with open table formats like iceberg, and how we will continue to achieve fast analytics directly on parquet files by implementing all the same techniques that apply to tiered storage.

The data lake is a fantastic, low-cost place to put data at rest for offline analytics, but we've built it under the terms of a terrible bargain: all that cheap storage at scale was a great thing, but we gave up schema management and transactions along the way. Apache Iceberg has emerged as king of the Open Table Formats to fix this very problem.
Built on the foundation of Parquet files, Iceberg adds a simple yet flexible metadata layer and integration with standard data catalogs to provide robust schema support and ACID transactions to the once ungoverned data lake. In this talk, we'll build Iceberg up from the basics, see how the read and write path work, and explore how it supports streaming data sources like Apache Kafka™. Then we'll see how Confluent's Tableflow brings Kafka together with open table formats like Iceberg and Delta Lake to make operational data in Kafka topics instantly visible to the data lake without the usual ETL—unifying the operational/analytical divide that has been with us for decades.

Storing data as Parquet files on S3 is increasingly used not just as a data lake but also as a lightweight feature store for ML training/inference or a document store for RAG. However, querying petabyte- to exabyte-scale data lakes directly from cloud object storage remains notoriously slow (e.g., latencies ranging from hundreds of milliseconds to several seconds on AWS S3).
In this talk, we show how architecture co-design, system-level optimizations, and workload-aware engineering can deliver over 1000× performance improvements for these workloads—without changing file formats, rewriting data paths, or provisioning expensive hardware.
We introduce a high-performance, low-latency S3 proxy layer powered by Alluxio, deployed atop hyperscale data lakes. This proxy delivers sub-millisecond Time-to-First-Byte (TTFB)—on par with Amazon S3 Express—while preserving compatibility with standard S3 APIs. In real-world benchmarks, a 50-node Alluxio cluster sustains over 1 million S3 queries per second, offering 50× the throughput of S3 Express for a single account, with no compromise in latency.
Beyond accelerating access to Parquet files byte-to-byte, we also offload partial Parquet processing from query engines via a pluggable interface into Alluxio. This eliminates the need for costly index scans and file parsing, enabling point queries with 0.3 microseconds latency and up to 3,000 QPS per instance (measured using a single-thread)—a 100× improvement over traditional query paths.
.png)
Enterprises everywhere are racing to build the optimal analytics stack for creating repeatable success with predictive analytics, machine learning, and data applications. Cloud data platforms like data warehouses and data lakes are foundational elements of these software stacks and their associated data pipelines. But existing SQL query methods against these data platforms have repeatedly demonstrated disappointing performance and scaling due to poor concurrency.
In this presentation, we will discuss the use of the intelligent precomputation capabilities of Kyligence Cloud as a means of delivering on the promise of pervasive analytics at scale with massive concurrency and sub-second query latencies on large datasets in the cloud.
Kyligence, with our partner Alluxio, sits between the data platform and the processing layer. Kyligence Cloud delivers precomputed datasets for OLAP queries, BI dashboards, and machine learning applications.
In most of the distributed storage systems, the data nodes are decoupled from compute nodes. This is motivated by an improved cost efficiency, storage utilization and a mutually independent scalability of computation and storage. While this consideration is indisputable, several situations exist where moving computation close to the data brings important benefits. Whenever the stored data is to be processed for analytics purposes, all the data needs to be repeatedly moved from the storage to the compute cluster, which leads to reduced performance.
In this talk, we will present how using Alluxio computation and storage ecosystems can better interact benefiting of the “bringing the data close to the code” approach. Moving away from the complete disaggregation of computation and storage, data locality can enhance the computation performance. During this talk, we will present our observations and testing results that will show important enhancements in accelerating Spark Data Analytics on Ceph Objects Storage using Alluxio.
At PayPal & any other data driven enterprise – data users & applications work with a variety of data sources (RDBMS, NoSQL, Messaging, Documents, Big Data, Time Series Databases), compute engines (Spark, Flink, Beam, Hive), languages (Scala, Python, SQL) and execution models (stream, batch, interactive) to process petabytes of data. Due to this complex matrix of technologies and thousands of datasets, engineers spend considerable time learning about different data sources, formats, programming models, APIs, optimizations, etc. which impacts time-to-market (TTM).
To solve this problem and to make product development more effective, PayPal Data Platforms developed “Gimel”, an open source, unified analytics data platform which provides access to any storage through a single unified data API and SQL, which are powered by a centralized data catalog.
In this talk, Baolong Mao from Tencent will share his experience in developing Apache Ozone under file system, showing how to create a new Under File System in a few steps with minimal lines of code.
JD.com is one of the largest e-commerce corporations. In big data platform of JD.com, there are tens of thousands of nodes and tens of petabytes off-line data which require millions of spark and MapReduce jobs to process everyday. As the main query engine, thousands of machines work as Presto nodes and Presto plays an import role in the field of In-place analysis and BI tools. Meanwhile, Alluxio is deployed to improve the performance of Presto. The practice of Presto & Alluxio in JD.com benefits a lot of engineers and analysts.
Data platforms span multiple clusters, regions and clouds to meet the business needs for agility, cost effectiveness, and efficiency. Organizations building data platforms for structured and unstructured data have standardized on separation of storage and compute to remain flexible while avoiding vendor lock-in. Data orchestration has emerged as the foundation of such a data platform for multiple use cases all the way from data ingestion to transformations to analytics and AI.
In this keynote from Haoyuan Li, founder and CEO of Alluxio, we will showcase how organizations have built data platforms based on data orchestration. The need to simplify data management and acceleration across different business personas has given rise to data orchestration as a requisite piece of the modern data platform. In addition, we will outline typical journeys for realizing a hybrid and multi-cloud strategy.
In this keynote, Calvin Jia will share some of the hottest use cases in Alluxio 2 and discuss the future directions of the project being pioneered by Alluxio and the community. Bin Fan will provide an overview of the growth of Alluxio open-source community with highlights on community-driven collaboration with engineering teams from Microsoft and Alibaba to advance the technology.
Distributed applications are not new. The first distributed applications were developed over 50 years ago with the arrival of computer networks, such as ARPANET. Since then, developers have leveraged distributed systems to scale out applications and services, including large-scale simulations, web serving, and big data processing. However, until recently, distributed applications have been the exception, rather than the norm. However, this is changing quickly. There are two major trends fueling this transformation: the end of Moore’s Law and the exploding computational demands of new machine learning applications. These trends are leading to a rapidly growing gap between application demands and single-node performance which leaves us with no choice but to distribute these applications. Unfortunately, developing distributed applications is extremely hard, as it requires world-class experts. To make distributed computing easy, we have developed Ray, a framework for building and running general-purpose distributed applications.
We introduce Data Orchestration Hub, a management service that makes it easy to build an analytics or machine learning platform on data sources across regions to unify data lakes. Easy to use wizards connect compute engines, such as Presto or Spark, to data sources across data centers or from a public cloud to a private data center. In this session, you will witness the use of “The Hub” to connect a compute cluster in the cloud with data sources on-premises using Alluxio. This new service allows you to build a hybrid cloud on your own, without the expertise needed to manage or configure Alluxio.
In this keynote, you will learn about the evolution of the global data platform at Rakuten spread across multiple regions, and clouds. In addition, you will hear about the journey across the years, and the use of data orchestration for multiple use cases.
Over the years, Alluxio has grown significantly to be the data orchestration framework for the cloud. The community developers and users have contributed a lot of effort and innovation to make Alluxio the system it is today. There are many users and companies deploying Alluxio at very large scale, and with the large scale, comes different types of challenges.
In this talk, I will introduce the high-level architecture of the current system, and present the various components of Alluxio. Also, I will discuss some of the main challenges of large scale Alluxio deployments, and the lessons we learned from those environments. This talk will detail some of the major scalability improvements added in the past several months, and how users can benefit from the changes.