ALLUXIO COMMUNITY OFFICE HOUR
Users deploy Alluxio in a wide range of use cases from analytics to AI platforms, for Alluxio’s unified access to data and transparent caching for acceleration. However, many frameworks are SQL engines, like Presto, Apache Spark SQL, or Apache Hive, and consume data structured as tables of rows and columns. Since Alluxio is commonly used as a filesystem of files and directories, there is a mismatch between how Alluxio exposes data (files, directories), and how SQL engines deal with data (tables, rows, columns). This gap creates various challenges and inefficiencies.
Therefore, in the Alluxio 2.1 release, we introduce Alluxio Structured Data Management, which is a new set of services that enables structured data applications to interact with data more efficiently. The new services include the catalog service and a transformation service, which all work together to bridge the gap between storage and SQL engines and enable physical data independence.
In this office hour, we introduce the concepts and components of Alluxio Structured Data Management, and go through a demo with Presto.
In this Office Hour we’ll go over:
- Introduction and motivation of Alluxio Structured Data Management
- Overview of the different services of Alluxio Structured Data Management in Alluxio 2.1
- A demo of using Alluxio Structured Data Management with Presto
ALLUXIO COMMUNITY OFFICE HOUR
Users deploy Alluxio in a wide range of use cases from analytics to AI platforms, for Alluxio’s unified access to data and transparent caching for acceleration. However, many frameworks are SQL engines, like Presto, Apache Spark SQL, or Apache Hive, and consume data structured as tables of rows and columns. Since Alluxio is commonly used as a filesystem of files and directories, there is a mismatch between how Alluxio exposes data (files, directories), and how SQL engines deal with data (tables, rows, columns). This gap creates various challenges and inefficiencies.
Therefore, in the Alluxio 2.1 release, we introduce Alluxio Structured Data Management, which is a new set of services that enables structured data applications to interact with data more efficiently. The new services include the catalog service and a transformation service, which all work together to bridge the gap between storage and SQL engines and enable physical data independence.
In this office hour, we introduce the concepts and components of Alluxio Structured Data Management, and go through a demo with Presto.
In this Office Hour we’ll go over:
- Introduction and motivation of Alluxio Structured Data Management
- Overview of the different services of Alluxio Structured Data Management in Alluxio 2.1
- A demo of using Alluxio Structured Data Management with Presto
Videos:
Presentation Slides:
Complete the form below to access the full overview:
.png)
Videos
Nilesh Agarwal, Co-founder & CTO at Inferless, shares insights on accelerating LLM inference in the cloud using Alluxio, tackling key bottlenecks like slow model weight loading from S3 and lengthy container startup time. Inferless uses Alluxio as a three-tier cache system that dramatically cuts model load time by 10x.

In this talk, Jingwen Ouyang, Senior Product Manager at Alluxio, will share how Alluxio make it easy to share and manage data from any storage to any compute engine in any environment with high performance and low cost for your model training, model inference, and model distribution workload.

Storing data as Parquet files on cloud object storage, such as AWS S3, has become prevalent not only for large-scale data lakes but also as lightweight feature stores for training and inference, or as document stores for Retrieval-Augmented Generation (RAG). However, querying petabyte-to-exabyte-scale data lakes directly from S3 remains notoriously slow, with latencies typically ranging from hundreds of milliseconds to several seconds.
In this webinar, David Zhu, Software Engineering Manager at Alluxio, will present the results of a joint collaboration between Alluxio and a leading SaaS and data infrastructure enterprise that explored leveraging Alluxio as a high-performance caching and acceleration layer atop AWS S3 for ultra-fast querying of Parquet files at PB scale.
David will share:
- How Alluxio delivers sub-millisecond Time-to-First-Byte (TTFB) for Parquet queries, comparable to S3 Express One Zone, without requiring specialized hardware, data format changes, or data migration from your existing data lake.
- The architecture that enables Alluxio’s throughput to scale linearly with cluster size, achieving one million queries per second on a modest 50-node deployment, surpassing S3 Express single-account throughput by 50x without latency degradation.
- Specifics on how Alluxio offloads partial Parquet read operations and reduces overhead, enabling direct, ultra-low-latency point queries in hundreds of microseconds and achieving a 1,000x performance gain over traditional S3 querying methods.
Speaker: David Zhu
David Zhu is a Software Engineer Manager at Alluxio. At Alluxio, David focuses on metadata management and end-to-end performance benchmarking and optimizations. Prior to that, David completed his Ph.D. from UC Berkeley, with a focus on distributed data management systems and operating systems for the data center. David also holds a Bachelor of Software Engineering from the University of Waterloo.