Meet in the Middle for a 1,000x Performance Boost Querying Parquet Files on Petabyte-Scale Data Lakes
Authors: David Zhu, Software Engineering Manager at Alluxio; Yuyang Wang, Senior Software Engineer at Alluxio; Bin Fan, VP of Technology at Alluxio

Storing data as Parquet files on cloud object storage, such as AWS S3, has become prevalent not only for large-scale data lakes but also as lightweight feature stores for training and inference, or as document stores for Retrieval-Augmented Generation (RAG). However, querying petabyte- to exabyte-scale data lakes directly from S3 remains notoriously slow, with latencies typically ranging from hundreds of milliseconds to several seconds.

This article introduces how to leverage Alluxio as a high-performance caching and acceleration layer atop hyperscale data lakes for queries on Parquet files. Without using specialized hardware, changing data formats or object addressing schemes, or migrating data from data lakes, Alluxio delivers sub-millisecond Time-to-First-Byte (TTFB) performance comparable to AWS S3 Express One Zone. Furthermore, Alluxio’s throughput scales linearly with cluster size; a modest 50-node deployment can achieve one million queries per second, surpassing the single-account throughput of S3 Express by 50× without latency degradation.

In addition to serving Parquet files efficiently at the byte level, Alluxio offloads partial Parquet read operations from query engines through a pluggable interface, significantly reducing overhead associated with file parsing and index lookups. Consequently, Alluxio enables direct, ultra-low-latency point queries on Parquet files, achieving query latencies of hundreds of microseconds per query and 3,000 queries per second on a single thread—representing a 1,000x performance gain over traditional methods of querying Parquet files stored on S3.

1 Introduction

Today, it is extremely common for data-driven organizations to store and serve Parquet files directly on cloud object storage for good reasons:

  • Efficient Structured Data Access: Storing structured data in columnar formats such as Parquet significantly accelerates analytical queries, establishing Parquet as the default data lake format for OLAP workloads.
  • Scalable and Cost-Effective Storage: Cloud object storage solutions, notably AWS S3, have become the preferred storage backend for data lakes due to their low cost, reliability, durability, and, importantly, their capability to manage data at scales ranging from terabytes (TB) to exabytes (EB). These solutions excel at delivering high throughput, making them ideal for throughput-sensitive tasks such as large-scale analytics.

However, emerging latency-sensitive AI applications increasingly challenge this paradigm of directly querying Parquet files from cloud object stores. Two particularly demanding categories are Retrieval-Augmented Generation (RAG) and AI feature store workloads, both requiring sub-millisecond retrieval. Standard S3 services are not designed for such low-latency access. While premium solutions like S3 Express offer millisecond-level latency, they are five times more expensive than standard S3 and impose strict per-account throughput limits, rendering them prohibitive for large-scale deployments.

A common workaround for this limitation is caching data closer to compute engines. Many data lake query engines, such as Trino and Presto, employ internal caching to minimize repeated data fetches from object stores by using compute-local storage resources. However, these solutions face inherent limitations: they only scale up to the local storage capacity and cannot be effectively shared across applications (see Speed Up Presto at Uber with Alluxio Local Cache and the Trino and Alluxio documentation).

We introduce Alluxio as a high-performance intermediate layer specifically to bridge the latency gap. Positioned “in the middle” between compute and storage, Alluxio finds a sweet spot in the design space balancing latency, scalability, and cost. More importantly, Alluxio’s role as a caching layer provides great flexibility to offload and execute predicate pushdown operations close to the hot data instead of across the entire data lake.

Through architectural co-design, system engineering, and workload-specific optimizations, Alluxio demonstrates the feasibility of achieving more than 1,000× performance improvements for critical latency-sensitive workloads at cloud scale over querying Parquet files directly on AWS S3. We made the following contributions:

  • Single-Node Optimization: A single Alluxio worker instance, accessed through its S3 API, delivers sub-millisecond Time-to-First-Byte (TTFB) latency, achieving a 100x improvement over standard S3 while matching S3 Express performance.

  • Scalable Distributed Layer: Alluxio supports horizontal scaling, efficiently handling petabyte-scale datasets without compromising latency.

  • Computation Offloading: By selectively offloading partial Parquet operations from query engines to Alluxio, we investigate how computation can be effectively distributed and what interfaces are required. This provides an additional 100x improvement in end-to-end point query performance.

2 Target Workload and Design Objectives

This section characterizes the core workload this article focuses on: executing point lookup queries over large-scale, partitioned Parquet files with sub-millisecond latency.

Schema 

The target tables are stored as Parquet files in AWS S3, partitioned and sorted by a unique id column. Query clients operate within the same AWS region to minimize network latency. These datasets typically contain structured records such as user or product metadata, where the id commonly represents a user or product identifier.
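
For concreteness, here is a minimal sketch of this layout using PyArrow, where the sort step is what makes row-group and page statistics effective for point lookups. The file path, row-group size, and toy data are illustrative choices, not prescriptions from this article:

import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; real datasets hold user or product metadata keyed by id.
table = pa.table({"id": [3, 1, 2], "data": [b"c", b"a", b"b"]})
table = table.sort_by("id")  # sorted key => tight min/max stats per row group

pq.write_table(
    table,
    "table/part-00000.parquet",
    row_group_size=128 * 1024,  # smaller row groups allow finer-grained skipping
    write_page_index=True,      # emit column/offset indexes for page-level skips
)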

Query Pattern

The primary query of interest is:

SELECT id, data FROM table WHERE id = 123

The combination of a sorted primary key and Parquet's row group and page-level statistics allows for efficient skipping of irrelevant data.
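
At the engine level, this query corresponds to a pushdown read. A minimal sketch with PyArrow, where the bucket and table path are illustrative:

import pyarrow.dataset as ds

# Point lookup with predicate and projection pushdown: PyArrow consults
# row-group and page statistics to skip data that cannot contain id = 123,
# and reads only the two projected columns.
dataset = ds.dataset("s3://my-bucket/table/", format="parquet")
result = dataset.to_table(
    columns=["id", "data"],
    filter=ds.field("id") == 123,
)
print(result.to_pylist())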

Challenges

Although this query appears simple, executing it with low latency over cloud object storage introduces several challenges:

  • While S3 offers excellent scalability, querying petabyte-scale datasets typically incurs high latency, with time to first byte averaging hundreds of milliseconds. A single lookup also involves several round-trips to the Parquet file (footer, indexes, and data pages), which together can add up to seconds.

  • In-memory solutions such as Redis provide much lower latency but are prohibitively expensive at petabyte scale and require explicit data loading, introducing operational overhead. Data loading is also compute intensive, as it involves reading data from long-term storage formats like Parquet and rewriting it as key-value pairs in the Redis cache.

  • Even with services like S3 Express that provide sub-millisecond GetObject latency for small objects, achieving end-to-end query latency under 1 millisecond requires careful system-level and architectural optimization. Moreover, although S3 Express delivers excellent latency, it is 5× more expensive than standard S3 and imposes per-account throughput limits.

System Design Objectives

Alluxio is designed to simultaneously meet four key objectives: low latency, high throughput, high reliability, and low cost. While existing systems may satisfy some of these goals individually, few achieve all simultaneously.

  1. Low Latency: Deliver sub-millisecond average query latency using predicate and projection pushdown, zero-copy data transfer, a single-RPC execution path, and metadata caching.

  2. High Throughput: Support scalable query throughput exceeding one million queries per second as compute and storage nodes scale horizontally.

  3. High Reliability: Ensure data availability and fault tolerance under failure conditions, targeting availability comparable to S3 SLAs.

  4. Low Cost: Minimize memory and compute overhead. For reference, S3 Express One Zone costs $160 per TB per month compared to $23 for standard S3. Alluxio aims to deliver high performance without requiring changes to data formats or application logic, thus avoiding costly data format conversions and application changes.

To meet these objectives, Alluxio incorporates the following optimizations:

  • Eagerly caching data with zero-copy transfer: By colocating data with computation, we reduce transfer time and avoid egress costs when accessing data across S3 regions.

  • Predicate and projection pushdown: Filters data prior to transfer, saving both network bandwidth and end-to-end query time.

  • Single RPC execution path: Collapses multiple request steps into a single RPC, significantly reducing round-trip latency.

  • Metadata caching: Caches an optimal representation of Parquet metadata in Alluxio worker memory to avoid repeated reads and parsing, accelerating lookups.

3 Reducing Baseline Latency Using Distributed Caching

The first step toward unlocking sub-millisecond query latency is eliminating the costly round-trips (tens to hundreds of milliseconds) for every Parquet file access on S3. However, serving petabyte-scale Parquet datasets introduces a core challenge: keeping everything in memory (e.g., via Redis) is prohibitively expensive and doesn’t scale.

Instead, we designed Alluxio to intelligently cache high-value data on local NVMe SSDs as illustrated below in Figure 1, layered transparently over S3. This architecture delivers the best of both worlds — the low latency of fast local storage for frequently accessed data, and the scalability, durability, and cost-efficiency of cloud object storage for the full dataset.

Figure 1 Alluxio Architecture: Clients use consistent hashing on the file path to determine the responsible Alluxio worker node. The selected node handles the query, serving cached data if available or fetching it from the underlying object store (e.g., S3) via GetObject.

As shown in Figure 1, Alluxio breaks Parquet files into 4MB-aligned pages, enabling fine-grained, sub-file caching. This allows Alluxio to dramatically reduce both storage footprint and data transfer time, bringing frequently accessed data, such as Parquet footers and hot column chunks, closer to compute.
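
To make the paging concrete, here is a small sketch of how a byte range maps onto 4MB-aligned pages. The helper is illustrative, not Alluxio's actual API:

PAGE_SIZE = 4 * 1024 * 1024  # 4 MiB, per the page size described above

def pages_for_range(offset: int, length: int) -> range:
    """Return the indices of the cache pages covering [offset, offset+length)."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return range(first, last + 1)

# A Parquet footer read near the end of a 500 MB file touches only the
# final page, not the whole object:
print(list(pages_for_range(offset=524_288_000 - 64 * 1024, length=64 * 1024)))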

Each Alluxio worker automatically keeps the hottest data local using an LRU-based caching policy, while seamlessly falling back to S3 for cold data. Missing data is fetched on demand and cached asynchronously in the background — delivering the performance of local storage without sacrificing the scalability and economics of cloud object storage.

To deliver consistent low-latency performance, Alluxio uses client-side consistent hashing for data sharding and request routing. Clients calculate exactly which Alluxio worker holds the data, with no centralized metadata lookup and no extra hops, as sketched below. Cluster membership changes are handled dynamically using etcd, ensuring a fully distributed, stateless, and resilient architecture ready for modern cloud environments.
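
A minimal sketch of the client-side routing idea, assuming a static worker list for brevity (in Alluxio, the live membership list comes from etcd, and the addresses here are illustrative):

import bisect
import hashlib

WORKERS = ["worker-1:29999", "worker-2:29999", "worker-3:29999"]
VNODES = 100  # virtual nodes per worker smooth out the key distribution

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

# Build the hash ring once; rebuild only on membership changes.
ring = sorted((_hash(f"{w}#{i}"), w) for w in WORKERS for i in range(VNODES))
points = [p for p, _ in ring]

def worker_for(path: str) -> str:
    """Hash the file path onto the ring; no central metadata lookup needed."""
    idx = bisect.bisect(points, _hash(path)) % len(ring)
    return ring[idx][1]

print(worker_for("s3://my-bucket/table/part-00042.parquet"))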

Alluxio can preload petabyte-scale datasets in parallel across a fleet of worker nodes, maximizing throughput and minimizing time-to-ready. And with a fully S3-compatible API, Alluxio plugs directly into existing applications — no changes to data formats or query logic required.

To further reduce latency, each Alluxio worker incorporates several system-level optimizations:

  • Asynchronous Event Loop:
    Each Alluxio worker is built on a high-performance, asynchronous I/O framework. This enables non-blocking I/O with minimal context switching and thread contention—two major contributors to latency in traditional blocking I/O systems. Its event-driven model allows one worker instance to scale to thousands of concurrent connections while maintaining sub-millisecond responsiveness.

  • Off-Heap Page Storage on NVMe:
    Alluxio leverages NVMe SSDs to store cached pages off-heap. This design allows for significantly higher storage density without overwhelming memory resources, offering a favorable balance between cost and access latency.

  • Zero-Copy I/O:
    To avoid unnecessary memory copies and to reduce CPU load, Alluxio employs zero-copy I/O techniques using sendfile() and mmap(). These allow cached pages to be read directly from NVMe and transmitted over the network stack without copying through user space, enhancing both throughput and latency.
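
A minimal illustration of the zero-copy idea from Python, assuming a Linux host. Alluxio's actual implementation is native code; this sketch just shows the same system call at work:

import os
import socket

def serve_page(conn: socket.socket, page_path: str, offset: int, length: int) -> None:
    """Send `length` bytes of a cached page over `conn` without user-space copies."""
    with open(page_path, "rb") as f:
        sent = 0
        while sent < length:
            # The kernel streams bytes from the page file directly to the
            # socket; the data never passes through user-space buffers.
            n = os.sendfile(conn.fileno(), f.fileno(), offset + sent, length - sent)
            if n == 0:
                break
            sent += n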

Together, these optimizations allow Alluxio to bridge the performance gap between AWS S3's scalability and the low-latency demands of modern data-intensive applications. A single Alluxio worker instance running on NVMe achieves latency on par with S3 Express and up to 100× faster than S3 Standard. Throughput of a single Alluxio worker instance approaches the S3 Express per-account limit and scales linearly with additional nodes.

4 Improving Point-Lookup Query Latency Using Pushdown

The Parquet format [4] is designed with built-in indexes and rich metadata to enable fast SELECT queries. In highly selective queries, predicate and projection pushdown are commonly used to leverage metadata and reduce the amount of data read and transferred. This section explores how we offload pushdown operations to Alluxio, eliminating most round-trips, minimizing data movement, and accelerating point queries to achieve true sub-millisecond latency.

Pushdown with a Twist

If you run a point query naively — without any pushdown optimization — the engine will fetch the entire Parquet file from S3, load it into memory, and then scan through it to find the specific row with id = 123. This approach is clearly inefficient, especially when Parquet files can be hundreds of megabytes to gigabytes in size.

The good news is that most modern SQL engines support predicate and projection pushdown, techniques that leverage Parquet’s metadata to locate and read only the relevant parts of the file. In the case of a point query like SELECT id, data FROM table WHERE id = 123, pushdown allows the engine to avoid reading the entire file and instead fetch just the row or page containing id = 123. This is the typical approach we observe from many users today when querying Parquet files directly on S3, and we treat it as the baseline for our comparison.

In a cloud data lake, however, pushdown alone is often not enough to achieve low-latency queries. This is because the engine and the storage layer (e.g., AWS S3) are physically far apart, connected over the network with non-trivial latencies. Even with pushdown enabled, executing a point query still requires multiple remote GetObject calls: to read the Parquet footer, extract row group and page-level statistics, fetch indexes, and finally locate the target data page. Each of these operations involves random I/O patterns and remote RPCs to S3, turning what should be fast metadata lookups into a series of expensive round-trips. Figure 2 illustrates how these metadata structures are laid out within a typical Parquet file.

Figure 2: Parquet file layout, from Apache Parquet [1].

In traditional databases, these pointer-chasing operations are local, moving between CPU, memory, and disk. In a data lake, the same operations, originally optimized for local disk access, turn into remote, multi-hop RPC calls with random I/O patterns, making low latency unfeasible.
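
To make the pointer chasing concrete, here it is spelled out with PyArrow's lower-level API. When the file lives on S3, each numbered step below is at least one remote GetObject range request (the file path is illustrative, and column 0 is assumed to be id):

import pyarrow.compute as pc
import pyarrow.parquet as pq

f = pq.ParquetFile("table/part-00042.parquet")   # 1) fetch and parse the footer

md = f.metadata
for rg in range(md.num_row_groups):              # 2) consult row-group statistics
    stats = md.row_group(rg).column(0).statistics
    if stats.min <= 123 <= stats.max:
        # 3) fetch only that row group's pages for the projected columns
        t = f.read_row_group(rg, columns=["id", "data"])
        row = t.filter(pc.field("id") == 123)    # 4) final filter down to one row
        break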

To minimize unnecessary data movement and accelerate point queries end-to-end, we extend pushdown techniques in Alluxio by executing parts of the predicate and projection logic directly on the worker nodes—offloading filtering and projection operations to where the data is physically cached. This eliminates the need for multiple remote lookups across the Parquet file structure, reducing both latency and network overhead.

For optimal performance, the Alluxio workers, clients, and the underlying S3 bucket are all deployed within the same AWS region and Availability Zone. In our experiments, client nodes were provisioned using m6i.4xlarge instances, while Alluxio workers ran on i4i.2xlarge instances equipped with NVMe SSDs. The table itself is a 500 MB Parquet file with 12 columns of mostly binary types. The largest column contains about 15KB of data per row, and this column is part of the SELECT query's result.

We start by establishing the baseline performance: running a simple SELECT query with pushdown enabled directly against Parquet files on S3. This results in a query latency of 411ms.

Next, by caching the Parquet file in Alluxio, query latency drops to 232ms. This reduction comes primarily from avoiding S3 round-trip latencies and leveraging Alluxio’s NVMe cache. However, this setup still incurs multiple round-trips between the client and Alluxio for reading Parquet metadata and locating the relevant data pages.

To further optimize, we push predicate and projection evaluation entirely down to the Alluxio workers. In this setup, the client sends the query to the Alluxio worker, which performs filtering locally—either from the OS page cache or directly from NVMe storage. This transforms the query into a single network round-trip, returning only the filtered result set, and reduces latency to 42ms.
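
A hypothetical sketch of the worker-side handler, with the function and request shape invented for illustration (Alluxio's actual pushdown interface differs). The point is that the four numbered client-side steps in the earlier sketch collapse into one local function call behind a single RPC:

import json

import pyarrow.compute as pc
import pyarrow.parquet as pq

def handle_point_query(local_path: str, key: int, columns: list[str]) -> bytes:
    """One RPC in, one filtered result out: no client-side pointer chasing."""
    f = pq.ParquetFile(local_path)  # served from the OS page cache or NVMe
    md = f.metadata
    for rg in range(md.num_row_groups):
        stats = md.row_group(rg).column(0).statistics  # column 0 = id
        if stats.min <= key <= stats.max:
            t = f.read_row_group(rg, columns=columns)
            match = t.filter(pc.field("id") == key)
            return json.dumps(match.to_pylist(), default=str).encode()
    return b"[]"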

Profiling the remaining latency reveals that parsing and deserializing Parquet metadata had become the dominant cost. To address this, we applied the same caching strategy used for data pages to metadata — storing the Parquet footer, column index, and offset index in a deserialized form within Alluxio. This final optimization reduces end-to-end query latency to just 0.297ms, as shown in Figure 3.

Figure 3: Point Query Latency. The baseline represents running a point query with pushdown enabled on Parquet files stored in S3. Single RPC Pushdown refers to pushing both predicate and projection evaluation entirely down to the Alluxio workers.
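
A minimal sketch of that final step, with lru_cache standing in for Alluxio's in-memory metadata cache: the footer is parsed once, and every subsequent lookup reuses the deserialized objects instead of re-reading and re-parsing Thrift-encoded bytes:

from functools import lru_cache

import pyarrow.parquet as pq

@lru_cache(maxsize=1024)
def cached_metadata(local_path: str) -> "pq.FileMetaData":
    # Parse the Thrift-encoded footer once; later point queries skip
    # deserialization entirely and go straight to row-group statistics.
    return pq.ParquetFile(local_path).metadata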

Rule of Thumb to Meet in the Middle

While we demonstrated these techniques primarily in the context of point queries, the broader architectural principle — moving lightweight computation closer to the data — can benefit many other workloads beyond simple lookups. However, this approach is not universally applicable. Not all workloads are suitable for offloading to the caching layer.

A general rule of thumb: computation should only be pushed down to the cache when the compute cost is low and the task is primarily I/O-bound. This is because storage nodes — including Alluxio workers — are optimized for fast data access but often have limited CPU resources.

Typical candidates for pushdown include:

  • I/O-bound maintenance tasks: Operations like Iceberg metadata cleanup, compaction, or tombstone pruning, where the work is dominated by reading and writing data rather than heavy computation.
  • Lightweight query operators: Simple database operations like point lookups or range scans with highly selective filters benefit significantly from predicate and projection pushdown. Aggregations such as COUNT or SUM are also good candidates, provided the operation is trivial and the output size is small. More compute-intensive operations like GROUP BY with large cardinality or complex aggregations should remain in the query engine, where compute resources are more abundant.

In short, the goal is to push down just enough logic to avoid unnecessary data movement — but not so much that it overwhelms the storage nodes with expensive computation.
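
For example, a selective COUNT is a good fit because the work is I/O-bound and only a single integer crosses the network. A hypothetical worker-side sketch, with names chosen for illustration:

import pyarrow.dataset as ds

def count_where(local_dir: str, lo: int, hi: int) -> int:
    """Scan locally cached pages; return one integer instead of raw rows."""
    dataset = ds.dataset(local_dir, format="parquet")
    filt = (ds.field("id") >= lo) & (ds.field("id") <= hi)
    return dataset.count_rows(filter=filt)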

5 Cost Comparison of Low Latency Storage Solutions

Achieving low latency at a low cost is one of the key benefits of using a caching solution like Alluxio. The low cost is derived from:

  1. Automatic, on-demand data retrieval and caching without additional, often manual, steps to copy data from the higher-latency persistent store to the cache. 
  2. Only retrieving and caching data that is needed by the workload significantly lowers the total capacity needed.

    By working with a multitude of customers with low-latency requirements, we have found that most workloads only utilize 20% of the data available in the persistent store. Alluxio, therefore, only needs to provision enough NVMe capacity to cache 20% of the total data, resulting in much lower storage costs.

Let’s look at an example and compare the total cost of deploying S3 Express and Alluxio to achieve low latency.

In this example, a 500 TB dataset uses S3 Standard as the high-reliability persistent data store. We compare the cost of using S3 Express One Zone and Alluxio distributed caching as temporary, low-latency data stores. We chose i3en.12xlarge EC2 instances, each with 4 x 7.5 TB NVMe drives for a total capacity of 30 TB per node; a small fleet of these provides more than enough storage for 20% of the total dataset size.

As you can see from the comparison below, caching data with Alluxio saves over $40,000/month compared to S3 Express One Zone, while delivering the same low-latency performance. Said another way, S3 Express One Zone costs over 4X more to achieve the same low latency.

* At the time of writing, S3 Express One Zone has a list price of $110/TB/Month.
** At the time of writing, on demand pricing for EC2 i3en.12xlarge instances with 30TB of NVMe capacity was $5.42/hour which calculates to $132/TB/Month.
*** At the time of writing, S3 Standard has a list price of $23/TB/Month.
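
As a back-of-envelope check of the figures above, using the cited list prices and assuming (our inference from the setup) that S3 Express One Zone holds the full 500 TB while Alluxio caches only the hot 20% on NVMe; the S3 Standard copy backs both options, so its cost cancels out of the comparison:

DATASET_TB = 500
HOT_FRACTION = 0.20  # working set observed across latency-sensitive customers

s3_express = DATASET_TB * 110              # $110/TB/month (*)
alluxio = DATASET_TB * HOT_FRACTION * 132  # $132/TB/month of NVMe capacity (**)

print(f"S3 Express: ${s3_express:,.0f}/month")            # $55,000
print(f"Alluxio:    ${alluxio:,.0f}/month")               # $13,200
print(f"Savings:    ${s3_express - alluxio:,.0f}/month")  # $41,800
print(f"Ratio:      {s3_express / alluxio:.1f}x")         # 4.2x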

6 Related Work

Caching data and metadata The concept of caching data closer to the compute framework has gained widespread adoption within the data lake ecosystem. However, most of these caches are specific to a particular compute engine, limiting the reuse of cached content across different computational frameworks.

Alternatively, Prammer et al. [3] have proposed storing metadata independently of the data file to facilitate better file-based caching. This approach would reduce the format-specific metadata caching logic in the cache server, but it necessitates modifications to the widely used data format and the conversion of a substantial amount of data from existing formats. Furthermore, this solution does not address the performance bottleneck caused by the serialization and deserialization of metadata content from disk, which becomes the key bottleneck after both data and metadata are cached.

Computational Storage Previously, storage vendors have proposed the notion of "Computational Storage" [2]. The idea is that data processing happens at the storage device level, minimizing data movement between storage and compute, thereby improving performance and enabling real-time data analysis. We share the same principle, but the implementation is quite different. Previous approaches to computational storage often involved custom controllers or coprocessors on the storage nodes to facilitate the processing of data. The offloaded tasks tend to be simpler and data-oriented because they operate at an abstraction below the file system level. Alluxio takes advantage of the inherent need to cache data in a disaggregated system, and harvests computational power that would otherwise sit idle. In addition, it has full visibility into the application layer, so it can perform tasks at the application level (query filtering), the file system level (compaction), or the byte level (block encryption).

Pushdown in Compute Engines Predicate and projection pushdown are popular techniques to reduce the data transferred in compute engines. Alluxio is unique in that it does not need multiple round trips to pinpoint the data, because it can push the logic down to where the data is located. This was not possible in previous systems such as Spark or Trino, making them unsuitable for ultra-low-latency workloads.

7 Conclusion and Future Work

Together, these enhancements enable millions of requests per minute on a single server with sub-millisecond latency, even at petabyte scale. Our results demonstrate a practical and cost-effective path for building object storage systems that meet the low-latency requirements of modern data-intensive applications.

We also identify several promising directions for future research and system improvements:

  • Scalability Across Large Partitioned Tables:
    Generalize the low-latency performance demonstrated here to large partitioned tables, where each partition may span multiple storage nodes due to its size.

  • Extending to Maintenance Workloads:
    Apply the meet-in-the-middle technique to background maintenance tasks such as Iceberg compaction and metadata cleanup, which are often I/O-bound.

  • Handling Dynamic Data:
    Investigate techniques to maintain low latency in data lakes that undergo frequent updates and invalidations, ensuring data freshness without performance degradation.

  • End-to-End AI Integration:
    Implement and evaluate this approach in a real-world AI inference pipeline, such as Retrieval-Augmented Generation (RAG), to measure end-to-end gains in latency and throughput.

References

[1] Apache Parquet. 2023. Apache Parquet File Layout (Image). https://parquet.apache.org/images/FileLayout.gif. Accessed: 2025-04-04.

[2] Antonio Barbalace and Jaeyoung Do. 2021. Computational Storage: Where Are We Today? In CIDR 2021.

[3] Martin Prammer, Xinyu Zeng, Ruijun Meng, Wes McKinney, Huanchen Zhang, Andrew Pavlo, and Jignesh M Patel. 2025. Towards Functional Decomposition of Storage Formats. (2025).

[4] Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. 2023. An Empirical Evaluation of Columnar Storage Formats. Proceedings of the VLDB Endowment 17, 2 (Oct. 2023), 148–161. https://doi.org/10.14778/3626292.3626298
