This blog post covers the history behind Trino replacing RubiX with Alluxio for file system caching. It explores the synergy between Trino and Alluxio, assesses which type of cache best suits various needs, and shares real-world examples of Trino and Alluxio adoption.
Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino was designed for data warehousing, ETL, and interactive analytics over large amounts of data, producing reports. These workloads are often classified as Online Analytical Processing (OLAP).
Alluxio is a high-performance data access layer for large-scale analytics and AI. Alluxio sits between compute frameworks (such as Trino and Apache Spark) and various storage systems (like Amazon S3, Google Cloud Storage, HDFS, and MinIO). Alluxio provides distributed caching and a unified namespace, simplifying data access across data centers, regions, and clouds.
1. RubiX is OUT, Alluxio is IN
Diagram 1: RubiX is OUT, Alluxio is IN (source)
RubiX, Qubole’s open-source lightweight data caching framework, was previously integrated into the Trino Hive connector to enable storage caching for Hive. However, over time, the open-source project became stale due to a lack of active maintenance and it was no longer adequate for modern use.
In March 2024, Trino announced a significant update to its object storage file system caching: “Catalogs using the Delta Lake, Hive, Iceberg, and soon Hudi connectors now get to access performance benefits from the new Alluxio-powered file system caching.”
The motivation for integrating Alluxio with Trino and the Hive connector was first presented by Alluxio developers at Trino Summit 2023 in Trino optimization with distributed caching on data lakes. We shared an overview of data caching, illustrating how large-scale companies like Uber and Shopee leveraged data caching to achieve substantial performance improvements. We then dove into the technical intricacies of cache invalidation, maximizing cache hit rates, cluster elasticity, cache storage efficiency, and data consistency.
Following months of collaboration and discussion between the Alluxio and Trino communities and contributors, the PR for file system caching support was merged, and Alluxio caching was officially added to Trino starting with version 439.
To incorporate file system caching into your catalogs using Delta Lake, Hive, or Iceberg connectors, refer to Trino’s documentation on File System Cache.
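As a rough illustration, enabling the cache comes down to a few catalog properties. The sketch below assumes an Iceberg catalog backed by a Hive metastore; the hostname, cache directory, and size limit are placeholders, and the property names should be verified against the File System Cache documentation for your Trino version.

```properties
# etc/catalog/iceberg.properties -- minimal sketch, values are placeholders
connector.name=iceberg
hive.metastore.uri=thrift://metastore-host:9083

# Enable the Alluxio-powered file system cache on each worker
fs.cache.enabled=true
# Local directory (ideally NVMe-backed) used to hold cached data
fs.cache.directories=/mnt/trino-cache
# Optional cap on how much disk space the cache may use
fs.cache.max-sizes=500GB
```

Cached data lives on each worker's local disk, which is why the comparison later in this post notes that the built-in cache's capacity is bound to local disk size.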
2. Trino + Alluxio
2.1 Why Do We Need Caching for Trino
Only a small fraction of the petabytes of data you store generates business value at any given time. Repeatedly scanning the same data and transferring it over the network consumes time, compute cycles, and network resources. The issue is compounded when disparate Trino clusters pull data across regions or clouds. In these circumstances, a caching solution can significantly reduce the latency and cost of your queries.
2.2 Trino on Alluxio
Alluxio connects Trino to various storage systems, providing APIs and a unified namespace for data-driven applications. Alluxio lets Trino access data regardless of the data source and transparently caches frequently accessed data (for example, commonly used tables) in Alluxio's distributed storage.
Diagram 2: Alluxio’s Architecture (source)
Alluxio provides a number of benefits as a distributed cache for Trino, including:
- Sharing cache data between Trino workers: By deploying a standalone Alluxio cluster, cache data can be shared among all the Trino workers, regardless of whether they belong to different Trino clusters. This allows for efficient use of cached capacity.
- Sharing cache data across Trino clusters and other compute engines: This allows for faster read and write operations when using Spark and Trino together for ETL processing and SQL queries. A common use case is data written by Spark being read by Trino; the shared Alluxio cache speeds up these operations.
- Resilient caching: Alluxio's data replication mechanism makes it possible to scale Trino down without losing cached data. This allows customers to run Trino on spot instances, potentially reducing costs, while keeping cached data available and minimizing the risk of data loss, even in the case of failures.
- Reduce I/O access latency: By co-locating Alluxio workers with Trino workers, data locality is improved and I/O access latency is reduced when storage systems are remote or the network is slow or congested. This is because the data is stored and processed within the same physical location, allowing for faster and more efficient access to the cached data.
- Unified data management: Alluxio’s unified API and global namespace can simplify data management for Trino workloads by providing a single interface to access data stored in different systems, regions, or clouds. This can reduce complexity and improve data accessibility, enabling Trino to access data stored in various locations as if it were stored in a single place.
- Data consistency: Alluxio utilizes a metadata-sync mechanism to keep the data stored in cache up to date, ensuring that the latest data is always easily accessible. This feature eliminates the need to manually update the cache and enables Trino to access the most current data, improving the performance and accuracy of the workloads.
Overall, Alluxio provides a comprehensive data caching solution for Trino workloads.
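For the standalone deployment described above, one common pattern (documented in Alluxio's guides for Presto and Trino) is to register the Alluxio client with the Hive connector through Hadoop configuration resources, so that table locations using the alluxio:// scheme are read through the Alluxio cluster. The hostnames and paths below are illustrative placeholders, not a drop-in configuration.

```properties
# etc/catalog/hive.properties -- illustrative sketch only
connector.name=hive
hive.metastore.uri=thrift://metastore-host:9083

# Point the connector at Hadoop config files that register the Alluxio client,
# e.g. a core-site.xml containing:
#   fs.alluxio.impl = alluxio.hadoop.FileSystem
hive.config.resources=/etc/trino/conf/core-site.xml

# Tables whose location looks like
#   alluxio://alluxio-master:19998/datasets/events
# are then served through the Alluxio distributed cache.
```

Depending on your Trino version, legacy Hadoop file system support may need to be enabled for this approach; check the Alluxio and Trino documentation for the exact setup.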
3. Which Cache Fits Your Needs?
The table below compares the Alluxio-powered file system cache built into Trino with a standalone Alluxio cluster serving as a distributed cache:

| | Trino file system cache (Alluxio library) | Alluxio distributed cache |
| --- | --- | --- |
| Maintainers | Actively maintained by the Alluxio and Trino communities | Actively maintained by the Alluxio community |
| Availability | Trino 439 and later | Available since 2015 |
| Deployment | A library running inside the Trino worker process | A standalone service running in independent processes |
| Cache capacity | Uses local NVMe disk or memory; bounded by local disk capacity | Scales horizontally with the size of the cache cluster |
| Cache sharing | Cached data is accessible only to the local Trino worker process | Cached data is shareable across Trino clusters, as well as Spark and other frameworks |
| APIs | TrinoFileSystem, internal to Trino | HadoopFileSystem, S3, POSIX (GA), Python FSSpec (experimental) |
Jianjian Xie, Staff Software Engineer at Alluxio and an active open-source contributor to Alluxio and Trino, provided an in-depth exploration of Trino data caching strategies. He presented the latest test results and discussed the multi-level caching architecture that accelerates Trino's performance by 10x for data lakes of any scale.
View the recording and slides here.
4. Real-world Use Cases
4.1 Unifying Cross-region Data Lake at Expedia Group
Expedia Group needed the ability to query S3 across regions while simplifying the interface to their local data sources. They implemented Alluxio to federate cross-region data lakes in AWS. Alluxio unifies geo-distributed data silos without replication, delivering consistent, high performance while reducing costs by roughly 50%.
Diagram 3: Expedia Group Data Lake (source)
Read more here: Unify Data Lakes Across Multi-Regions in the Cloud.
4.2 Using Alluxio to Power Analytics Queries at Razorpay
Razorpay is a large fintech company in India that provides a fast, affordable, and secure way to accept and disburse payments online. On the engineering side, the availability and scalability of the analytics infrastructure are crucial to providing seamless experiences to internal and external users. Razorpay's Trino + Alluxio clusters handle 650 daily active users and 100k queries per day, with 60-second P90 and 130-second P95 query latencies and a 97% success rate.
Diagram 4: Razorpay Data Platform (source)
Check out their blog post here: How Trino and Alluxio Power Analytics at Razorpay.
4.3 Dune: Reducing Query Cost and Query Runtimes of Trino
Jonas Irgens Kylling (Software Engineer) and Florent Delannoy (Infrastructure Engineer) of Dune Analytics spearheaded the initiative for file system caching support, building upon the initial PR created by an Alluxio developer. Dune Analytics engineers dedicated numerous hours to coding, reviewing, testing, and engaging in extensive discussions on Slack.
At Trino Fest 2024 this summer, Jonas delivered a presentation on Reducing query cost and query runtimes of Trino powered analytics platforms. He shared key optimizations they implemented to enhance query performance and user experience, including leveraging file system caching with Alluxio, advanced cluster management techniques, and methods for storing, sampling, and filtering query results.
The outcomes of implementing file system caching were impressive. The team at Dune achieved approximately 20% faster TPC query execution across benchmarks and production workloads, a 30% speedup in the analysis phase of TPC query execution for Iceberg tables, and a remarkable 70% reduction in S3 GET requests.
Diagram 5: File system caching results (source)
To watch Jonas’ presentation, here is the video recording and slides.
5. Learn More
To learn more about Alluxio, join our growing Alluxio Community Slack to ask any questions and provide your feedback: https://alluxio.io/slack.
To learn more about the Trino community, go to https://trino.io/community.html.