Running Presto/Trino in a Hybrid Cloud Architecture

July 20, 2020

Adit Madan

Migrating SQL workloads from a fully on-premise environment to cloud infrastructure has numerous benefits, including alleviating resource contention and reducing costs by paying for computation resources on an on-demand basis. In the case of Presto running on data stored in HDFS, the separation of compute in the cloud and storage on-premises is apparent since Presto’s architecture enables the storage and compute components to operate independently. The critical issue in this hybrid environment of Presto in the cloud retrieving HDFS data from an on-premise environment is the network latency between the two clusters.

This crucial bottleneck severely limits the performance of any workload since a significant portion of its time is spent transferring the requested data between networks that could be residing in geographically disparate locations. As a result, most companies copy their data into a cloud environment and maintain that duplicate data, also known as Lift and Shift. Companies with compliance and data sovereignty requirements may even prevent organizations from copying data into the cloud. This approach is not scalable and requires introducing a lot of manual effort to achieve reasonable results. This article introduces Alluxio to serve as a data access layer to help serve data to Presto efficiently, as opposed to either directly querying the distant HDFS cluster or manually providing a localized copy of the data to Presto in a cloud cluster.

Hybrid Cloud Architecture with Alluxio and Presto

In the following architecture diagram, both Presto and Alluxio processes are co-located in the cloud cluster. As far as Presto is concerned, it is querying for and writing data to Alluxio as if it were a co-located HDFS cluster. When Alluxio receives a request for data, it fetches the data from the remote HDFS cluster initially, but subsequent requests will be served directly from its cache. When Presto sends data to be persisted into storage, Alluxio asynchronously writes data to HDFS, freeing the Presto workload from needing to wait for the remote write to complete. In both read and write scenarios, with the exception of the initial read, a Presto workload is able to run at the same, if not faster, performance as if it were in the same network as the HDFS cluster. Note that besides the deployment and configuration of Alluxio and establishing the connection between Presto and Alluxio, there is no additional configuration or other manual efforts needed to maintain the hybrid environment.

Benchmarking Performance

For benchmarking, we run SQL queries on data in a geographically separated Hive and HDFS cluster.

The hybrid cloud environment used for experimentation in this section includes two Amazon EMR clusters in different AWS regions. Because the two clusters are geographically dispersed, there is noticeable network latency between the clusters. VPC peering is used to create VPC connections to allow traffic between the two VPCs over the global AWS backbone with no bandwidth bottleneck. Readers can follow the tutorial in the whitepaper to reproduce the benchmark results if using AWS as the cloud provider.

We used data and queries from the industry standard TPC-DS benchmark for decision support systems that examine large amounts of data and answer business questions. The queries can be categorized into the following classes (according to visualizations in this repository): Reporting, Interactive, and Deep Analytics.

With Alluxio, we collected two numbers for all TPC-DS queries; denoted by Cold and Warm.

Cold is the case where data is not loaded in Alluxio storage before the query is run. In this case, Alluxio fetches data from HDFS on-demand during query execution.
Warm is the case where data is loaded into Alluxio storage after the Cold run. Subsequent queries accessing the same data do not communicate with HDFS.

With HDFS, we collected two numbers as well; Local and Remote.

Local is the case where Presto and HDFS are co-located in the same region. This number shows us the performance of running the compute on-premises when data is local without bursting into the cloud.
Remote is the case where Presto reads from storage in another region.

TPC-DS Data Specificiations

Scale FactorFormatCompressionData SizeNumber of Files1000ParquetSnappy463.5 GB234.2 K

EMR Instance Specificiations

Instance TypeMaster Instance CountWorker Instance CountAlluxio Storage Volume (us-west-1)HDFS Storage Volume (ap-southeast-1)r5.4xlarge1 each10 eachNVMe SSDEBS

We compared the performance of Presto with Alluxio (Cold and Warm) with Presto directly on HDFS (Local and Remote). Benchmarking shows an average of 3x improvement in performance with Alluxio when the cache is warm over accessing HDFS data remotely.

The following table summarizes the results by class. Overall the maximum improvement seen with Alluxio was for q9 (7.1x) and the minimum was for q39a (1x - no difference).

Query Class: Reporting Max Improvement: q27 (3.1x) Min Improvement: q43 (2.7x)

Query Class: Interactive Max Improvement: q73 (3.9x) Min Improvement: q98 (2.2x)

Query Class: Deep Analytics Max Improvement: q34 (4.2x) Min Improvement: q59 (1.9x)

With a 10 node compute cluster, the peak bandwidth utilization throughout running all the queries remained under 2Gbps when accessing data from the geographically separated cluster. Bandwidth was not the bottleneck with the AWS backbone network. As the utilization scales with the size of the compute cluster, a bandwidth bottleneck could be expected for larger clusters when not using Alluxio since the bandwidth available with Direct Connect may be limited.

Most of the performance gain seen with Alluxio is explained by the latency difference for both metadata and data, when cached seamlessly into the localized Alluxio cluster.

Conclusion

A hybrid cloud architecture allows cloud computing resources to be used for data analytics, even if the data resides in a completely different network. In addition to achieving significantly better performance, the execution plan outlined does not require any significant reconfiguration of the on-premise infrastructure. Since users can harness the compute power of a public cloud, this opens up more opportunities for Presto to be utilized as a scalable and performant compute framework for analytics using data stored on-premises.

An in-depth whitepaper, “Zero-Copy” Hybrid Cloud for Data Analytics - Strategy, Architecture, and Benchmark Report, was originally published by Alluxio on the Alluxio Engineering Blog on April 6, 2020. Check out the blogs for more articles about Alluxio’s engineering work and join Alluxio Open Source community on Slack for any questions you might have.

Share this post

Blog

New Features in Alluxio Enterprise AI 3.6

How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:

Time-consuming data preparation and data copy/movement
Difficulty utilizing GPU resources efficiently
High and growing storage costs
Excessive operational overhead maintaining storage for localized data silos

To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo