Building a Large-scale Interactive SQL Query Engine using Presto and Alluxio in JD.com

September 25, 2019

As the largest online and offline retailer in China, JD.com is running a data platform with more than 40,000 nodes, running more than 1 million jobs per day, managing over 650PB of data. One of the most important services on this data platform is OLAP serving more than 500,000 queries every day for data analysis across different businesses. This article describes how JD built this interactive OLAP platform combining two open-source technologies: Presto and Alluxio.

Driven by the fast-growing business requirements, JD engineers have implemented the first large-scale enterprise-level interactive query engine in China. To achieve this, we built the largest known Presto cluster in China based on JDPresto leveraging its low latency and high concurrency. On top of that, we deployed Alluxio with Presto as a fault-tolerant, pluggable caching service to reduce network traffic for low-latency data access. By combining Presto and Alluxio, we improved ad-hoc query latency by 10x. This stack of JDPresto on Alluxio has been running on more than 400 nodes in production for more than 2 years, covering from mall APP to WeChat mobile QQ, to offline data analysis platform. It has significantly improved the user experience for over 1 million users and helped with more precise marketing across tens of thousands of merchants on JD.

1. Platform Architecture and Its Challenges

Figure 1 shows the architecture of JD's Big Data Platform:

The entire cluster has over 40,000 servers where 25,000 plus nodes are running batch processing jobs serving more than 13,000 users;
There are more than 1 million batch processing jobs, processing greater than 40PB data daily;
The total data volume exceeds 650PB, with daily data increase surpassing 800TB, and the total storage capacity reaches 1EB;
More than 40 business products with over 450 data models are running on this platform.

On this platform, HDFS is the foundation serving data to the entire platform with data pipelines using different computation frameworks orchestrated by YARN. Due to their enormous scale, we have observed issues in achieving good data locality which significantly impacts the performance of jobs running on Preso when reading from HDFS. As a result, we need a bridge between the Presto and HDFS but isolating the resource and failures.

Figure 2 shows how deploying Alluxio helps Presto workers read data more efficiently. As shown on the left, before using Alluxio, JDPresto workers experience low I/O performance when reading the input data from HDFS for two reasons:

HDFS is serving data several magnitudes larger than the amount of data consumed by each individual Presto cluster;
The platform is overutilized to the point where YARN is unable to schedule Presto jobs on its local HDFS datanote. As a result, Presto queries end up reading data from remote HDFS datanode and experience delay due to network latency.

After deploying Alluxio workers together with Presto workers (as shown on the right in Figure 2), Presto initially will read data from HDFS, likely a remote datanode, but the data will be cached on a local Alluxio worker on the same node. The cached data can avoid or eliminate remote read overhead from HDFS for subsequent data access. In short, Alluxio guarantees query performance by decoupling Presto performance from HDFS because data locality is achieved.

We also modified the way to split read paths in JDPresto correspondingly as shown in Figure 3. Normally, Presto first checks if the input data is in Alluxio. In case Alluxio service is unavailable (e.g., down for maintenance), Presto can directly access HDFS without any user intervention. We also extended Alluxio to provide explicit consistency verification against HDFS.

2. Performance Evaluation

We benchmark two Presto clusters with different configurations but running the same SQL query multiple times. The results viewed in the terminals are shown in Figure 4. On the left, JDPresto directly accesses HDFS as the baseline whereas, on the right, JDPresto uses Alluxio as a distributed cache backed by HDFS.

The numbers in Figure 4 in red boxes represent the execution time. For both configurations, the first query reads directly from HDFS. This initial query is slower on the right side because Alluxio is caching tables and partition files. In subsequent queries, Alluxio speeds up the execution time by at least 10x.

Figure 5 summarizes the results of this test. A total of 6 comparison queries were performed in this test. The green line represents JDPresto on HDFS and the yellow line represents the stack of JDPresto on Alluxio. The X-axis represents the index of queries, and the Y-axis represents the completion time of these queries in seconds. We can see that JDPresto-on-Alluxio can reduce the read time after the first read, much faster than the JDPresto cluster.

3. Summary

In order to achieve the optimization described in this section, we have done a lot of work including extending Alluxio and JDPresto and developing some testing tools. In addition, we have done some work around App-on-Yarn. At the same time, we are actively researching and evaluating and applying Alluxio to other computing frameworks. In building and maintaining our big data platform, JD’s team of big data engineers also made many contributions to the Alluxio open source community.

Share this post

Blog

New Features in Alluxio Enterprise AI 3.6

How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:

Time-consuming data preparation and data copy/movement
Difficulty utilizing GPU resources efficiently
High and growing storage costs
Excessive operational overhead maintaining storage for localized data silos

To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo