Building a Large-scale Interactive SQL Query Engine using Presto and Alluxio in

As the largest online and offline retailer in China, is running a data platform with more than 40,000 nodes, running more than 1 million jobs per day, managing over 650PB of data. One of the most important services on this data platform is OLAP serving more than 500,000 queries every day for data analysis across different businesses. This article describes how JD built this interactive OLAP platform combining two open-source technologies: Presto and Alluxio.

Driven by the fast-growing business requirements, JD engineers have implemented the first large-scale enterprise-level interactive query engine in China. To achieve this, we built the largest known Presto cluster in China based on JDPresto leveraging its low latency and high concurrency. On top of that, we deployed Alluxio with Presto as a fault-tolerant, pluggable caching service to reduce network traffic for low-latency data access.

By combining Presto and Alluxio, we improved ad-hoc query latency by 10x. This stack of JDPresto on Alluxio has been running on more than 400 nodes in production for more than 2 years, covering from mall APP to WeChat mobile QQ, to offline data analysis platform. It has significantly improved the user experience for over 1 million users and helped with more precise marketing across tens of thousands of merchants on JD.

1.  Platform Architecture and Its Challenges

Figure 1 shows the architecture of JD’s Big Data Platform:

  • The entire cluster has over 40,000 servers where 25,000 plus nodes are running batch processing jobs serving more than 13,000 users;
  • There are more than 1 million batch processing jobs, processing greater than 40PB data daily;
  • The total data volume exceeds 650PB, with daily data increase surpassing 800TB, and the total storage capacity reaches 1EB;
  • More than 40 business products with over 450 data models are running on this platform.
Figure 1

On this platform, HDFS is the foundation serving data to the entire platform with data pipelines using different computation frameworks orchestrated by YARN. Due to their enormous scale, we have observed issues in achieving good data locality which significantly impacts the performance of jobs running on Preso when reading from HDFS. As a result, we need a bridge between the Presto and HDFS but isolating the resource and failures.

Figure 2

Figure 2 shows how deploying Alluxio helps Presto workers read data more efficiently. As shown on left, before using Alluxio, JDPresto workers experience low I/O performance when reading the input data from HDFS for two reasons: (1) HDFS serving the data is running at a much larger scale (tens of thousands) compared to the scale of each individual Presto cluster (typically a few hundred); (2) our platform too very busy for YARN to schedule Presto jobs on HDFS datanodes with input data local. As a result, Presto queries have a high chance of reading data from remote HDFS datanodes and experience delay due to network latency. 

After deploying Alluxio workers together with Presto workers (as shown on the right in Figure 2), Presto can first read data from HDFS (likely a remote datanode), and then cache data from Alluxio worker on the same node. The cached data can avoid or eliminate remote read overhead from HDFS for subsequent data access. In short, Alluxio brings more data locality in our environment and makes Presto performance independent of HDFS, which guarantees query performance.

We also modified the way to read splits in JDPresto correspondingly as shown in Figure 3. Normally, Presto first checks and reads if the input data is in Alluxio. In case Alluxio service is unavailable (e.g., down for maintenance), Presto can skip Alluxio and directly access HDFS which is transparent to users. We also extended Alluxio to provide explicit consistency verification between Alluxio and HDFS.

Figure 3

2. Performance Evaluation

Our benchmark two Presto clusters with different configurations but running the same SQL query multiple times. The results viewed in the terminals are shown in Figure 4. On the left, JDPresto directly accesses HDFS as the baseline whereas, on the right, JDPresto uses Alluxio as a distributed cache backed by HDFS.

Figure 4

The numbers in Figure 4 in red boxes represent the execution time. On the right side, reading HDFS directly is much faster than the left side. This is because Alluxio caches tables and partition files. After the first access, Alluxio can speed up the query by 10x or more.

Figure 5

Figure 5 summarizes the results of this test. A total of 6 comparison queries were performed in this test. The green line represents JDPresto on HDFS and the yellow line represents the stack of  JDPresto on Alluxio. The X-axis represents the index of queries, and the represents the completion time of these queries in seconds. We can see that JDPresto-on-Alluxio can reduce the read time after the first read, much faster than the JDPresto cluster.

3. Summary

In order to achieve the optimization described in this section, we have done a lot of work including extending Alluxio and JDPresto and developing some testing tools. In addition, we have done some work around App-on-Yarn. At the same time, we are actively researching and evaluating and applying Alluxio to other computing frameworks. In building and maintaining our big data platform, JD’s team of big data engineers also made many contributions to the Alluxio open source community.