Accelerating Analytics by 200% with Impala, Alluxio, and HDFS at Tencent

June 22, 2020

Honghan Tian

In this article, Honghan Tian describes how engineers in the Data Service Center (DSC) at Tencent PCG (Platform and Content Business Group) leverages Alluxio to optimize the analytics performance and minimize the operating costs in building Tencent Beacon Growing, a real-time data analytics platform.

Specifically, by using Alluxio as a distributed caching layer, the HDFS scanning performance for Impala SQL queries is accelerated by 2x out of the box without further optimizations.

Tencent’s Big Data Analytic App

Beacon Growing is an analytics platform that provides analyzers insights of user behaviors in realtime. Beacon comes with both built-in and customized analytic models, such as RETENT, FUNNEL, PATH, etc. Combining the functions of A/B testing and user profiles, developers can select promotional plans more easily.

Under the hood, Beacon is powered by a stack of open source technologies. Particularly for the context of this article, we used Impala to query data in Parquet format stored in HDFS. Here is the overview of the software stack currently being deployed by Beacon, to give the readers a better understanding of the system.

ComponentVersionAlluxio2.2.0Apache Impala3.4Parquet-MR1.11.0HDFS3.xZookeeper3.5.6

Challenges Before

In Tencent, we build systems to handle data at the Internet-scale. For example, the scan range of a typical query can be 100 billion rows, with hundreds of concurrent queries. To speed up the queries, we have applied optimizations like page index in Parquet files to reduce the I/O cost in scan operations. In practice, there were still a few issues observed:

I/O Bottleneck under high concurrency: When more than 100 queries are running concurrently and scanning HDFS, we observed that the usage of host disk was fixed at 100%.
Expensive writes: Insert operations such as “SORT BY” during table creation are very slow due to files merge under HDFS-parquet sort. Note that in our application, “SORT BY” helps meet query SLAs without over partitioning the table.

Why Alluxio

Based on the above observations, we decided to build a caching layer using SSDs. However, HDFS is not designed to leverage heterogeneous storage. We deployed Alluxio to leverage the speed advantage of SSDs and the capacity advantage of HDDs. The new architecture is shown in the following figure:

There are a few changes and improvements observed after integrating Alluxio into Tencent Beacon infrastructure:

Better Resource Utilization: The range of the average load of HDFS was significantly reduced from 80% ~ 100% down to 50% ~ 70%. The I/O time is also reduced by 50% or more.
Higher Query Performance: With faster data retrieval, I/O-intensive queries are sped up by 244% with a median improvement of 121% across all queries.
Higher Cluster stability: Alluxio greatly reduces the I/O consumption of hotspot and large queries to the entire HDFS cluster, reducing the overall query failure rate by more than 5%
Reduced Timeout Failure Rate: Faster execution also reduces the query time in waiting-queue, resulting in 29% reduction in the timeout query failure rate.

One of the most valuable features of Alluxio in our use case is Unified Namespace. It helps share the same namespace across multiple clusters, making it easier for Impala to access the data resources of multiple HDFS clusters.

The following table is a comparison of some data：

time-out SQLBefore accelerationAfter accelerationratio19363s166s1.1813514s162s2.178155s70s1.21343s37s0.1614316s22s13.316311s115s1.7015169s119s0.42

Next steps: Locality Groups

There are a few on-going projects on the horizon to integrate Alluxio at Tencent PCG.

First, we plan to further reduce the operating work by deploying a large Alluxio cluster with worker processes spanning multiple Impala clusters. In other words, Alluxio and Impala processes will be co-located at the same server to reduce network traffic. We are currently experiencing some issues and working with the core team to address them.

Second, we want to leverage the Alluxio tiered locality group to further improve Impala performance and eliminate remote reads of Alluxio workers with improved data locality. Specifically, with one but much larger Alluxio cluster deployed, we will divide Alluxio workers nodes into different ‘locality groups’. As a result, when queries read data from different HDFS clusters, data will be cached into the nearest locality group to reduce I/O cost. This feature is an analogy of the “layer” configuration” in Clickhouse.

Third, once locality grouping described above is achieved, locally deployed Impala computing groups can be used to elastically expand or shrink computing resources based on different Impala clusters for different business uses.

Conclusion

In our project “Beacon Growing”, we have deployed Alluxio to improve Impala performance by 2.44x for IO intensive queries and 1.20x for all queries. The query failure rate due to timeout is also reduced by 29%. In the future, we foresee it can reduce disk utilization by over 20% for our planned elastic computing on Impala.

Share this post

Blog

New Features in Alluxio Enterprise AI 3.6

How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:

Time-consuming data preparation and data copy/movement
Difficulty utilizing GPU resources efficiently
High and growing storage costs
Excessive operational overhead maintaining storage for localized data silos

To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo