This post is guest authored by our friends at Tencent: Can He
Download or print the case study here
Tencent is one of the largest technology companies in the world and a leader in multiple sectors, including social networking, gaming, e-commerce, mobile, and web portals. Tencent News, one of Tencent’s many offerings, strives to deliver a rich, timely news application that gives users an efficient, high-quality reading experience. To provide the best experience to Tencent News’s more than 100 million monthly active users, we leverage Alluxio with Apache Spark to create a scalable, robust, and performant architecture.
Our goal at Tencent News is to deliver the best experience for every user, which requires our jobs to complete on the order of seconds. Before adopting Alluxio, we ran Spark jobs on a computation cluster of 150 dedicated servers and pulled data from an HDFS cluster located outside this computation cluster in our data center.
This architecture ensures that our Spark jobs have exclusive use of the dedicated computation resources, providing performance isolation. On the other hand, it creates data access challenges when the data to pull is large or the machines and network are under heavy load. During peak times, when the data required by a Spark processing job grows or the cluster carries a heavy workload, resource contention prevents Spark jobs from reliably caching the data as RDDs in memory, forcing them to read from disk instead.
This results in slow job completion or, worse, failures that require relaunching the job and reloading the data, a lengthy process. Both outcomes are unacceptable given our business requirements for customer experience.
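To make the contention concrete, here is a minimal PySpark sketch of our original pattern. It is illustrative only: the HDFS address, path, and job name are hypothetical, not our production values.

```python
from pyspark import SparkContext

sc = SparkContext(appName="news-processing-job")

# Pull input from the remote HDFS cluster (address and path are hypothetical).
events = sc.textFile("hdfs://remote-namenode:8020/news/events/2018-10-01")

# MEMORY_ONLY cache inside the Spark executors: under resource contention,
# evicted partitions are recomputed by re-reading remote HDFS.
events.cache()
events.count()  # materialize the cache
```

Because the cached partitions live inside the Spark executors, any memory pressure or job failure sends reads back across the network to HDFS.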

As a result, we needed a data solution to ensure that Spark jobs can read data with high and stable performance. We surveyed the available technologies and found Alluxio to be the missing piece of our architecture.
We deploy Alluxio as the in-memory data layer for Spark jobs, running it on each machine of the computation cluster, and change our Spark jobs to read from and write to Alluxio instead of caching data inside Spark processes (the sketch after the list below shows this change). The benefit is two-fold:
- By decoupling storage from computation, Alluxio stores the data pulled from HDFS when it is accessed for the first time, then serves it locally on the nodes where the Spark compute runs. As a result, performance is much higher than before, while also providing an SLA guarantee that meets our stringent requirements on job completion time.
- The Alluxio deployment can be scaled up or down dynamically according to the available memory resources. Adjusting this layer between compute and storage is transparent to applications and independent of the size of the data Spark needs to compute.
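As a rough sketch of this change, assuming the Alluxio client jar is on Spark’s classpath and the HDFS data is mounted under the Alluxio namespace (the master address, port, and paths are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="news-processing-job")

# The first access loads the data from HDFS into Alluxio; subsequent reads
# are served from Alluxio worker memory on the same nodes as the Spark job.
events = sc.textFile("alluxio://alluxio-master:19998/news/events/2018-10-01")

# No events.cache() here: Alluxio, not Spark executor memory, holds the data.
counts = (events
          .map(lambda line: (line.split("\t")[0], 1))
          .reduceByKey(lambda a, b: a + b))

# Write results back through Alluxio as well.
counts.saveAsTextFile("alluxio://alluxio-master:19998/news/results/2018-10-01")
```

Because the hot copy lives in Alluxio rather than in Spark executor memory, a relaunched job re-reads the already-loaded data from local memory instead of going back to remote HDFS.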

With this new architecture, we now have a highly scalable, predictable, and performant platform serving the critical mission of Tencent News to our user base. We have been running Alluxio on over 600 nodes, and we plan to expand the footprint further.

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for its AI/ML model training. This architecture introduced significant challenges, including:
- Time-consuming data preparation and data copy/movement
- Difficulty utilizing GPU resources efficiently
- High and growing storage costs
- Excessive operational overhead maintaining storage for localized data silos
To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.
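The unified-access-path idea can be illustrated with a small, purely hypothetical Python sketch; the cache mount point, data-lake prefix, and resolve() helper below are assumptions for illustration, not Coupang’s actual system.

```python
import os

# Hypothetical names: the cache mount point and data-lake prefix are
# illustrative, not Coupang's actual configuration.
CACHE_MOUNT = "/mnt/data-cache"               # cache mounted into each k8s pod
DATA_LAKE_PREFIX = "s3://central-data-lake/"  # authoritative copy

def resolve(uri: str) -> str:
    """Return the cached local path for a data-lake URI when the cache
    holds a copy; otherwise fall back to the data lake itself."""
    relative = uri.removeprefix(DATA_LAKE_PREFIX)
    cached = os.path.join(CACHE_MOUNT, relative)
    return cached if os.path.exists(cached) else uri

# Model code always names the data-lake URI, so the same training job runs
# unchanged on any Kubernetes cluster where the cache is mounted.
path = resolve("s3://central-data-lake/training/images/part-0001.parquet")
```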

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their Medium post, Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) while scaling an analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs’ Cloud-Native Application Protection Platform (CNAPP) solution.