Millions Saved Annually: Unleashing the Power of Alluxio + HDFS at Uber

In October 2022, Uber’s Presto team described in a blog post how they use the Alluxio SDK cache to boost Presto query performance and cost efficiency. This achievement is a major milestone in the collaboration between Alluxio and Uber. Thus far, the Uber Presto team has implemented the Alluxio SDK cache in three production clusters spanning more than 1,500 nodes. With the Alluxio SDK cache, Uber has observed a 10% decrease in data read traffic to their HDFS cluster and a 50% reduction in input read latency, leading to faster insights for Uber’s business.

Recently, we’ve taken another exciting step forward. Uber’s HDFS team has published a blog post detailing our joint project to optimize the performance of HDFS DataNodes. The project used the Alluxio SDK cache to manage SSD storage on each DataNode, resulting in improved performance and a better return on investment. Although the SSD cache occupies only 0.6% of the total disk space, it handles 60% of the overall client traffic. You can read the full story on Uber’s Engineering blog: Optimizing HDFS with DataNode Local Cache.

The blog post provides a deep dive into Uber’s efforts to optimize their Hadoop Distributed File System (HDFS) deployment, one of the largest in the world, housing exabytes of data across tens of clusters. The primary objective was to strike a balance between efficiency, service reliability, and high performance as they scale their data infrastructure.

Uber’s strategy was to adopt higher-density HDD SKUs (16+ TB) to replace the 4 TB HDDs still used by the majority of their HDFS clusters. This move was projected to save tens of millions of dollars annually. However, adopting high-density disk SKUs presented challenges, particularly with disk IO bandwidth.
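To see why denser disks strain IO bandwidth, consider a rough back-of-the-envelope calculation. It assumes that per-disk bandwidth stays about the same as capacity grows. The 200 MB/s figure below is a hypothetical placeholder, not a number from Uber's post, and it cancels out of the final ratio.

```python
def bandwidth_per_tb(disk_bandwidth_mb_s: float, capacity_tb: float) -> float:
    """Return the bandwidth available per TB of capacity on a single disk."""
    return disk_bandwidth_mb_s / capacity_tb


# Hypothetical per-disk bandwidth, assumed identical for both SKUs.
ASSUMED_BANDWIDTH_MB_S = 200.0

old_sku = bandwidth_per_tb(ASSUMED_BANDWIDTH_MB_S, capacity_tb=4)
new_sku = bandwidth_per_tb(ASSUMED_BANDWIDTH_MB_S, capacity_tb=16)

# How many times less bandwidth each stored TB gets on the denser disk.
ratio = old_sku / new_sku
print(f"4 TB disk:  {old_sku:.1f} MB/s per TB")
print(f"16 TB disk: {new_sku:.1f} MB/s per TB")
print(f"Bandwidth per TB shrinks by a factor of {ratio:.0f}")
```

Under that assumption, each terabyte on a 16 TB disk gets a quarter of the bandwidth it had on a 4 TB disk, so data that is read often is more likely to contend for the same spindle.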

To address this, Uber implemented a read-only SSD cache within each DataNode to store frequently accessed data and serve read requests. This new feature, deployed alongside the 16TB HDD SKUs in production, significantly reduced the IO workload on HDDs, offloaded up to 60% of traffic from the HDDs, doubled read performance, and reduced the chance of process blocking on read by about one-third.
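The general idea of a read-only, read-through cache in front of slower storage can be sketched in a few lines. This is an illustrative toy, not Uber's or Alluxio's implementation: the class name, the `read_from_hdd` callback, and the LRU eviction policy are all assumptions made for the example, and the "SSD" here is just an in-memory dictionary.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Toy read-only cache: serve hits locally, fetch misses from slow storage."""

    def __init__(self, capacity_bytes: int, read_from_hdd: Callable[[str], bytes]):
        self._capacity = capacity_bytes
        self._read_from_hdd = read_from_hdd  # hypothetical slow-path reader
        self._blocks = OrderedDict()         # block_id -> bytes, oldest first
        self._used = 0
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str) -> bytes:
        if block_id in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Cache miss: read from the slow tier, then try to keep a copy.
        self.misses += 1
        data = self._read_from_hdd(block_id)
        self._admit(block_id, data)
        return data

    def _admit(self, block_id: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # Block can never fit; skip caching it.
        # Evict least recently used blocks until the new one fits.
        while self._used + len(data) > self._capacity:
            _, evicted = self._blocks.popitem(last=False)
            self._used -= len(evicted)
        self._blocks[block_id] = data
        self._used += len(data)


# Usage: three 4-byte blocks on the "HDD", room for two in the cache.
hdd = {"a": b"x" * 4, "b": b"y" * 4, "c": b"z" * 4}
cache = ReadThroughCache(capacity_bytes=8, read_from_hdd=hdd.__getitem__)
for block in ["a", "a", "b", "a", "c", "b"]:
    cache.read(block)
print(f"hits={cache.hits} misses={cache.misses}")
```

Because the cache only ever holds copies of data that also lives on the HDDs, losing a cached block costs a slower read rather than data loss. A production cache would also need to address the concerns the Uber post lists, such as write races and SSD write endurance, which this sketch ignores.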

The blog post also outlines several challenges encountered during the adoption of the local SSD cache, including cache hit rate, write race conditions, SSD write endurance, failure handling, and production readiness validation.

For the implementation, Uber leveraged the Alluxio SDK cache for performance and efficiency. Alluxio serves as a data caching layer and provides Hadoop-compatible file system APIs that seamlessly integrate with Hadoop-compatible compute engines. Alluxio also implements Java standard file I/O APIs, enabling smooth integration with HDFS.

The blog post concludes with a discussion of future work, including traffic/dataset tier-based caching and performance optimization.

Overall, the (Alluxio) local caching solution has proven to be effective in improving HDFS’s efficiency and enhancing the overall performance.
– Chen Liang, Jing Zhao, Yangjun Zhang, Junyan Guo, Fengnan Li @ Uber Engineering

At Alluxio, we’re proud of our strides in collaboration with Uber and are eager to continue working together on future projects. Our shared success story is a testament to the power of innovative solutions and strong partnerships. We look forward to more groundbreaking collaborations, pushing the boundaries of data performance and driving the future of efficient data management together.