Millions Saved Annually: Unleashing the Power of Alluxio HDFS at Uber

May 29, 2023

Bin Fan

In October 2022, Uber's Presto team described in a blog post how they use the Alluxio SDK cache to boost Presto query performance and cost efficiency. That work was a major milestone in the collaboration between Alluxio and Uber. So far, the Uber Presto team has deployed the Alluxio SDK cache in three production clusters spanning more than 1,500 nodes. With the cache in place, Uber has observed a 10% decrease in data read traffic to its HDFS cluster and a 50% reduction in input read latency, leading to faster insights for Uber’s business.

Recently, we took another exciting step forward. Uber's HDFS team has published a blog post detailing our joint project to optimize the performance of HDFS DataNodes. The project uses the Alluxio SDK cache to manage SSD storage on each DataNode, improving both performance and return on investment. Although the SSD cache occupies only 0.6% of the total disk space, it handles 60% of the overall client traffic. You can read the full story on Uber's Engineering blog: Optimizing HDFS with DataNode Local Cache.

The blog post provides a deep dive into Uber's efforts to optimize their Hadoop Distributed File System (HDFS) deployment, one of the largest in the world, housing exabytes of data across tens of clusters. The primary objective was to strike a balance between efficiency, service reliability, and high performance as they scale their data infrastructure.

Uber's strategy involved adopting higher-density HDD (16+TB) SKUs to replace the 4TB HDDs still used by the majority of their HDFS clusters. This move was projected to save tens of millions of dollars annually. However, adopting high-density disk SKUs presented challenges, particularly with disk IO bandwidth: each disk holds several times more data, so more read traffic lands on each spindle.
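To make the bandwidth concern concrete, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that a 4TB and a 16TB HDD deliver roughly the same throughput; the 200 MB/s figure and the helper name are hypothetical and do not come from Uber's post.

```python
def bandwidth_per_tb(disk_capacity_tb: float, disk_bandwidth_mb_s: float) -> float:
    """Return the disk bandwidth available per TB of stored data."""
    return disk_bandwidth_mb_s / disk_capacity_tb


# Assumption: both SKUs sustain about the same throughput per disk.
ASSUMED_DISK_BANDWIDTH_MB_S = 200.0

old_per_tb = bandwidth_per_tb(4, ASSUMED_DISK_BANDWIDTH_MB_S)
new_per_tb = bandwidth_per_tb(16, ASSUMED_DISK_BANDWIDTH_MB_S)

print(f"4TB disk:  {old_per_tb} MB/s per TB")
print(f"16TB disk: {new_per_tb} MB/s per TB")
print(f"Bandwidth per TB shrinks by a factor of {old_per_tb / new_per_tb}")
```

Under that assumption, every terabyte stored on the denser disk has a quarter of the bandwidth it had before, which is the gap a cache for hot data is meant to close.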

To address this, Uber implemented a read-only SSD cache within each DataNode to store frequently accessed data and serve read requests. This new feature, deployed alongside the 16TB HDD SKUs in production, significantly reduced the IO workload on the HDDs: it took over up to 60% of the traffic from them, doubled read performance, and reduced the chance of a process blocking on read by about one-third.
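The general idea of a read-only cache sitting in front of slower disks can be sketched in a few lines. This is a generic read-through cache with LRU eviction, not the Alluxio SDK API and not Uber's implementation; the class name, the `read_from_hdd` callback, and the choice of LRU are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Tiny read-only LRU cache in front of a slower block reader (illustrative)."""

    def __init__(self, capacity_blocks: int, read_from_hdd: Callable[[str], bytes]):
        self.capacity_blocks = capacity_blocks
        self.read_from_hdd = read_from_hdd  # hypothetical slow-path reader
        self._blocks = OrderedDict()        # block_id -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str) -> bytes:
        if block_id in self._blocks:
            # Cache hit: serve from the fast tier and mark as recently used.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Cache miss: fall back to the slow tier, then keep a copy.
        self.misses += 1
        data = self.read_from_hdd(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self.capacity_blocks:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data


hdd_reads = []


def slow_read(block_id: str) -> bytes:
    """Stand-in for an HDD read; records which blocks reached the slow tier."""
    hdd_reads.append(block_id)
    return f"data-{block_id}".encode()


cache = ReadThroughCache(capacity_blocks=2, read_from_hdd=slow_read)
for block in ["hot", "hot", "cold-1", "hot", "cold-2", "hot"]:
    cache.read(block)

print(f"hits={cache.hits} misses={cache.misses} hdd_reads={hdd_reads}")
```

In this toy workload the frequently read block stays in the cache while the one-off blocks pass through, so repeated reads of hot data never reach the slow tier after the first miss. A production cache also has to decide what to admit, which matters for SSD write endurance, one of the challenges the Uber post discusses.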

The blog post also outlines several challenges encountered during the adoption of the local SSD cache, including cache hit rate, write race condition, SSD write endurance, failure handling, and production readiness validation.

For the implementation, Uber leveraged the Alluxio SDK cache for performance and efficiency. Alluxio serves as a data caching layer and provides Hadoop-compatible file system APIs that seamlessly integrate with Hadoop-compatible compute engines. Alluxio also implements Java standard file I/O APIs, enabling smooth integration with HDFS.

The blog post concludes with a discussion on future work, including traffic/dataset tier-based caching and performance optimization.

Overall, the (Alluxio) local caching solution has proven to be effective in improving HDFS’s efficiency and enhancing the overall performance.
– Chen Liang, Jing Zhao, Yangjun Zhang, Junyan Guo, Fengnan Li @ Uber Engineering

At Alluxio, we're proud of our strides in collaboration with Uber and are eager to continue working together on future projects. Our shared success story is a testament to the power of innovative solutions and strong partnerships. We look forward to more groundbreaking collaborations, pushing the boundaries of data performance and driving the future of efficient data management together.
