Introducing DORA: The Next-generation Alluxio Architecture

October 18, 2023

Bin Fan

Today, we are thrilled to launch the Alluxio Enterprise AI product. One of the key innovations is the introduction of the next-generation architecture DORA - a Decentralized Object Repository Architecture. This blog talks about our development of the DORA architecture, including our motivation, design decisions, and implementation.

1. Moving from Data Analytics to the AI Workloads

Since the launch of Alluxio Enterprise Edition in 2016, we've continuously enhanced its functionality and performance for big data analytics workloads such as Spark, Presto, Trino, and Hive. Its semantics resemble a general-purpose distributed file system specifically tailored for these frameworks and their workloads. For example, Alluxio uses a RAFT-based distributed system for journal storage and ensures high availability. It provides the filesystem semantics by creating and maintaining an inode tree from a single master with strong consistency. Additionally, it has been optimized for sequential file operations, enhancing throughput for MapReduce and Spark workloads.

In recent years, there have been numerous trends challenging or redefining our original design assumptions and trade-offs. These include the inherent decoupling of storage and computation in Kubernetes environments, the transition from traditional HDFS to object stores and cloud storage, and the increasing dominance of column-oriented structured data storage in analytics workloads. A particularly notable opportunity and challenge is addressing the explosive growth in model training and serving, which can involve billions of files or more.

These shifts offer both unique opportunities and challenges for Alluxio's current architecture. Specifically, using a centralized inode tree to manage the entire namespace on one master node can strain the Alluxio master, especially when the namespace grows. In addition, the current file system journaling can also become a performance bottleneck during restart or failovers when dealing with billions of files. Moreover, frequent access to columnar storage results in a high volume of seek operations. With the current optimizations employing large block sizes of 64-256MB, performance relies heavily on prefetching. However, aggressive prefetching can drive unnecessary read amplification and heightened network traffic loads.

As a result, we're transitioning to a simpler, more scalable, and modular architecture. This shift will enable developers to iterate fast, reduce operational costs and ensure higher reliability.

2. The New Decentralized Object Repository Architecture

DORA, short for Decentralized Object Repository Architecture, is the next-generation architecture of Alluxio.

As a distributed caching storage system, DORA offers low latency, high throughput, and cost savings while aiming to provide a high-performance data access layer for AI workloads.

DORA leverages decentralized storage and metadata management to provide higher performance and availability, as well as pluggable data security and governance, enabling more scalability and efficient management of large-scale data access.

DORA’s architecture goal:

Scalability: Scalability is a top priority for DORA, which needs to support billions of files to meet the demands of data-intensive applications, such as AI training.
High Availability: DORA's architecture is designed with high availability in mind, with 99.99% uptime and protection against single points of failure at the master level.
Performance: Performance is also a key goal for DORA, which prioritizes the speed of model training, model serving, as well as GPU utilization for AI workloads.

The diagram below shows the architecture design of DORA, which consists of four major components: the service registry, scheduler, client, and worker.

The service registry is responsible for service discovery and maintains a list of workers.
The scheduler handles all asynchronous jobs, such as distributed load.
The client runs inside the applications and includes a consistent hash algorithm to determine which worker to visit.
The worker is the most important component, as it stores both metadata and data that are shared by a key, which is typically the path of the file.

3. Technical Highlights

3.1 Caching Data Affinity

The client obtains a list of all the DORA workers from a highly available service registry, such as Alluxio Master based on Raft or Kubernetes, which can support tens of thousands of Alluxio workers. The client then uses a consistent hashing algorithm to determine which worker to visit based on the file path as the key, ensuring that the same file always goes to the same worker for a maximum cache hit rate. As the service registry is not in the critical I/O path, it will not be a performance bottleneck.

In addition, DORA's architecture allows for easy scalability by simply adding more nodes to the cluster. Each worker node can support tens of millions of files, making it easy to handle increasing data volumes and growing user bases.

3.2 Page Data Store

DORA utilizes a battle-tested page store module for cache storage, enabling more granular caching of small to medium read requests on large files. This reliable page store technology has been proven in applications like Presto at Meta, Uber, and TikTok. DORA's fine-grained caching has reduced read amplification by 150 times and increased file position read performance by up to 15X.

3.3 Decentralized Metadata Store

DORA spreads metadata to every worker to ensure that metadata is always accessible and available. To optimize metadata access, DORA utilizes a two-level caching system for metadata entries. The first level of caching is the in-memory cache, which stores metadata entries in memory. This cache has a configurable maximum capacity and time-to-live (TTL) setting to set an expiration duration. The second level of caching is the persistent cache, which stores metadata entries on disk using RocksDB. The persistent cache has unbounded capacity, depending on available disk space, and also uses TTL-based cache eviction, avoiding any active sync or invalidation. The stored metadata is hashed by the full UFS path like the Page Store.

The combination of in-memory and persistent caching helps ensure that metadata is readily available and accessible, while also allowing for efficient use of system resources. The decentralization of metadata avoids the bottleneck in the architecture where metadata is primarily managed by the master nodes. With the ability to store up to 30 to 50 million files per DORA worker, the system can support large-scale data-intensive applications with billions of files.

3.4 Zero-copy Networking

DORA provides a Netty-based data transmission solution that offers a 30%-50% performance improvement over gRPC. This solution has several advantages, including fewer data copies through different thread pools, zero-copy transmission that avoids serialization of Protobuf, optimized off-heap memory usage that prevents OOM errors, and less data transfer due to the absence of additional HTTP headers.

3.5 Scheduler and Distributed Loader

Our scheduler provides an intuitive, extendable solution for efficient job scheduling, with consideration towards observability, scalability, and reliability. It has also been used to implement a distributed load capable of loading billions of files.

4. Benchmark Results

4.1 Creating and Reading Large Numbers of Files (per worker)

During a simple scalability test, DORA was tested for its ability to store and serve files on a single worker node without any performance regression. The test was conducted using three data points - 4.8 million files, 24 million files, and 48 million files. The worker was able to store and serve 48 million files without any significant performance downgrade.

4.2 Positioned Read on Structured Data

In single thread positioned reads, DORA and Alluxio 2.x demonstrated on-par latency for cold reads. However, DORA proved 10x faster than Alluxio 2.x for warm reads in this single thread scenario. During multi-threaded testing across 4 threads, gaps widened further - DORA exceeded Alluxio 2.x by over 3x for cold reads and by 20x for warm reads. This indicates substantial scalability gains from the position read approach as concurrent requests multiply. While cold read speed remains comparable between versions in single thread context, Alluxio DORA shows major warm read performance jumps both in individual thread and parallel/multi-thread workloads.

5. Conclusion and Future Works

In conclusion, DORA is a decentralized object repository architecture that offers low latency, high throughput, and cost savings while supporting AI workloads. The architecture is designed with scalability, high availability, and performance in mind, aiming to provide a unified data layer that can support billions of files.

DORA's release marks a significant milestone in the evolution of Alluxio, enabling organizations to remain competitive in rapidly evolving markets.

We will continue to enhance DORA's scalability, reliability, and performance. Additionally, we will explore further in the following areas:

Further optimizations for cost-efficiency and storage efficiency
Enhancing RESTful APIs for both data and metadata
Better support for ETL and remote shuffle workloads.

We welcome everyone to download the trial edition of Alluxio Enterprise AI. Also, please join us on the Alluxio community Slack for further discussions.

Share this post

Blog

New Features in Alluxio Enterprise AI 3.6

How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:

Time-consuming data preparation and data copy/movement
Difficulty utilizing GPU resources efficiently
High and growing storage costs
Excessive operational overhead maintaining storage for localized data silos

To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo