Today, we are thrilled to launch the Alluxio Enterprise AI product. One of the key innovations is the introduction of the next-generation architecture DORA – a Decentralized Object Repository Architecture. This blog talks about our development of the DORA architecture, including our motivation, design decisions, and implementation.
1. Moving from Data Analytics to the AI Workloads
Since the launch of Alluxio Enterprise Edition in 2016, we’ve continuously enhanced its functionality and performance for big data analytics workloads such as Spark, Presto, Trino, and Hive. Its semantics resemble a general-purpose distributed file system specifically tailored for these frameworks and their workloads. For example, Alluxio uses a RAFT-based distributed system for journal storage and ensures high availability. It provides the filesystem semantics by creating and maintaining an inode tree from a single master with strong consistency. Additionally, it has been optimized for sequential file operations, enhancing throughput for MapReduce and Spark workloads.
In recent years, there have been numerous trends challenging or redefining our original design assumptions and trade-offs. These include the inherent decoupling of storage and computation in Kubernetes environments, the transition from traditional HDFS to object stores and cloud storage, and the increasing dominance of column-oriented structured data storage in analytics workloads. A particularly notable opportunity and challenge is addressing the explosive growth in model training and serving, which can involve billions of files or more.
These shifts offer both unique opportunities and challenges for Alluxio’s current architecture. Specifically, using a centralized inode tree to manage the entire namespace on one master node can strain the Alluxio master, especially when the namespace grows. In addition, the current file system journaling can also become a performance bottleneck during restart or failovers when dealing with billions of files. Moreover, frequent access to columnar storage results in a high volume of seek operations. With the current optimizations employing large block sizes of 64-256MB, performance relies heavily on prefetching. However, aggressive prefetching can drive unnecessary read amplification and heightened network traffic loads.
As a result, we’re transitioning to a simpler, more scalable, and modular architecture. This shift will enable developers to iterate fast, reduce operational costs and ensure higher reliability.
2. The New Decentralized Object Repository Architecture
DORA, short for Decentralized Object Repository Architecture, is the next-generation architecture of Alluxio.
As a distributed caching storage system, DORA offers low latency, high throughput, and cost savings while aiming to provide a high-performance data access layer for AI workloads.
DORA leverages decentralized storage and metadata management to provide higher performance and availability, as well as pluggable data security and governance, enabling more scalability and efficient management of large-scale data access.
DORA’s architecture goal:
- Scalability: Scalability is a top priority for DORA, which needs to support billions of files to meet the demands of data-intensive applications, such as AI training.
- High Availability: DORA’s architecture is designed with high availability in mind, with 99.99% uptime and protection against single points of failure at the master level.
- Performance: Performance is also a key goal for DORA, which prioritizes the speed of model training, model serving, as well as GPU utilization for AI workloads.
The diagram below shows the architecture design of DORA, which consists of four major components: the service registry, scheduler, client, and worker.
- The service registry is responsible for service discovery and maintains a list of workers.
- The scheduler handles all asynchronous jobs, such as distributed load.
- The client runs inside the applications and includes a consistent hash algorithm to determine which worker to visit.
- The worker is the most important component, as it stores both metadata and data that are shared by a key, which is typically the path of the file.
3. Technical Highlights
3.1 Caching Data Affinity
The client obtains a list of all the DORA workers from a highly available service registry, such as Alluxio Master based on Raft or Kubernetes, which can support tens of thousands of Alluxio workers. The client then uses a consistent hashing algorithm to determine which worker to visit based on the file path as the key, ensuring that the same file always goes to the same worker for a maximum cache hit rate. As the service registry is not in the critical I/O path, it will not be a performance bottleneck.
In addition, DORA’s architecture allows for easy scalability by simply adding more nodes to the cluster. Each worker node can support tens of millions of files, making it easy to handle increasing data volumes and growing user bases.
3.2 Page Data Store
DORA utilizes a battle-tested page store module for cache storage, enabling more granular caching of small to medium read requests on large files. This reliable page store technology has been proven in applications like Presto at Meta, Uber, and TikTok. DORA’s fine-grained caching has reduced read amplification by 150 times and increased file position read performance by up to 15X.
3.3 Decentralized Metadata Store
DORA spreads metadata to every worker to ensure that metadata is always accessible and available. To optimize metadata access, DORA utilizes a two-level caching system for metadata entries. The first level of caching is the in-memory cache, which stores metadata entries in memory. This cache has a configurable maximum capacity and time-to-live (TTL) setting to set an expiration duration. The second level of caching is the persistent cache, which stores metadata entries on disk using RocksDB. The persistent cache has unbounded capacity, depending on available disk space, and also uses TTL-based cache eviction, avoiding any active sync or invalidation. The stored metadata is hashed by the full UFS path like the Page Store.
The combination of in-memory and persistent caching helps ensure that metadata is readily available and accessible, while also allowing for efficient use of system resources. The decentralization of metadata avoids the bottleneck in the architecture where metadata is primarily managed by the master nodes. With the ability to store up to 30 to 50 million files per DORA worker, the system can support large-scale data-intensive applications with billions of files.
3.4 Zero-copy Networking
DORA provides a Netty-based data transmission solution that offers a 30%-50% performance improvement over gRPC. This solution has several advantages, including fewer data copies through different thread pools, zero-copy transmission that avoids serialization of Protobuf, optimized off-heap memory usage that prevents OOM errors, and less data transfer due to the absence of additional HTTP headers.
3.5 Scheduler and Distributed Loader
Our scheduler provides an intuitive, extendable solution for efficient job scheduling, with consideration towards observability, scalability, and reliability. It has also been used to implement a distributed load capable of loading billions of files.
4. Benchmark Results
4.1 Creating and Reading Large Numbers of Files (per worker)
During a simple scalability test, DORA was tested for its ability to store and serve files on a single worker node without any performance regression. The test was conducted using three data points – 4.8 million files, 24 million files, and 48 million files. The worker was able to store and serve 48 million files without any significant performance downgrade.
4.2 Positioned Read on Structured Data
In single thread positioned reads, DORA and Alluxio 2.x demonstrated on-par latency for cold reads. However, DORA proved 10x faster than Alluxio 2.x for warm reads in this single thread scenario. During multi-threaded testing across 4 threads, gaps widened further – DORA exceeded Alluxio 2.x by over 3x for cold reads and by 20x for warm reads. This indicates substantial scalability gains from the position read approach as concurrent requests multiply. While cold read speed remains comparable between versions in single thread context, Alluxio DORA shows major warm read performance jumps both in individual thread and parallel/multi-thread workloads.
5. Conclusion and Future Works
In conclusion, DORA is a decentralized object repository architecture that offers low latency, high throughput, and cost savings while supporting AI workloads. The architecture is designed with scalability, high availability, and performance in mind, aiming to provide a unified data layer that can support billions of files.
DORA’s release marks a significant milestone in the evolution of Alluxio, enabling organizations to remain competitive in rapidly evolving markets.
We will continue to enhance DORA’s scalability, reliability, and performance. Additionally, we will explore further in the following areas:
- Further optimizations for cost-efficiency and storage efficiency
- Enhancing RESTful APIs for both data and metadata
- Better support for ETL and remote shuffle workloads.