Alluxio Architecture: A Decentralized Data Acceleration Layer for the AI Era
Authors: Bin Fan, VP of Technology at Alluxio; Haoyuan Li, Founder & CEO at Alluxio


In this whitepaper, you will learn how Alluxio's Decentralized Object Repository Architecture (DORA) delivers sub-millisecond latency and terabytes per second of throughput to power large-scale AI training, deployment, and inference workloads in single-cloud and multi-cloud deployments.

1. Abstract

This white paper presents the Alluxio architecture, a cloud-native Data Acceleration Layer built to bridge the gap between high-performance GPU computing and distributed cloud storage. Alluxio addresses the critical I/O and data-mobility challenges faced by modern AI infrastructure, where compute performance has far surpassed data access capabilities.

The Decentralized Object Repository Architecture (DORA) design eliminates centralized bottlenecks through fully decentralized metadata and caching, enabling sub-millisecond latency, TB/s throughput, and 97–98% GPU utilization across large-scale, multi-cloud environments. Performance results on GPU cloud environments demonstrate that Alluxio achieves 10 GiB/s per server (100Gbps NIC), <1 ms latency, and delivers FSx-level performance at one-third the cost.

“Bring Data Close to Compute — Anywhere, Any Cloud.”

2. Motivation and Problem Statement

Modern AI infrastructure faces data challenges because GPU compute performance is rapidly outpacing data access capabilities.

  • Data is not plug-and-play for GPUs. Data gravity makes it difficult to achieve a plug-and-play experience for AI researchers and engineers. Workloads are often deployed wherever GPU capacity is more available or less expensive, while datasets remain bound to specific storage regions. Migrating or moving data on demand is often required but time-consuming, error-prone, and expensive. AI researchers and scientists do not want to deal with issues like "too large to fit on local disk," unstable I/O, or wrong URLs and permission errors.

  • High I/O Requirements for GPUs. During training, a single GPU may require more than 1 GB/s of I/O throughput to sustain computation, and for a large-scale GPU cluster (e.g., thousands of GPUs) the peak throughput demand can reach terabytes per second. Slow data delivery directly translates into wasted GPU cycles and millions of dollars in unused compute capacity.

  • Massive Number of Files in Datasets. Training multimodal models may require billions of files and associated metadata entries. In traditional distributed file systems, a centralized metadata service typically becomes the first point of slowness or failure, and ultimately a scalability barrier.

These challenges call for an Easy, Fast, and Scalable data access solution that enables users to focus on training, deploying, and serving AI models without worrying about data configuration or migration.

In short, Alluxio aims to enable researchers and engineers to access data wherever their compute runs — simply mount existing cloud bucket(s) as local folders but with local NVMe performance and start accessing data immediately without any migration.

Why Existing Solutions Fall Short

There are many data solutions in the ecosystem, but none address all three dimensions of scalability, simplicity, and cloud mobility:

  • Single-Node CLI Tools (e.g., s3fs, gcsfs): Convenient for mounting object storage on a single node, but lack distribution, sharing, and concurrency across clusters.

  • HPC Storage Systems (e.g., Lustre, GPFS, VAST Data, Weka): Deliver excellent performance, but remain heavyweight and often expensive data silos. They do not solve data gravity or cross-cloud accessibility.

  • Managed Cloud Caching (e.g., AWS FSx for Lustre, GCP Anywhere Cache): Closer to Alluxio's goal in performance and target user experience, but require dedicated provisioning, are bound to a single cloud, and lack a lightweight software-only footprint.

Alluxio’s approach, on the other hand, is a software-defined, cloud-native data acceleration layer that complements existing object storage instead of replacing it. It brings performance, caching, and rich semantics on top of your existing buckets—anywhere.

Goals

Alluxio focuses on three dominant workload categories:

  1. Large-Scale Model Training & Deployment

    High-throughput, POSIX-based access for distributed AI workloads.

  2. Ultra-Low-Latency Feature Store / Agentic Memory on Cloud

    Sub-millisecond access to Parquet and PB-scale data lakes directly on cloud storage.

  3. Multi-Cloud Data Sharing and Synchronization

    Unified namespace and caching across regions and clouds.

Non-Goals
  • Alluxio is not intended to be a general-purpose file system; it does not aim to implement full traditional filesystem semantics, only the subset required by AI training and inference workloads.
  • Alluxio does not provide native data durability; instead, it relies on underlying cloud object stores (e.g., S3, GCS, OCI Storage) for persistence.

3. Decentralized Architecture Overview

Alluxio Enterprise Edition adopts the Decentralized Object Repository Architecture (DORA), which is designed to deliver the extreme scalability, high availability, and top performance required by modern AI workloads.

3.1 Centralized vs. Decentralized Metadata Service

When the Alluxio open-source project was launched in 2013, the system followed the classic HDFS/GFS-style master–worker model designed for big data workloads. In this model, the master maintains the global metadata directory, including an inode tree and its journal (edit logs), while workers store cached data blocks in a distributed way. This architecture served analytics frameworks such as Spark, Presto, and Hive well, where workloads typically involved up to hundreds of millions of files and I/O patterns dominated by reads of sufficiently large files (tens to hundreds of megabytes).

As workloads shifted from large-scale batch analytics to AI training and multimodal data processing, the I/O access pattern changed dramatically. I/O for AI workloads is characterized by:

  • Billions of small files, often representing individual samples, embeddings, or feature shards.
  • Highly concurrent and bursty reads, driven by parallel GPUs or distributed data loaders.
  • Simplified access semantics, dominated by open–read–checkpoint cycles rather than complex renames, updates or appends.

These shifts in access patterns exposed the fundamental limitations of the centralized metadata service design:

  • Metadata Centralization:
    A single master typically maintains the entire inode tree and journal. Once the namespace scales beyond hundreds of millions of files, the master becomes a bottleneck. Restarting or replaying billions of journal entries can take hours.

  • I/O Path Inefficiency:
    Every metadata lookup goes through the master, adding network hops and round-trip latency. Under highly parallel GPU training workloads, this centralized routing layer quickly throttles throughput.

  • Failover Complexity:
    Even with HA mechanisms such as RAFT-based replicated journals, leader elections and log replays cause multi-minute downtime—unacceptable for latency-sensitive AI environments.

Meanwhile, the simplification of AI I/O semantics—read-heavy workloads with weak cross-file consistency—opened up an opportunity: the system no longer needs a centralized master to coordinate every operation.

This combination of increased parallelism and reduced coordination requirements made a fully decentralized architecture both necessary and feasible. These observations ultimately led to the design of DORA — the Decentralized Object Repository Architecture.

3.2 Embracing Full Decentralization in the AI Era

These lessons led to a fundamental redesign: Alluxio was transformed from a centralized, master-orchestrated system into a fully decentralized, stateless caching layer for both data and metadata.

In Alluxio Enterprise Edition 3.0 or later, no master node exists. Instead, all workers participate in a consistent hashing ring. Each worker acts as both a data and metadata cache for files within its assigned segments on the ring. Instead of contacting a master to obtain file metadata and then fetching data from a worker, the client connects directly to the appropriate worker using consistent hashing based on the file path. Cluster coordination and membership management are handled by lightweight service registries (e.g., etcd) rather than a master node. This architecture is illustrated below:

Figure 1: DORA Architecture Overview — Clients connect directly to workers via consistent hashing; no centralized master exists.

The major components are:

Client

Embedded within applications or frameworks. Uses a consistent hashing algorithm to locate the responsible worker directly based on the file path, eliminating centralized routing or lookup tables.
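To make this routing concrete, the following Python sketch shows how a client might map a file path onto a consistent hashing ring. This is an illustrative toy rather than Alluxio's actual implementation; the hash function, virtual-node count, and worker addresses are assumptions.

import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every client maps the same path to the same worker.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, workers, virtual_nodes=128):
        # Each worker occupies many virtual positions to spread load evenly.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # Walk clockwise from the path's hash to the next worker position.
        idx = bisect.bisect(self._keys, _hash(path)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["worker-1:29999", "worker-2:29999", "worker-3:29999"])
print(ring.worker_for("s3://bucket/train/sample-000123.tfrecord"))

Because every client computes this mapping locally, no lookup service sits on the read path, and adding or removing a worker only remaps the ring segments adjacent to it.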

Worker

The fundamental storage and caching unit. Each worker stores both data and metadata locally on NVMe, supports fine-grained page caching, and manages persistence using RocksDB. Workers operate independently and can join or leave dynamically.

Service Registry

Maintains cluster membership and service discovery. The default implementation uses etcd, which tracks active workers and their assigned data shards. It is not on the I/O path, ensuring no overhead on the critical path.
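As an illustration of lease-based membership, the sketch below registers a worker in etcd using the third-party etcd3 Python client; the key layout, TTL, and addresses are assumptions, not Alluxio's actual registry schema.

import time
import etcd3

client = etcd3.client(host="etcd-host", port=2379)

# Register this worker under a lease; if heartbeats stop, the key expires
# and the rest of the cluster treats the worker as inactive.
lease = client.lease(ttl=10)
client.put("/workers/worker-1", "worker-1:29999", lease=lease)

# In a real worker this heartbeat loop runs for the lifetime of the process.
for _ in range(3):
    lease.refresh()
    time.sleep(3)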

Coordinator

Manages background distributed tasks—such as prefetching, asynchronous loading, and replication—through a lightweight, stateless scheduling service. Provides observability and extensibility for custom job scheduling.

Together, these components form a decentralized, self-balancing architecture that eliminates single points of failure while scaling linearly with data volume and compute growth.

3.3 I/O Flow

Figure 1 illustrates the flow of a typical I/O request in Alluxio.

  1. Application–Worker RPC
    When the application (e.g., training job on a GPU server) makes a file operation (e.g., open or read), the client consults the consistent hashing ring to locate the responsible worker. The hash function ensures stable mappings even as workers join or leave the cluster.
  2. Local Cache Access
    The selected worker first checks its local NVMe cache (called the Page Store) for the requested data block or page. Metadata for the same object is co-located on the same worker in RocksDB, ensuring constant-time lookups and zero dependency on external metadata services.
  3. Cache Miss and Data Fetch
    On a cache miss, the worker fetches data directly from the underlying object store (e.g., S3, OCI Object Storage, or OSS) using range GET requests. The retrieved pages are cached asynchronously in the background for subsequent access, minimizing stall time for clients.

Through this flow, data is always fetched from the worker on the ring (local cache on cache hit and remote storage on miss).
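A simplified worker-side read path might look like the sketch below. The bucket and key names are placeholders, the in-memory dictionary stands in for the NVMe page store, and the cache fill is shown synchronously for brevity, whereas Alluxio populates its cache asynchronously.

import boto3

PAGE_SIZE = 4 * 1024 * 1024  # 4 MB pages (see Section 4)
page_cache = {}              # (path, page_index) -> bytes; stand-in for the NVMe page store

s3 = boto3.client("s3")

def read_page(bucket: str, key: str, page_index: int) -> bytes:
    cache_key = (f"{bucket}/{key}", page_index)
    if cache_key in page_cache:          # step 2: local cache hit
        return page_cache[cache_key]

    # step 3: cache miss, fetch only the needed byte range from the object store
    start = page_index * PAGE_SIZE
    end = start + PAGE_SIZE - 1
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    page = resp["Body"].read()
    page_cache[cache_key] = page         # keep the page for subsequent reads
    return page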

3.4 Design Principles Summary

The DORA architecture embodies three key design principles:

  1. Scalability – Metadata and data are fully decentralized by consistent hashing, removing global locks and central bottlenecks. Each worker can independently manage tens of millions of files, allowing near-linear horizontal scaling.

  2. High Availability – There is no SPOF (single point of failure) because there are no components like a single master or journal. Cluster coordination via etcd ensures continuous service even under node failures.

  3. Top Performance – Direct client-to-worker access and hash-based sharding enable sub-millisecond latency and TB/s throughput across distributed GPU clusters.

4. Worker: The Cache Engine

Each Alluxio Worker is the basic storage and caching engine in the DORA architecture. Each worker maintains a unique Worker ID, persisted locally on disk. Upon startup or restart, the worker re-registers with the service registry (etcd) using this ID to reclaim its corresponding position on the consistent hashing ring. Based on its assigned segment, the worker becomes responsible for both the metadata and cached data objects belonging to that segment.

Each worker manages a portion of local storage capacity, ideally on NVMe SSDs for high throughput and low latency. For files in its hash shards on the ring, data and metadata are both persisted locally, ensuring that a node restart does not result in cache loss.

Fine-Granularity Caching

Alluxio Workers use fine-grained caching (as opposed to entire-object caching). Each cached object is split into pages, typically ≤ 4 MB in size.

Caching at smaller granularity reduces read amplification but increases management overhead, while larger pages may waste cache space on small or partial reads confined to a few hot regions (a pattern often seen when reading Parquet files, where footers and indexes are accessed far more frequently than the rest of the file).

Empirically, a 4 MB page size provides an optimal balance between cache efficiency and management overhead.
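The sketch below shows how a byte-range read maps onto fixed-size pages, assuming the 4 MB page size described above.

PAGE_SIZE = 4 * 1024 * 1024

def pages_for_read(offset: int, length: int):
    # Return the page indices touched by a read of `length` bytes at `offset`.
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return list(range(first, last + 1))

# Reading a 1 KB Parquet footer near the end of a ~10 GB file touches a single
# page, so only ~4 MB is cached instead of the entire object.
print(pages_for_read(offset=10_737_400_000, length=1024))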

Cache Eviction

The page is also the smallest eviction unit. Each worker maintains LRU (Least Recently Used) queues to manage the page lifecycle. When local capacity is full, cold pages are evicted in LRU order to make space for new data.
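A minimal sketch of LRU page eviction follows; capacity accounting (counted in pages rather than bytes) and concurrency are simplified relative to a real worker.

from collections import OrderedDict

class LRUPageCache:
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # (file, page_index) -> bytes

    def get(self, key):
        if key not in self.pages:
            return None                     # cache miss
        self.pages.move_to_end(key)         # mark as most recently used
        return self.pages[key]

    def put(self, key, data: bytes):
        self.pages[key] = data
        self.pages.move_to_end(key)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the coldest page first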

Per-File Metadata Cache

In addition to data pages, each worker caches file-level metadata such as size, modification time, and permissions. As a result, metadata operations like fstat are served directly from the worker-side cache, avoiding RPCs to any central service.

Zero-Copy Data Transfer

The control path between clients and workers uses gRPC for metadata and coordination operations but not for the data path.

For data reads and writes, DORA employs a Netty-based zero-copy I/O pipeline, bypassing gRPC serialization and user-space buffering.

This reduces context switching and CPU overhead, improving throughput by 30–50% compared with traditional RPC-based data transfers.

In summary, each worker is a self-contained caching node that integrates:

  • Hash-based namespace sharding for scalability

  • Page-level caching and metadata colocation for efficiency

  • Persistent local storage for durability across restarts

  • Zero-copy networking for top-tier performance

5. Under File System (UFS): The Persistent Layer

Alluxio connects seamlessly with a wide range of existing storage systems—both cloud and on-premises—through its Under File System (UFS) abstraction. These connectors allow Alluxio to act as a data acceleration and unification layer on top of heterogeneous storage backends such as S3, GCS, HDFS, and NAS, without requiring any modification to existing data or infrastructure.

Each DORA worker interacts directly with its assigned UFS for data fetching and persistence. When data is first accessed, the worker retrieves it from the UFS and caches it locally on NVMe for subsequent high-speed access, while maintaining a consistent view of the namespace with the original source.

5.1 Supported Storage Systems

Alluxio provides native integrations for all major cloud object stores—including Amazon S3, Google Cloud Storage, Azure Blob Storage, Alibaba OSS, Tencent COS, and others—as well as on-premises systems such as HDFS, MinIO, Ceph, and NFS. Through the UFS layer, Alluxio exposes a unified file and object interface, automatically adapting to backend semantics such as eventual consistency, multipart upload, and authentication models. This design enables enterprises to combine multiple backends—such as linking multi-region S3 buckets and on-prem HDFS clusters—under a single Alluxio namespace.

5.2 UFS as the Source of Truth and Consistency Model

Within the DORA architecture, Alluxio serves as a stateless acceleration layer and the Under File System (UFS) serves as the persistent layer—the ultimate source of truth for all data. Alluxio's cache provides high-speed, transient access, while the UFS guarantees durability and long-term consistency.

Alluxio maintains cache consistency through validation and synchronization optimized for performance-critical AI workloads:

  • Read-Through:
    When a file is first accessed, the worker retrieves it directly from the UFS and caches both data and metadata locally. Subsequent reads are served from the cache unless the UFS version has changed.

  • Write and Sync Behavior:
    Write operations follow a write-through or write-back (beta) policy depending on configuration:
    • Write-through immediately persists changes to the UFS, ensuring full durability.
    • Write-back batches and asynchronously flushes updates for higher performance, suitable for temporary or intermediate results.

In both cases, the UFS remains the single authoritative data store.

  • Configurable TTL and Refresh:
    Users can define Time-to-Live (TTL) to control how long cached data is trusted before revalidation.  Shorter TTL values ensure stronger consistency, while longer TTLs maximize performance for stable or read-only datasets such as model checkpoints and training data snapshots.

This model provides a flexible balance between correctness and performance, allowing Alluxio to safely accelerate read-heavy AI workloads—where datasets are immutable or rarely updated—while still preserving consistency for write-intensive applications.

In short, the UFS serves as the durable source of truth, and Alluxio’s cache layer maintains coherence through validation, TTL refresh, and policy-driven synchronization.
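As a rough illustration of TTL-based revalidation, the sketch below trusts cached metadata for a fixed window and re-checks the UFS afterwards; the TTL value and the UFS lookup helper are hypothetical stand-ins rather than Alluxio configuration.

import time

TTL_SECONDS = 300      # illustrative TTL; tune per workload
metadata_cache = {}    # path -> (metadata, cached_at)

def fetch_metadata_from_ufs(path: str) -> dict:
    # Hypothetical stand-in for a real UFS HEAD/stat request.
    return {"path": path, "size": 0}

def get_metadata(path: str) -> dict:
    entry = metadata_cache.get(path)
    if entry is not None:
        metadata, cached_at = entry
        if time.time() - cached_at < TTL_SECONDS:
            return metadata                          # still within TTL: trust the cache
    metadata = fetch_metadata_from_ufs(path)         # revalidate against the UFS
    metadata_cache[path] = (metadata, time.time())
    return metadata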

5.3 Key Takeaways

  • Seamless Integration: Alluxio connects with all major cloud and on-premises storage systems through UFS connectors.

  • Persistent Layer of Truth: UFS maintains durable, consistent data storage while Alluxio focuses on locality and acceleration.

6. User-facing: Multi-Protocol Access

Alluxio provides multiple interfaces for applications to access data, ensuring compatibility with a wide range of existing tools and frameworks. It also offers powerful features to optimize performance and ensure high availability. This section provides an overview of the primary data access methods and related features.

Alluxio offers several ways for applications and users to interact with the data it manages:

  • POSIX API via FUSE: Mount Alluxio as a local filesystem, allowing any application or command-line tool (ls, cat, cp) to interact with Alluxio using standard file operations. This is the most common method for seamless integration with existing applications, especially for ML/AI training workloads.
  • S3 API: Expose an S3-compatible endpoint, allowing applications built with AWS S3 SDKs (such as Python's boto3 or the Java S3 client) to connect to Alluxio. This is ideal for data science and ML workloads that are already integrated with S3 (see the sketch after this list).
  • Python API via FSSpec: A Pythonic filesystem interface (alluxiofs) for developers using libraries like Pandas, PyArrow, and Ray. It provides a native and efficient way to interact with Alluxio within the Python ecosystem.
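For example, an application already written against the S3 API can be pointed at Alluxio by changing only its endpoint. The endpoint URL, credentials, bucket, and key below are placeholders; replace them with values from your own deployment.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://alluxio-s3-endpoint:39999",  # placeholder address
    aws_access_key_id="user",                         # placeholder credentials
    aws_secret_access_key="placeholder",
)

# Existing S3-based code continues to work, now served through Alluxio's cache.
resp = s3.get_object(Bucket="training-data", Key="datasets/sample.parquet")
data = resp["Body"].read()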

7. Fault Tolerance

Alluxio is designed to be highly resilient to failures. It includes multiple mechanisms to ensure that I/O operations continue gracefully even when components become unavailable, as in the following scenarios:

Network Partition

If a client tries to read data from a worker and that worker is unavailable due to network issues, the client can automatically fall back to reading the data directly from the Under File System. This ensures the application's read request succeeds without interruption, even if the Alluxio cluster becomes unresponsive.
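Conceptually, this fallback behaves like the sketch below; both helper functions are hypothetical stand-ins for the real worker RPC and the direct UFS read.

def read_from_worker(path: str) -> bytes:
    # Stand-in for the RPC to the worker selected by consistent hashing.
    raise ConnectionError("worker unreachable")

def read_from_ufs(path: str) -> bytes:
    # Stand-in for a direct object-store read (e.g., a ranged S3 GET).
    return b"...object bytes..."

def read_with_fallback(path: str) -> bytes:
    try:
        return read_from_worker(path)   # normal accelerated path
    except ConnectionError:
        return read_from_ufs(path)      # degrade gracefully to the UFS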

Worker Restart

Workers may need to restart due to hardware upgrades or maintenance. After restarting, the worker reloads its persisted Worker ID and reclaims its segment on the ring. Since both metadata and data pages are stored on local persistent storage (NVMe + RocksDB), cached data is automatically rediscovered and reused—no cold rebuild is required.

Hardware Failure

If a worker becomes unreachable for a configured timeout, the service registry (etcd) marks it inactive. The hashing ring then automatically rebalances: subsequent requests mapped to that segment are rerouted to neighboring workers. These requests result in cold reads from the underlying object store, but clients experience no visible errors or retries.

8. Summary

Alluxio has evolved from a “Big Data Acceleration Layer” into an AI-Native Data Access Platform. Through its decentralized DORA architecture, page-level caching, and cloud-native orchestration, Alluxio delivers a unique combination of performance, scalability, and cost-efficiency.

With Alluxio:
  • Data is always close to compute.
  • GPUs never wait.
  • AI workloads run anywhere, seamlessly.
