Data Consistency Model in Alluxio

October 30, 2020

Baolong Mao

Jasmine Wang

Bin Fan

Unlike HDFS which provides one-copy update semantics or AWS S3 which provides eventual consistency, data consistency in Alluxio is a bit more complicated and depends on the configuration. In short, when clients are only reading and writing through Alluxio, the Alluxio file system provides strong consistency. However, when clients are writing data across both Alluxio and under storage, the consistency may depend on the write type and under storage type.

First of all, files in the Alluxio file system are write-once-read-many and identified by unique file IDs. So given this file ID all file readers access the file without worrying about data being stale.

As a result, Alluxio provides strong consistency for Alluxio applications (like MapReduce, Spark jobs) only accessing data through the Alluxio file system. Alluxio reader clients are always able to see the new files and their content written by Alluxio writer clients, as soon as the writes finish. This is because all the file system modification operations will go through Alluxio master service first and modify Alluxio file system state before returning to the client/applications successfully. Given this process, different Alluxio clients will always get the latest update as long as their corresponding write operations complete successfully.

Writing to Alluxio and Reading from Under Storage

When taking the state of under storage into account, it is possible that data written in Alluxio is inconsistent with under storage, depending on write types:

MUST_CACHE

MUST_CACHE does not write to under storage, so Alluxio space is inconsistent with under storage.

CACHE_THROUGH

CACHE_THROUGH writes data synchronously to Alluxio and under storage before returning success to applications:

In the case with HDFS: writing to under storage is also strongly consistent, and Alluxio space will be always consistent with under storage when there are no other out-of-band updates in under storage;

In the case with S3: writing to under storage is eventually consistent, it is possible that the file is written successfully to Alluxio but does not immediately show up in under storage. Therefore creates a short window of inconsistency before the data is finally propagated to the under storage. Even though in this case, Alluxio clients will still see a consistent file system as they will always consult Alluxio master that is strongly consistent.

ASYNC_THROUGH

ASYNC_THROUGH writes data to Alluxio and returns to applications, leaving Alluxio to propagate the data to UFS asynchronously. From the user’s perspective, the file can be written successfully to Alluxio, but gets persisted to the UFS after a lag.

THROUGH

THROUGH writes data to UFS directly without caching the data in Alluxio, however, Alluxio knows the files and its status. Thus the metadata is still consistent.

Writing to Under Storage and Reading from Alluxio

Meanwhile, if data in under storage is out of sync with Alluxio, e.g., due to out of band modification in under storage without going through Alluxio, the metadata sync feature can repair the inconsistency. Users can choose to automate the synchronization process periodically or by active sync. Please refer to documentation.

Conclusion

Due to its unique position in the ecosystem, Alluxio provides flexible data consistency models to balance performance, cost, and usability. Different metadata synchronization techniques are available to reconcile Alluxio and under storages.

Share this post

Blog

Alluxio and Oracle Cloud Infrastructure: Delivering Sub-Millisecond Latency for AI Workloads

Oracle Cloud Infrastructure has published a technical solution blog demonstrating how Alluxio on Oracle Cloud Infrastructure (OCI) delivers exceptional performance for AI and machine learning workloads, achieving sub-millisecond average latency, near-linear scalability, and over 90% GPU utilization across 350 accelerators.

Make Multi-GPU Cloud AI a Reality

If you’re building large-scale AI, you’re already multi-cloud by choice (to avoid lock-in) or by necessity (to access scarce GPU capacity). Teams frequently chase capacity bursts, “we need 1,000 GPUs for eight weeks,” across whichever regions or providers can deliver. What slows you down isn’t GPUs, it’s data. Simply accessing the data needed to train, deploy, and serve AI models at the speed and scale required – wherever AI workloads and GPUs are deployed – is in fact not simple at all. In this article, learn how Alluxio brings Simplicity, Speed, and Scale to Multi-GPU Cloud deployments.

Accelerate your Cloud Object Storage for AI Workloads

Turn your existing S3 storage into an AI-ready storage layer with sub-ms latency and terabytes per second throughout per Alluxio cluster with linear scalability — no data migration required.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo