Products
Data Consistency Model in Alluxio
October 30, 2020
Unlike HDFS which provides one-copy update semantics or AWS S3 which provides eventual consistency, data consistency in Alluxio is a bit more complicated and depends on the configuration. In short, when clients are only reading and writing through Alluxio, the Alluxio file system provides strong consistency. However, when clients are writing data across both Alluxio and under storage, the consistency may depend on the write type and under storage type.
First of all, files in the Alluxio file system are write-once-read-many and identified by unique file IDs. So given this file ID all file readers access the file without worrying about data being stale.
As a result, Alluxio provides strong consistency for Alluxio applications (like MapReduce, Spark jobs) only accessing data through the Alluxio file system. Alluxio reader clients are always able to see the new files and their content written by Alluxio writer clients, as soon as the writes finish. This is because all the file system modification operations will go through Alluxio master service first and modify Alluxio file system state before returning to the client/applications successfully. Given this process, different Alluxio clients will always get the latest update as long as their corresponding write operations complete successfully.
Writing to Alluxio and Reading from Under Storage
When taking the state of under storage into account, it is possible that data written in Alluxio is inconsistent with under storage, depending on write types:
MUST_CACHE
MUST_CACHE does not write to under storage, so Alluxio space is inconsistent with under storage.
CACHE_THROUGH
CACHE_THROUGH writes data synchronously to Alluxio and under storage before returning success to applications:
- In the case with HDFS: writing to under storage is also strongly consistent, and Alluxio space will be always consistent with under storage when there are no other out-of-band updates in under storage;
- In the case with S3: writing to under storage is eventually consistent, it is possible that the file is written successfully to Alluxio but does not immediately show up in under storage. Therefore creates a short window of inconsistency before the data is finally propagated to the under storage. Even though in this case, Alluxio clients will still see a consistent file system as they will always consult Alluxio master that is strongly consistent.
ASYNC_THROUGH
ASYNC_THROUGH writes data to Alluxio and returns to applications, leaving Alluxio to propagate the data to UFS asynchronously. From the user’s perspective, the file can be written successfully to Alluxio, but gets persisted to the  UFS after a lag.
THROUGH
THROUGH writes data to UFS directly without caching the data in Alluxio, however, Alluxio knows the files and its status. Thus the metadata is still consistent.
Writing to Under Storage and Reading from Alluxio
Meanwhile, if data in under storage is out of sync with Alluxio, e.g., due to out of band modification in under storage without going through Alluxio, the metadata sync feature can repair the inconsistency. Users can choose to automate the synchronization process periodically or by active sync. Please refer to documentation.
Conclusion
Due to its unique position in the ecosystem, Alluxio provides flexible data consistency models to balance performance, cost, and usability. Different metadata synchronization techniques are available to reconcile Alluxio and under storages.
.png)
Blog

Make Multi-GPU Cloud AI a Reality
If you’re building large-scale AI, you’re already multi-cloud by choice (to avoid lock-in) or by necessity (to access scarce GPU capacity). Teams frequently chase capacity bursts, “we need 1,000 GPUs for eight weeks,” across whichever regions or providers can deliver. What slows you down isn’t GPUs, it’s data. Simply accessing the data needed to train, deploy, and serve AI models at the speed and scale required – wherever AI workloads and GPUs are deployed – is in fact not simple at all. In this article, learn how Alluxio brings Simplicity, Speed, and Scale to Multi-GPU Cloud deployments.

Alluxio's Strong Q2: Sub-Millisecond AI Latency, 50%+ Customer Growth, and Industry-Leading MLPerf Results
Alluxio's strong Q2 featured Enterprise AI 3.7 launch with sub-millisecond latency (45× faster than S3 Standard), 50%+ customer growth including Salesforce and Geely, and MLPerf Storage v2.0 results showing 99%+ GPU utilization, positioning the company as a leader in maximizing AI infrastructure ROI.
