Cross Cluster Synchronization in Alluxio: Part 1 - Scenarios and Background

February 9, 2023

Hope Wang

This is a blog series talking about the design and implementation of the Cross Cluster Synchronization mechanism in Alluxio. This mechanism ensures that the metadata is consistent when running multiple Alluxio clusters. Part 1 of this blog series discusses the scenario and background.

Alluxio lies in between the storage and compute layers in order to provide high performance caching and a unified namespace over different under files systems (UFSs). While performing updates through Alluxio to the UFS will result in Alluxio being consistent with the UFSs, there may be cases where this is not true, for example when running multiple Alluxio clusters that share one or many UFS namespaces. In order to ensure consistency in this case the Cross Cluster Synchronization mechanism has been implemented in Alluxio which will be described in this blog series.

Background

As the amount of data grows, so does the complexity of how that data is stored and accessed. For example it may be located in different storage systems (S3, GCP, HDFS, etc.), it may be in the cloud or on premises, it may be in different regions, and it may further be isolated due to privacy or security constraints. Furthermore, these complexities are not just constrained to data storage, but also to how computation is done on the data, for example the data may be stored on the cloud, with compute taking place on premises.

Alluxio is a data orchestration platform that reduces these complexities by exposing a unified interface over the underlying storage systems (UFSs) and improves performance through data locality and caching.

While for many organizations it may be sufficient to run a single Alluxio cluster, others may want to run multiple Alluxio clusters. For example, if compute is run in multiple regions, an organization may benefit from running an Alluxio cluster in each of these regions. For others, running separate clusters may be necessary for data privacy reasons. Or an organization may simply want to run multiple clusters for increased scalability. While certain portions of the data space may be isolated to a single cluster, other data may be shared between multiple clusters. For example one cluster may be responsible for ingesting and transforming data while several other clusters may then query this data and possibly update it.

As each Alluxio cluster may replicate (i.e. mount) certain portions of UFS storage space, Alluxio is then responsible for keeping its copies consistent with the UFS so that the users observe the most up to date copy of the files. In this article, we will examine the components that are used in Alluxio to keep data consistent with the UFS with one or more clusters.

Data Consistency in Alluxio

Keeping data consistent in distributed systems is complex and there are dozens of different consistency levels, each allowing different clients to observe and modify different states of the data at any given time. These consistency levels create a spectrum going from weak to strong, with stronger consistency being more restrictive and generally easier to build applications on top of. Alluxio is no exception to these complexities and will provide different consistency guarantees depending on the configuration and UFS used (see Data Consistency Model in Alluxio for more details).

To simplify the discussion on consistency, we will make the following assumption:

For any file the UFS is the `source of truth` of the file.

This implies that every file in Alluxio corresponds to a file on the UFS, and the UFS always has the most up to date version of the file. If Alluxio stores a copy of a file which differs from the UFS then the version of the file in Alluxio is inconsistent. (As an additional detail, we assume that the UFS itself ensures strong consistency, specifically some form of linearizability or external consistency. From a high level this allows users to access the UFS as if it was a single file system executing operations sequentially in real time, even if the system is made up of many distributed pieces.)

Before discussing consistency between Alluxio and the UFS, let us consider the basic Alluxio architecture. Alluxio is made up of master and worker nodes. The masters are responsible for tracking the metadata of the file, for example its path, its size, etc. while workers are responsible for storing the data itself. For a client to read a file, it must first read the metadata from one of the masters, which will then be used to locate a worker storing a copy of the data (the data may be loaded from the UFS if necessary). For a client to write a file it must first create the metadata for the file in the master, then write the file to the UFS through a worker, and finally mark the file as complete on the master. While the file is being written, its metadata is flagged as incomplete, preventing other clients from accessing the file.

From this basic design we can see that as long as all updates to the files go through Alluxio to the UFS, then the data in Alluxio will remain consistent with the data in the UFS and clients will always observe the most up to date state of the data.

Unfortunately, in practice things are not always this simple, for example certain clients may update the UFS without going through Alluxio, or a client may fail while only partially writing the file to the UFS and without completing it on the Alluxio master. These, among other things, may result in inconsistencies between the data in Alluxio and the UFS.

So how are these issues dealt with? Since our key assumption states that the UFS is the source of truth, fixing these inconsistencies is done simply by synchronizing Alluxio with the UFS.

Metadata Sync

Metadata sync is the primary component used to detect and fix inconsistencies between Alluxio and the UFS. It may be triggered (based on certain conditions that will be discussed later) when a client accesses a path in Alluxio. The basic procedure works as follows:

Load the metadata for the path from the UFS.
Compare the UFS metadata with the metadata in Alluxio. A fingerprint of file data (possibly including things like last modified time and a collision resilient hash) is included with the metadata to check for data inconsistencies.
If any discrepancies are found, update the metadata in Alluxio, and mark any out of date data to be evicted from the workers. The new version of the data will be loaded from the UFS to the workers as necessary.

Figure : The metadata sync process during a client read. 1. The client reads a path in the file system. 2. The Metadata Sync module on the master checks if a sync is needed based on user configuration. 3. A synchronization is performed by loading the metadata from the UFS, and a fingerprint is created to compare the metadata in Alluxio with the UFS. The metadata is updated in Alluxio if the fingerprints differ. 4. The client reads the file data from the worker using the updated metadata, loading the data from the UFS as necessary.

The only thing left is to decide when to execute this procedure, which gives us a trade off between stronger consistency and better performance.

Metadata Sync on Every Access

If metadata sync is performed every time a path is accessed by a client in Alluxio, then the client will always see the most up to date state of data on the UFS. This will give us the strongest consistency levels, up to as strong as the consistency ensured by the UFS in most cases. Unfortunately this will result in degraded performance as each access includes a synchronization with the UFS even if no data has been modified.

Timed Based Metadata Sync

Alternatively the metadata sync can be performed based on a physical time interval. In this case, the metadata on the Alluxio master includes the time at which the path was last synchronized successfully with the UFS. Now only if a user defined interval has passed then a new synchronization will be performed (see UFS metadata sync for more details).

While this may greatly improve performance, it also results in a much weaker guarantee of eventual consistency. This means that any given read may or may not be consistent with the UFS. Furthermore, updates may be observed in any order. For example, even if in the UFS some file A is updated before another file B, the Alluxio cluster may observe the update to file B happening before file A. Thus, it is important for users of the system to understand these guarantees and adjust their applications as needed.

To summarize, this blog, as the first of the blog series, has briefly described how metadata sync is done with a single Alluxio cluster. The next blog will describe how this feature is built upon to provide metadata consistency in a multi-cluster scenario.

Share this post

Blog

New Features in Alluxio Enterprise AI 3.6

How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:

Time-consuming data preparation and data copy/movement
Difficulty utilizing GPU resources efficiently
High and growing storage costs
Excessive operational overhead maintaining storage for localized data silos

To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo