Metadata synchronization (sync) is a core feature in Alluxio that keeps files and directories consistent with their source of truth in under storage systems, thus making it simple for users to reason the data retrieved from Alluxio. Meanwhile, understanding the internal process is important in order to tune the performance. This article describes the design and the implementation in Alluxio to keep metadata synchronized.
Software Engineer, Alluxio
Data processing is increasingly making use of NVIDIA computing for massive parallelism. Advancements in accelerated compute mean that access to storage must also be quicker, whether in analytics, artificial intelligence (AI), or machine learning (ML) pipelines.
Alluxio provides a distributed data access layer for applications like Spark or Presto to access different underlying file system (or UFS) through a single API in a unified file system namespace. If users only interact with the files in the UFS through Alluxio, since Alluxio has knowledge of any changes the client makes to the UFS, it will keep Alluxio namespace in sync with the UFS namespace.
This blog describes our experience in speeding up Alluxio metadata operations using fingerprint and Alluxio under store bulk operations. These latest optimizations can be found in the 1.8.1 release.
One of the major values Alluxio provides is a simple and unified interface to manage files and directories on different underlying storage systems. Alluxio acts as an intermediate layer and exposes a file interface for applications to interact with, even though the underlying storage system might be an object store that has a different interface.