Alluxio provides a unified namespace where you can mount multiple different storage systems and access them through the same API. To serve the file system requests to operate on all the files and directories in this namespace, Alluxio masters must handle the file system metadata at a scale of all mounted systems combined. We are writing several engineering blogs describing the design and implementation of Alluxio master to address this scalability challenge. This is the first article focusing on metadata storage and service, particularly how to use RocksDB as an embedded persistent key-value store to encode and store the file system inode tree with high performance.
Alluxio serves its metadata from a single active master as the primary and potentially multiple standby master for high availability. The master handles all metadata requests and uses a write-ahead log to journal all changes so that we can recover from crashes. The log is typically written to shared storage like HDFS for persistence and availability. Standby masters read the write-ahead log to keep their own state up-to-date. If the primary master dies, one of the standbys can quickly take over for it.
In Alluxio 1.x, all the file system metadata is stored on-heap at Alluxio master node and organized in a data structure similar to Java HashMap. While the implementation is relatively simple and straightforward, this also limits the maximum number of files to around 200 million in practice. Storing more files on-heap causes garbage collection to get out of control. For Alluxio 2, we aimed to scale up to 1 billion files.
There are a few options here. One approach is to partition the metadata across multiple masters and have the active master go look for relevant metadata. Another way is to have multiple active masters serving different portions of the namespace. In Alluxio 2.0, we started with a third option: keeping all metadata on the same node as the master, but storing a portion of the namespace on disk. This keeps the architecture simple and minimizes the number of network hops required to serve requests.
Disk Representation of inodes
Alluxio supports both read-intensive and write-intensive workloads, so we needed an on-disk storage system that does well for both. We chose the RocksDB key-value store because it combines efficient disk-friendly writes with efficient point lookups and range scans. RocksDB is embeddable and provides a key-value API where all keys and values are byte arrays.
We represent the filesystem tree as a graph of inodes and edges. Inodes contain metadata about a file or directory and are keyed by a 64-bit inode id. Edges define the parent-child relationships in the filesystem tree. The key is the 64-bit inode id for the parent concatenated with the name of the child, and the value is the inode id of the child. This representation allows for efficient creates, deletes, renames, and directory listings.
To find the metadata for the inode at path “/dir/file”, we do the following:
- Look up “/” in the inode table and find id 0
- Look up “0, dir” in the edge table and find id 1
- Look up “1, file” in the edge table and find id 2
- Look up 2 in the inode table and find inode 2.
To list the children of “/”, we do a range scan for all edges starting with 0.
Useful RocksDB features
When defining the edge table, we tell RocksDB that we will often perform range scans on the 8-bytes prefix (the inode id) in each key. RocksDB will then maintain bloom filters for the prefixes in each of its data files, so that range scans can avoid searching in files which don’t contain the key.
RocksDB lets you partition a database into multiple column families. This is convenient for storing the inode and edge tables in the same database instance.
In-memory Inode Cache
When we read inodes out of RocksDB, we deserialize them into Java Inode objects. When storing inodes back into RocksDB, we need to serialize the java objects. This serialization and deserialization are expensive. In metadata benchmarks using RocksDB as the backend, we observed that this serialization and deserialization was using around 50% of the CPU, significantly reducing metadata performance compared to the original heap-only storage.
To mitigate the overhead, we introduced an on-heap cache to store recently-accessed Java Inode objects. The cache provides fast access to commonly used inodes. It also speeds up writes by buffering changes in-memory, then asynchronously flushing them to RocksDB. When the working set fits in the cache, metadata performance is comparable to heap-only storage.Another way we can mitigate the serialization overhead is to use flat buffers. Flat buffers would allow us to access inode data without deserializing. However, to get the benefit of flat buffers, we would need to re-write all of our master code to use flat buffers instead of java Inodes.
Previously, every inode in the inode tree had an embedded read-write lock to support fine-grained concurrency control. With inodes stored in rocksdb, this approach no longer works. Instead of embedding our locks within inodes, we created a lock manager which stores a mapping from inode id to lock for recently-used inodes. Locks are created and cached as needed, with LRU-style eviction.
Our original implementation used Guava’s CacheBuilder, but we found that it consumed too much CPU when dealing with tens of thousands of operations per second. We fixed the bottleneck by implementing our own basic cache built on top of the standard library ConcurrentHashMap.
When the Alluxio master starts, it recovers its previous state by reading the latest metadata checkpoint, plus all edit logs that were written after that checkpoint. Previously, the checkpoint file and edit logs used the same format: list of protocol buffer edit entries. These entries would be read one at a time and applied to the master state. The master can process around 100k of these entries per second. This takes minutes at the scale of tens of millions of files, and hours at the scale of 1 billion files.
To improve startup time, we took advantage of the RocksDB backup engine. To create a checkpoint, we use the backup engine to create a backup, then we create a compressed tar archive of the backup. To restore, we inflate the archive, then use the RocksDB backup engine to restore from the backup. Using RocksDB backups, we sped up checkpoint restoration by 3x, bringing the time to restore 1 billion files to just under an hour.
Scaling up the storage capacity of metadata service is critical for Alluxio. Using RocksDB and leveraging disk as the storage resource combined with a set of performance optimizations, we are able to store 1 billion files and beyond with comparable performance. To learn how to use this off-heap storage feature to store a massive amount of files in Alluxio file system, please read our previous blog “Storing 1 Billion Files in Alluxio 2.0”