Designing the Presto Local Cache at Uber A collaboration between Uber and Alluxio part 2

May 31, 2022

In the previous blog, we introduced Uber’s Presto use cases and how we collaborated to implement Alluxio local cache to overcome different challenges in accelerating Presto queries. The second part discusses the improvements to the local cache metadata.

File Level Metadata for Local Cache

Motivation

First, we want to prevent stale caching. The underlying data files might be changed by the third-party frameworks. Note that this situation might be rare in Hive tables but very common in Hudi tables.

Second, the daily reads of unduplicated data from HDFS can be large, but we don't have enough cache space for caching all the data. Therefore, we can introduce scoped quota management by setting a quota for each table.

Third, metadata should be recoverable after server restart. We have stored metadata in the local cache in memory instead of disk, which makes it impossible to recover metadata when the server is down and restarted.

High-Level Approach

Therefore, we propose the File Level Metadata, which holds and keeps the last modified time and the scope of each data file we cached. The file-level metadata store should be persistent on disk so the data will not disappear after restarting.

With the introduction of file-level metadata, there will be multiple versions of the data. A new timestamp is generated when the data is updated, corresponding to a new version. A new folder storing the new page is created corresponding to this new timestamp. At the same time, we will try to remove the old timestamp.

Cache Data and Metadata Structure

As shown above, we have two folders corresponding to two timestamps: timestamp1 and timestamp2. Usually, when the system is running, there will not be two timestamps simultaneously because we will delete the old timestamp1 and keep only timestamp2. However, in the case of a busy server or high concurrency, we may not be able to remove the timestamp on time, in which case we may have two timestamps at the same time. In addition, we maintain a metadata file that holds the file information in protobuf format and the latest timestamp. This ensures that Alluxio's local cache only reads data from the latest timestamp. When the server restarts, the timestamp information is read from the metadata file so that the quota and last modified time can be managed correctly.

Metadata Awareness

Cache Context

Since Alluxio is a generic caching solution, it still needs the compute engine, like Presto, to pass the metadata to Alluxio. Therefore, on the Presto site, we use the HiveFileContext. For each data file from the Hive table or Hudi table, Presto creates a HiveFileContext. Alluxio makes use of this information when opening a Presto file.

When calling openFile, Alluxio creates a new instance of PrestoCacheContext, which holds the HiveFileContext and has the scope (four levels: database, schema, table, partition), quota, cache identifier (i.e., the md5 value of the file path), and other information. We will pass this cache context to the local file system. Alluxio can thus manage metadata and collect metrics.

Per Query Metrics Aggregation on Presto Side

In addition to passing data from Presto to Alluxio, we can also call back to Presto. When performing query operations, we will know some internal metrics, such as how many bytes of data read hit the cache and how many bytes of data were read from external HDFS storage.

As shown below, we pass the HiveFileContext containing the PrestoCacheContext to the local cache file system (LocalCacheFileSystem), after which the local cache file system calls back (IncremetCounter) to the CacheContext. Then, this callback chain will continue to the HiveFileContext, and then to RuntimeStats.

In Presto, RuntimeStats is used to collect metrics information when executing queries so that we can perform aggregation operations. After that, we can see the information about the local cache file system in Presto's UI or the JSON file. We can make Alluxio and Presto work closely together with the above process. On the Presto side, we have better statistics; on the Alluxio side, we have a clearer picture of the metadata.

Future Work

Performance Tuning

Because the callback process described above makes the CacheContext's lifecycle grow considerably, we have encountered some problems with rising GC latency, which we are working to address.

Adopt Semantic Cache (SC)

We will implement Semantic Cache (SC) based on the file-level metadata we propose. For example, we can save the data structures in Parquet or ORC files, such as footer, index, etc.

More Efficient Deserialization

To achieve more efficient deserialization, we will use flatbuf instead of the protobuf. Although protobuf is used in the ORC factory to store metadata, we found that the ORC’s metadata brings more than 20-30% of the total CPU usage in Alluxio’s collaboration with Facebook. Therefore, we are planning to replace the existing protobuf with a flatbuf to store cache and metadata, which is expected to improve the performance of deserialization significantly.

To summarize, together with the previous blog, this two-part blog series shares how we created a new caching layer of the hot data needed for our Presto fleet based on recent open-source collaboration between Presto and Alluxio communities at Uber. This architecturally simple and clean approach can significantly reduce HDFS latency with managed SSD and consistent hashing-based soft affinity scheduling. Join 9000+ members in our community slack channel to learn more.

About the Authors

Chen Liang is a senior software engineer at Uber's interactive analytics team, focusing on Presto. Before joining Uber, Chen was a staff software engineer at LinkedIn's Big Data platform. Chen is also a committer and PMC member of Apache Hadoop. Chen holds two master's degrees from Duke University and Brown University

Dr. Beinan Wang is a software engineer from Alluxio and is the committer of PrestoDB. Before Alluxio, he was the Tech Lead of the Presto team in Twitter, and he built large-scale distributed SQL systems for Twitter's data platform. He has twelve-year experience working on performance optimization, distributed caching, and volume data processing. He received his Ph.D. in computer engineering from Syracuse University on the symbolic model checking and runtime verification of distributed systems.

Share this post

Blog

Diagnose & Fix Slow Distributed Training

Got periodic drops in GPU utilization? GPU Stalls? Training capacity grinding to a halt? Learn how checkpoint writes could be the cause of your suddent, yet periodic drops in training performance.

20x Faster Training Data Reads with Alluxio and Ray on Anyscale: A Cross-Region Benchmark

Alluxio and Anyscale benchmark achieves 20x faster cross-region data reads for AI training workloads on GCS.

Alluxio AI 3.9 Brings Checkpoint Acceleration to Any AI Training Framework

Alluxio AI 3.9 introduces POSIX Write Cache, eliminating the checkpoint write bottleneck in distributed training with 7.6 GiB/s per node throughput and sub-2ms P99 latency. Get all of the details here!

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo