Using Consistent Hashing in Presto to Improve Caching Data Locality in Dynamic Clusters

February 2, 2022

Running Presto with Alluxio is gaining popularity in the community. It avoids long latency reading data from remote storage by utilizing SSD or memory to cache hot dataset close to Presto workers. Presto supports hash-based soft affinity scheduling to enforce that only one or two copies of the same data are cached in the entire cluster, which improves cache efficiency by allowing more hot data cached locally. The current hashing algorithm used, however, does not work well when cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.

Soft Affinity Scheduling

Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (smallest unit of data processing) to the same Presto worker (preferred node). The mapping from a split and a Presto worker is computed by a hashing function on the split, making sure the same split will always be hashed to the same worker. The first time a split is processed, data will be cached on the preferred worker node. When subsequent queries process the same split, these requests will be scheduled to the same worker node again. Since data is already cached locally, no remote read will be necessary.

To improve load balancing and handle flaky workers, two preferred nodes are chosen. If the first choice is busy or unresponsive, the second one is used. Data might be cached at 2 worker nodes physically.

For more details on soft affinity scheduling, please read "Improving Presto Latencies with Alluxio Data Caching".

Hashing Algorithm

Soft affinity scheduling relies on hashing algorithms to compute the mapping between split and worker nodes. Previously, modular function is used:

WorkerID1 =Hash(splitID) % workerCount
WorkerID2 =Hash(splitID) % workerCount + 1

This hashing strategy is simple and works well when the cluster is stable and there’s no change on worker nodes. However, if a worker is temporarily unavailable or down, worker count could change, and the split to worker mapping would be completely reshuffled, causing cache hit rate to drop significantly. If the problematic worker comes back online later, this reshuffle would happen again.

To mitigate this issue, Presto uses all worker count instead of active worker count when using modular to compute worker assignment. However, this can only mitigate rehashing caused by temporary worker nodes offline. There are situations where it makes sense to add / remove workers due to workload fluctuations. In these scenarios, is it possible to still keep a reasonable cache hit rate and not introduce massive rehashing? The solution is consistent hashing.

Consistent Hashing

The concept of Consistent hashing was introduced in 1997 by David Karger as a way of distributing requests among a changing population of web servers. The technique is widely used in load balancing, distributed hash tables, etc.

How does consistent hashing work?

Imagine that the hash output range [0, MAX_VALUE] is mapped onto a circle (connects the MAX_VALUE to 0). To demonstrate how consistent hashing works, assume a Presto cluster of 3 Presto worker nodes and there are 10 splits that are queried repeatedly.

First, the worker nodes are hashed on to the hashing ring. For each split, it will be assigned to the worker that’s next to its hash value on the hashing ring.

In the scenario above, the splits are assigned as following:

Worker1Split1, Split4, Split6, Split8Worker2Split0, Split5, Split7Worker3Split2, Split3, Split9

Removing a worker

Now if worker2 becomes offline for some reason, according to the algorithm, split 0, 5 and 7 will be scheduled to the worker with the next hash value, which is worker2:

Worker1Split0, Split1, Split4, Split5, Split6, Split7, Split8Worker3Split2, Split3, Split9

Only the splits that were hashed to the offline worker (worker3 in our example) need to be rehashed. Other data were not affected. If worker3 comes online later, Split 2, 3 and 9 will again be hashed to worker3, not affecting hit rate on other workers.

Adding a worker

Now if the workload increases and another worker node, worker4, needs to be added to the cluster. Worker4’s hash value is on the hashing ring as following:

In this case split8 will fall into the range of worker4, all other splits’ assignments are not affected, thus cache hit rate on those splits will not be affected. The new assignment is:

Worker1Split1, Split4, Split6Worker2Split0, Split5, Split7Worker3Split2, Split3, Split9Worker4Split8

Virtual nodes

As you can see from above, consistent hashing can guarantee that in the situation of node changes, on average only N_splits / N_nodes of splits need to be rehashed. However, due to lack of randomness in worker distribution, the splits might not be uniformly distributed among all worker nodes. The concept of “virtual nodes” is introduced to mitigate this issue. Virtual nodes can also help redistribute a node’s load to multiple nodes when they are disconnected, which reduces load fluctuation due to cluster instability.

Each physical worker node has multiple virtual nodes mapped to it. Virtual nodes are put on the hashing ring. A split will be assigned to the next virtual node on the hashing ring, thus route to the physical node mapped to the virtual node. The following examples shows a possible scenario where each physical worker node has 3 virtual nodes:

Worker1Worker1_v1Split6Worker1_v2Split0Worker1_v3Split1Worker2Worker2_v1Split5, Split7Worker2_v2Split4Worker2_v3Split9Worker3Worker3_v1Split2, Split3Worker3_v2Woker3_v3Split8

As the number of nodes on the hashing ring increases, the hash space is more likely to be evenly partitioned.

In the scenario of a physical node down, all virtual nodes corresponding to that physical node will be rehashed. But now instead of all splits being rehashed to the same node, they will be distributed across multiple virtual nodes, thus mapping to multiple physical nodes, providing better load balancing.

Following shows when worker3 is removed, Split2 and 3 are rehashed to worker2, while Split8 is rehashed to worker1.

Worker1Worker1_v1Split6Worker1_v2Split4, Split8Worker1_v3Split1Worker2Worker2_v1Split5, Split7Worker2_v2Split0, Split2, Split3Worker2_v3Split9

How to use consistent hashing in Presto?

This is currently an experimental feature that we recently contributed to Presto. Feel free to contact us if you are interested in testing or collaboration.

To use this feature, first enable caching with Presto by following this instruction or this tutorial.

Make sure you choose SOFT_AFFINITY as scheduling policy. In /catalog/hive.properties, add hive.node-selection-strategy=SOFT_AFFINITY.

Enable consistent hashing. In config.properties, add node-scheduler.node-selection-hash-strategy=CONSISTENT_HASHING.

Conclusion

As illustrated above, consistent hashing can minimize the impact of workload assignments when nodes are introduced or removed. Scheduling workload based on consistent hashing can minimize the impact on cache hit rate on existing nodes when a cluster’s worker nodes change. This makes consistent caching a better strategy to use in situations where Presto’s cluster size would be scaled up and down according to workload needs or in situations where deployment does not have total control of the hardware, and workers can be potentially relocated every now and then.

At Alluxio community, we have constantly been improving the integration between Alluxio and data applications (e.g., Presto in this article) both in functionality and usability. With the introduction of consistent hashing in Presto scheduling, Alluxio can better leverage the potential of soft affinity in Presto with higher data locality and cache efficiency, which can be translated to better performance and cost efficiency. We will continue bringing further improvements and optimizations to the data ecosystem.

Share this post

Blog

New Features in Alluxio Enterprise AI 3.6

How Coupang Leverages Distributed Cache to Accelerate Machine Learning Model Training

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:

Time-consuming data preparation and data copy/movement
Difficulty utilizing GPU resources efficiently
High and growing storage costs
Excessive operational overhead maintaining storage for localized data silos

To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.

Uptycs Chooses Alluxio to Power GenAI Natural Language Analytics at Terabyte Scale

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) scaling their analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs Cloud-Native Application Protection Platform (CNAPP) solutions.

Sign-up for a Live Demo or Book a Meeting with a Solutions Engineer

Request a demo