What can I do to speed up analytics performance on remote data?


Today’s advanced analytics applications run on more datasets than ever before. Where data “lands” is becoming more dispersed, and the separation of compute and storage in modern environments lends itself well to running on these distributed datasets. Data can be stored in a location remote from the compute, such as in a different cloud or data center. This approach is sometimes referred to as a hybrid or multi-cloud data analytics environment. 

Remote data challenges

With more connectors becoming available in popular application frameworks, it’s increasingly possible to query and transform remote data. However, the remoteness of the data poses challenges in the form of network latency and bandwidth. If a dataset is read many times, the network latency adds up many times as well. If there’s a large amount of data, bandwidth may become the limiting factor. And because cloud providers price network traffic asymmetrically, egress charges can become significant. 
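A quick back-of-envelope calculation shows how these costs compound. Every number below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope illustration (all figures are assumptions, not
# measurements): how repeated remote reads accumulate latency, and
# how egress pricing adds up on a remote dataset.

round_trip_ms = 50       # assumed cross-region network round trip
remote_reads = 10_000    # read requests issued against the dataset
latency_overhead_s = remote_reads * round_trip_ms / 1000

dataset_gb = 500         # assumed dataset size
full_scans = 20          # times the whole dataset crosses the network
egress_per_gb = 0.09     # assumed cloud egress price, $/GB
egress_cost = dataset_gb * full_scans * egress_per_gb

print(f"latency overhead: {latency_overhead_s:.0f} s")  # 500 s
print(f"egress cost: ${egress_cost:.2f}")               # $900.00
```

Even with modest assumptions, latency overhead reaches minutes and egress reaches hundreds of dollars per job; both scale linearly with how often the data is re-read.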

Two possible solutions

Copying the data to where the application compute resides is one way to remove these remote data challenges. The difficulties with this approach are maintaining the copies and the additional costs associated with them. Some organizations are also prevented from copying data to other locations by regulatory, compliance, or data sovereignty requirements. 

For the above reasons, at Alluxio we have a saying: “Copying data causes more problems than it solves.” The Alluxio open source project is designed to leave the data where it is and solve for access and locality. Alluxio can be co-located with your distributed application framework. For example, if you have 100 instances of Spark or Presto workers, you’d also run 100 Alluxio workers co-located with those instances. The Alluxio layer provides the data orchestration to bring the active set of data to your applications. 
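As a configuration sketch of how this is wired up (the bucket name, mount point, master hostname, and job names below are placeholder assumptions): the remote store is mounted into the Alluxio namespace, and the compute framework reads through the `alluxio://` scheme instead of going to the remote store directly.

```shell
# Mount a remote S3 bucket into the Alluxio namespace
# (bucket and mount point are placeholders).
alluxio fs mount /remote-data s3://example-bucket/data

# A Spark job then reads through Alluxio; hot data is served
# from the Alluxio workers co-located with the Spark workers.
spark-submit --class example.MyJob my-job.jar \
  alluxio://alluxio-master:19998/remote-data/events.parquet
```

The application code is unchanged except for the URI scheme; Alluxio handles fetching and caching the active working set locally.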

Example of the datasets becoming local with Alluxio

Let’s look at a basic Terasort job running on the following three architectures:

The performance results show Alluxio executing the query at near local levels of performance:

Why? Because Alluxio has brought the dataset local to MapReduce workers:

Feel free to reach out if you’d like the specifics to run this yourself. These results are representative of what you can expect with other compute frameworks: Spark, Hive, Presto, TensorFlow, PyTorch. All popular compute frameworks can benefit from adding Alluxio.
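For readers who want a starting point, here is a hedged sketch of what such a run might look like. The jar path, master hostname, and row count are placeholder assumptions, not the setup used for the results above, and the cluster must have the Alluxio client on the Hadoop classpath:

```shell
# Generate input data into Alluxio, then sort it with TeraSort,
# using the standard Hadoop MapReduce examples jar.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 100000000 alluxio://alluxio-master:19998/teragen

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort alluxio://alluxio-master:19998/teragen \
           alluxio://alluxio-master:19998/terasort
```

Because the input and output paths use the `alluxio://` scheme, the MapReduce tasks read from and write to the co-located Alluxio workers rather than the remote store.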

Click here to read more about how Alluxio handles network latency and bandwidth