Effective caching of Spark Resilient Distributed Datasets (RDDs) with Alluxio

Organizations like Baidu and Barclays have deployed Alluxio with Spark in their architecture, and have achieved impressive benefits and gains. Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their blog post, they described how Alluxio improved their system architecture, and mentioned that some existing Spark jobs would slow down or would never finish because they would run out of memory. Aer using Alluxio, those jobs were able to finish, because the data could be stored in Alluxio, instead of within Spark. In this blog, we investigate how Alluxio can make Spark more effective, and discuss various ways to use Alluxio with Spark. Alluxio helps Spark perform faster, and enables multiple Spark jobs to share the same, memory-speed data. We conducted a few simple and controlled experiments with Spark and Alluxio. For these experiments we used Spark version 2.0.0, and Alluxio 1.2.0.

In this article, we show by saving RDDs in Alluxio, Alluxio can keep larger data sets in-memory for faster Spark applications, as well as enable sharing of RDDs across separate Spark applications.