Making the Impossible Possible with Alluxio: Accelerate Spark Jobs from Hours to Seconds

Gianmario Spacagna, Data Scientist at Barclays, and Harry Powell, Head of Advanced Analytics, describe how they iteratively process raw data taken directly from the central data warehouse in Spark, and why Alluxio is their key enabling technology.

Alluxio is the in-memory storage layer for our data. Any Spark application can access that data through the standard file system API, just as it would with HDFS. Alluxio lets us run transformations and explorations on large datasets in memory while keeping the integration with our existing applications simple.
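To make the "same file system API as HDFS" point concrete, here is a minimal Python sketch, not the code used at Barclays. It assumes the Alluxio client is available on the Spark classpath; the host names, the example path, and the helper names `to_alluxio_uri` and `load_lines` are hypothetical and chosen only for illustration. Port 19998 is Alluxio's default master port.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical Alluxio master address; 19998 is Alluxio's default master port.
ALLUXIO_MASTER = "alluxio-master:19998"


def to_alluxio_uri(uri, master=ALLUXIO_MASTER):
    """Point an existing file system URI (for example an hdfs:// one) at Alluxio.

    Only the scheme and authority change; the path is kept as-is.
    """
    parts = urlsplit(uri)
    return urlunsplit(("alluxio", master, parts.path, "", ""))


def load_lines(sc, uri):
    """Read a text dataset as an RDD of lines.

    `sc` is an existing SparkContext. The call is the same whether `uri`
    uses the hdfs:// or the alluxio:// scheme.
    """
    return sc.textFile(uri)


# Hypothetical HDFS location of the raw data.
hdfs_uri = "hdfs://namenode:8020/data/raw/transactions"
alluxio_uri = to_alluxio_uri(hdfs_uri)
```

With a running SparkContext `sc`, `load_lines(sc, alluxio_uri)` would be called exactly as `load_lines(sc, hdfs_uri)` is; the application code does not otherwise change.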

In this article, we first present how our existing infrastructure loads raw data from an RDBMS and uses Spark to transform it into a typed RDD collection. Then, we discuss the issues we face with our existing methodology. Next, we show how we deploy Alluxio and how Alluxio greatly improves the workflow by providing the desired in-memory storage and minimizing the loading time at each iteration. Finally, we discuss some future improvements to the overall architecture.