How do you run Spark on On-Premise S3?

Introducing S3 and Spark

S3 has become the de facto standard API for digital business applications to store unstructured data. To this end, several vendors offer S3-API-compatible storage that lets app developers standardize on the S3 API on-premise, and port these apps to run on other platforms when ready.

So, what are S3 and Spark for newbies?

S3 is an object storage service originally created by Amazon, with first-class support for scalability, data availability, performance, and security. It offers a rich set of APIs that abstract the underlying data store, allowing access from virtually anywhere over the network. In S3, the basic storage units are called objects, which are organized into buckets. Each object is identified by its key and has supporting metadata associated with it.
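
To make the object model concrete, here is a minimal sketch using the boto3 Python client against a hypothetical on-premise S3-compatible endpoint; the endpoint URL, bucket name, keys, and credentials are placeholders, not anything prescribed by S3 itself.

import boto3

# Hypothetical on-premise S3-compatible endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://onprem-s3.example.com:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Objects live in buckets and are addressed by a key; user-defined
# metadata travels with the object.
s3.put_object(
    Bucket="analytics",
    Key="raw/2021/01/events.json",
    Body=b'{"event": "signup"}',
    Metadata={"source": "web"},
)

# The same key retrieves both the data and its metadata.
obj = s3.get_object(Bucket="analytics", Key="raw/2021/01/events.json")
print(obj["Metadata"], obj["Body"].read())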

S3 by itself is not very interesting unless there is some workload running against it. That’s where Spark comes in. Spark is a fast, general-purpose cluster computing framework for large-scale data processing. You can think of Spark as a high-performance batch and stream processing engine. Its libraries make it easy to query data with SQL through Spark SQL, it can be used from several programming languages, and it returns data in the form of Datasets and DataFrames. Spark is maintained by the Apache Software Foundation and is released under the Apache 2.0 license.
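
As a quick illustration of the Spark SQL and DataFrame APIs mentioned above, the PySpark sketch below loads a JSON file and queries it with SQL; the input path and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load a file into a DataFrame and expose it to SQL as a temporary view.
events = spark.read.json("/data/events.json")
events.createOrReplaceTempView("events")

# The same data can be queried with SQL or with the DataFrame API.
top_users = spark.sql(
    "SELECT user, COUNT(*) AS n FROM events GROUP BY user ORDER BY n DESC"
)
top_users.show()

spark.stop()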

Problems

Object stores with an S3-compatible API, such as Cloudian, Minio, and SwiftStack, might appear to behave like filesystems. In reality they are not classic POSIX filesystems, and the differences are significant. Changes made in a typical filesystem are generally visible immediately, whereas changes in an object store are only eventually consistent. To store petabytes of data, object stores replace the classic filesystem directory tree with a simpler key-value model, so overlaying a directory-like structure on an object store slows it down. Additionally, file operations such as rename are expensive, since a rename requires multiple slow HTTP REST calls (copy to the destination, then delete the source) to complete.
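
To see why, note that the S3 API has no rename call at all: a client has to copy each object to its new key and then delete the original. The boto3 sketch below (bucket, prefixes, and keys are hypothetical) shows what "renaming" a directory amounts to.

import boto3

s3 = boto3.client("s3", endpoint_url="http://onprem-s3.example.com:9000")

def rename_object(bucket, src_key, dst_key):
    # One HTTP request to copy the object to its new key...
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    # ...and another to delete the original.
    s3.delete_object(Bucket=bucket, Key=src_key)

# "Renaming" a directory means repeating this for every object under the prefix.
listing = s3.list_objects_v2(Bucket="analytics", Prefix="_temporary/")
for obj in listing.get("Contents", []):
    src = obj["Key"]
    rename_object("analytics", src, src.replace("_temporary/", "output/", 1))

Every copy and delete is a separate HTTP request, and nothing makes the sequence atomic, which is why rename-heavy workflows are both slow and fragile on object stores.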

How does this affect Spark?
  1. Reading and writing data can be significantly slower than working with a normal filesystem.
  2. Some directory structures may be very inefficient to scan.
  3. The output of the Spark job may not be immediately visible.
  4. The rename-based algorithm by which Spark normally commits work when saving a dataset is potentially both slow and unreliable.

Existing Solutions

Spark doesn’t have a native S3 implementation and relies on Hadoop classes to abstract data access. Hadoop provides three filesystem clients for S3 (s3n, s3a, and the block-based s3). Getting Spark to work with S3 through these connectors requires a fair amount of fine-tuning to make the performance of Spark jobs predictable. For example, it is important to measure and keep in check the time Spark spends before starting the real work (on format transformations) and after the work is completed (writing the results back).
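
As a rough sketch of that kind of tuning, the PySpark snippet below points the s3a connector at an on-premise endpoint and sets a few commonly adjusted properties; the endpoint, credentials, bucket, and chosen values are illustrative only, and the properties worth tuning depend on the Hadoop version and the object store.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-on-prem-s3")
    # Where the on-premise S3 endpoint lives and how to authenticate to it.
    .config("spark.hadoop.fs.s3a.endpoint", "http://onprem-s3.example.com:9000")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    # Many on-premise stores are addressed as http://endpoint/bucket/key.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    # One common knob: skip the extra rename pass of the default commit algorithm.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.read.parquet("s3a://analytics/raw/2021/")
df.groupBy("user").count().write.parquet("s3a://analytics/output/user_counts/")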

How Alluxio Helps

Ideally, this process of reading S3 data into Spark and enabling data sharing should be automated and transparent. One can deploy a data orchestration layer such as Alluxio to serve the data to Spark and improve end-to-end model development efficiency. For example, Alluxio can be deployed colocated with the Spark cluster, exposing the data through Alluxio’s POSIX- or HDFS-compatible interfaces, backed by remote storage such as S3 mounted into the Alluxio namespace.
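
From the Spark side this can be as simple as changing the path scheme. The sketch below assumes the S3 bucket has already been mounted into the Alluxio namespace (say under /s3) and that the Alluxio client jar is on Spark’s classpath; the master hostname, port, and paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-alluxio").getOrCreate()

# Spark reads the S3-backed data through Alluxio's HDFS-compatible interface;
# hot data is served from the Alluxio workers colocated with the Spark cluster.
df = spark.read.parquet("alluxio://alluxio-master:19998/s3/raw/2021/")
df.groupBy("user").count().write.parquet("alluxio://alluxio-master:19998/s3/output/")

spark.stop()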