Accelerating Slow Write-intensive Spark Data Workloads on AWS S3

Introduction

When running write-intensive ETL jobs in the cloud using Spark, MapReduce or Hive, directly writing the output to AWS S3 often yields slow, unstable performance or complicated error handling due to throughput throttling, object storage semantics among many other reasons. Some users choose to first write the output to an HDFS cluster and then upload the staging data to AWS S3 using tools like s3-dist-cp. Though optimizing the performance or eliminating corner cases for object storage, the second approach adds both cost and complexity in maintaining a staging HDFS, as well as increases the delay for the data to be available for downstream applications.

Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write to allow users to complete writes to Alluxio file system and return quickly with high application performance, while still providing users with peace of mind that data will be persisted to the chosen under storage like S3 in the background.

What is Replicated Async Write

Assume that a Spark job is writing a large data set to AWS S3. To ensure that the output files are quickly written and keep highly available before persisted to S3, pass the write type ASYNC_THROUGH and a target replication level to Spark (see description in docs),

alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.user.file.replication.durable=<N>

and set the output to the Alluxio file system client (with URI scheme: “alluxio://”)

scala> myData.saveAsTextFile("alluxio://<MasterHost>:19998/Output")

This Spark job will first synchronously write a new file to Alluxio file system with N copies before returning. In the background, the Alluxio file system will persist a copy of the new data to the Alluxio under storage like S3.

By default, the property alluxio.user.file.replication.durable is set to 1 and writing a file to Alluxio using ASYNC_THROUGH completes at memory speed if there is a colocated Alluxio worker. But if any worker crashes or restarts before the data is persisted, the data can be lost. Increasing alluxio.user.file.replication.durable to some value greater than 1 ensures the data is replicated at N different workers in Alluxio to survive N-1 worker failures before being persisted, at the cost of temporarily storing N times more data in Alluxio.

Note that, before the file is eventually persisted, in case any of these N workers fail, Alluxio will take care of the failure by re-replicating the under-replicated blocks and ensuring sufficient copies of the file; also free command will not work before the file is persisted. Once the file is persisted to the under storage, the in-Alluxio copies of this file can become evictable if another configuration allows this.

Enable Replicated Async Write in Alluxio

Configure the Alluxio property alluxio.user.file.replication.durable to a reasonable number to provide fault tolerance as well as set alluxio.user.file.writetype.default to ASYNC_THROUGH on all client applications. Note that these properties are client-side only, meaning they don’t require restarts of any Alluxio processes to use them!

For example, put the following lines into conf/alluxio-site.properties to write with three initial copies before data is persited:

alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.user.file.replication.durable=3

Alluxio command-line is also supported to use replicated async write. For example, the copyFromLocal command can copy data from the local file system to Alluxio using replicated async write:

$ ./bin/alluxio fs -Dalluxio.user.file.writetype.default=ASYNC_THROUGH \
  -Dalluxio.user.file.replication.durable=3 copyFromLocal <localPath> <alluxioPath>

Comparison to Other Alluxio Write Types

Replicated async write is one write type that aims to balance the requirements in write speed and data persistence, but also at the transient cost of more storage capacity. Alluxio also provides a couple of additional write types realizing different trade-offs in the design space.

For an application which writes its data through Alluxio, unless the user is writing with THROUGH or CACHE_THROUGH, the data will not exist in the under storage when writing. The problem is that those types of writes are slower than their MUST_CACHE and ASYNC_THROUGH counterparts. When a file is written using MUST_CACHE or ASYNC_THROUGH, if the blocks of the file only exist on a single worker and that worker fails or crashes, then the written data will be unrecoverable. When Alluxio provides synchronous replicas of data at the time it is written, then it combines the performance of using MUST_CACHE while still providing fault-tolerance.