Today, many companies are migrating their data analytics pipelines from on-premises data warehouses to public clouds like AWS for their scalability, elasticity, and cost-efficiency. However, this is far more than simply moving data from HDFS to S3 and then running applications on a stack like EMR instead of Apache Hadoop. Data engineers face many implications and challenges in running the original data-intensive workloads efficiently on the cloud stack.
This article describes lessons from a previous project that moved a data pipeline, originally running on a Hadoop cluster managed by my team, to AWS using EMR and S3. The goal was to leverage the elasticity of EMR to offload operational work, and to make S3 a data lake where different teams could easily share data across projects.
Because our data pipeline needed to process a few hundred TB daily, we identified several I/O-related challenges in moving to the new stack of EMR on S3:
Potential I/O bottleneck: We encountered a performance bottleneck when reading from or writing to S3. Unlike the original pipeline, which ran on a Hadoop cluster where compute and storage nodes are tightly coupled with high data locality, the new stack uses S3 as the data lake, and remote reads from and writes to S3 can be very expensive. Our workloads also perform this I/O multiple times in each pipeline job, which makes the problem worse.
Spiky load handling: We constantly observed spiky I/O traffic, especially when analytics jobs write their output in the final stage. S3 throttles requests when the request rate increases dramatically, yet Hadoop-based analytics applications are typically not designed to cope with I/O throttling, nor do they usually have mechanisms to tune their own output rate. As a result, pipelines must retry throttled work at significant compute cost.
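The usual client-side workaround is to retry with exponential backoff. The sketch below (hypothetical names; a RuntimeError stands in for S3's 503 SlowDown response) shows the pattern, and also why throttling is costly: every retry repeats the request, and the pipeline pays for all of the failed attempts.

```python
import random
import time

def with_backoff(request, max_retries=5, base_delay=0.05):
    """Retry `request` with exponential backoff plus jitter, the common
    workaround when S3 returns 503 SlowDown. Each retry repeats the
    work, which is what makes throttling expensive for a pipeline."""
    for attempt in range(max_retries):
        try:
            return request()
        except RuntimeError:  # stand-in for an S3 "SlowDown" error
            if attempt == max_retries - 1:
                raise
            # Sleep 2^attempt * base seconds, plus jitter, before retrying.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulated S3 PUT that is throttled on the first two calls.
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 SlowDown")
    return "ok"

print(with_backoff(flaky_put))  # succeeds on the third attempt
```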
Expensive S3 rename operations: A rename is very expensive on S3 because it is implemented as a copy of the object followed by a delete of the original. Meanwhile, Spark and Hive typically write to a temporary staging directory and rename the result to the final destination when the job finishes, which puts unnecessary stress on S3. A common technique is to write to HDFS first, which provides an efficient rename implementation, and then copy the data to S3, but this adds an extra step and lengthens completion time.
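The cost can be illustrated with a toy object store (all names hypothetical) where, as on S3, a "rename" is a full copy of the object's bytes followed by a delete rather than a cheap metadata operation:

```python
class ToyObjectStore:
    """Toy model of an object store without atomic rename, mimicking
    the S3 behavior described above. Purely illustrative."""
    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst):
        # No metadata rename: copy the full payload, then delete the source.
        self.objects[dst] = self.objects[src]
        self.bytes_copied += len(self.objects[src])
        del self.objects[src]

store = ToyObjectStore()
# A job commits its output by renaming out of a staging directory,
# the way Spark/Hive committers traditionally do.
store.put("_temporary/part-0000", b"x" * 1024)
store.rename("_temporary/part-0000", "output/part-0000")
print(store.bytes_copied)  # 1024 bytes re-copied just to "move" one file
```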
Architecture: We introduced the open-source data orchestration system Alluxio into our stack. With memory as its main storage tier, Alluxio provides much faster data access than S3. In our tests, we were able to write ~100GB of data to Alluxio in 1 minute and persist it from Alluxio to S3 in the background within 7 minutes, whereas writing directly to S3 was often throttled or took much longer.
On the read path, Alluxio serves as a shared read buffer, which reduces the number of S3 reads needed by each pipeline job. On the write path, we leverage Alluxio in two ways. For intermediate results that can be deleted after use, or transient results that are cheap to recompute if lost, Alluxio is a good fit for boosting write performance compared to S3. For final results that must be persisted to S3 for future consumption, or that are expensive to recompute, Alluxio accelerates and, perhaps more importantly, simplifies writing to a persistent store at the application layer.
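This split maps naturally onto Alluxio's write types: MUST_CACHE keeps data only in Alluxio storage, while ASYNC_THROUGH (available in Alluxio 2.0) makes a file usable as soon as it is in Alluxio and persists it to the under store in the background. The write types are real Alluxio settings; the helper below is purely illustrative.

```python
def choose_write_type(transient: bool) -> str:
    """Map the two output kinds onto Alluxio write types.
    MUST_CACHE and ASYNC_THROUGH are real Alluxio settings; this
    helper itself is a hypothetical sketch of the decision."""
    # MUST_CACHE: data stays in Alluxio only -- fastest, acceptable for
    # outputs that are deletable or cheap to recompute.
    # ASYNC_THROUGH: file is usable immediately, then persisted to the
    # under store (S3) in the background.
    return "MUST_CACHE" if transient else "ASYNC_THROUGH"

print(choose_write_type(transient=True))   # intermediate results
print(choose_write_type(transient=False))  # final results
```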
Benefits: The architecture is greatly simplified by using Alluxio as a buffer that smooths out the I/O stream and avoids throttling and slowdowns. Pipeline applications can simply write to Alluxio without handling spiky loads explicitly in their application logic.
Specifically, Alluxio gives users a mechanism to control the pace of output to S3 and avoid throttling. With Alluxio 1.8, we used the Hadoop distcp command to copy data to S3 and tuned the copy pace with it. With Alluxio 2.0, persisting data to under storage systems like S3 can be asynchronous, so from the application's point of view files are available for use as soon as they are written to Alluxio. Because Alluxio is memory-based and avoids blocking replication, it provides much better performance than EMRFS.
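The pacing idea itself can be sketched as a simple client-side rate limiter (hypothetical names; the `upload` callable stands in for the real copy step, such as a distcp invocation):

```python
import time

def paced_upload(chunks, max_bytes_per_sec, upload):
    """Illustrative client-side pacing: throttle our own output rate so
    the object store never has to throttle us. `upload` is a stand-in
    for the actual copy to S3; all names here are hypothetical."""
    start = time.monotonic()
    sent = 0
    for chunk in chunks:
        upload(chunk)
        sent += len(chunk)
        # If we are ahead of the allowed rate, wait until we are back under it.
        earliest_next = start + sent / max_bytes_per_sec
        delay = earliest_next - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return sent

# Upload 4 bytes under a generous limit; the no-op upload never sleeps here.
total = paced_upload([b"ab", b"cd"], max_bytes_per_sec=10**9, upload=lambda c: None)
print(total)  # 4
```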
Introducing Alluxio into the pipeline also brought significant performance gains and helped address multiple challenges arising from the underlying file system's behavior. In short, Alluxio provides a strong performance advantage as a memory-based shared cache, along with a clean abstraction so applications do not need to handle the specifics of the underlying storage systems they work with.