Learn how to set up EMR Spark with Alluxio so Spark jobs can seamlessly read from and write to S3. See the performance comparison between Spark on S3 with Spark, and Alluxio on S3.
The AWS EMR service has made it easy for enterprises to bring up a full-featured analytical stack in the cloud that elastically scales based on demand.
The EMR service along with S3 provides a robust yet flexible platform in the cloud with the click of a few buttons, compared to the highly complex and rigid deployment approach required for on-premise Hadoop Data platforms. However, because data on AWS is typically stored in S3, an object store, you lose some of the key benefits of compute frameworks like Apache Spark and Presto that were designed for distributed file systems like HDFS.
In this white paper, we’ll share some of the challenges that arise because of the impedance mismatch between HDFS and S3, the expectations of analytics workloads of the object store, and how Alluxio with EMR addresses them.
This webinar highlights a simple solution is to run Spark on Alluxio as a distributed cache for S3. Alluxio stores data in memory close to Spark, providing high performance, in addition to providing data accessibility and abstraction for deployments in both public and hybrid clouds.
Traditionally, if you want to run a single Spark job on EMR, you might follow the steps: launching a cluster, running the job which reads data from storage layer like S3, performing transformations within RDD/Dataframe/Dataset, finally, sending the result back to S3. You end up having something like this.
If we add more Spark jobs across multiple clusters, you could have something like this.
Problem If you have hundreds of external tables defined in Hive, what is the easist way to change those references to point to new locations? That is a fairly normal challenge for those that want to integrate Alluxio into their stack. A typical setup that we will see is that users will have Spark-SQL or … Continued
Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.