On Demand Video

Bursting Apache Spark Workloads to the Cloud on Remote Data

ALLUXIO COMMUNITY OFFICE HOUR

Running Spark analytic workloads against data that lives in another data center or cloud can be challenging. Network I/O can also become the bottleneck for Spark jobs that need to read a large amount of data. A common solution is to deploy an HDFS cluster closer to Spark as a caching layer, manually copy the input data into HDFS first, and purge it afterward. But this ETL process is both time-consuming and error-prone.

A simpler and more efficient solution is to run Spark on Alluxio, used as a distributed cache on top of the remote data source. By caching data transparently based on access patterns and keeping the working set closer to Spark, Alluxio gives Spark jobs much higher I/O throughput and better data locality. Alluxio also provides data accessibility and abstraction for deployments in hybrid and multi-cloud environments.

In this Office Hour, we will go over how to:

  • Burst on-prem Spark workloads to the cloud with Alluxio so Spark can seamlessly read from and write to remote data storage
  • Use Alluxio as the input/output for Spark applications
  • Save and load Spark RDDs and DataFrames with Alluxio
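
As a rough illustration of the last two topics, here is a minimal PySpark-style sketch of reading from and writing to Alluxio through `alluxio://` URIs. It is not taken from the session itself. The host name `alluxio-master`, the paths under `/data`, and the helper names `alluxio_uri` and `run_job` are hypothetical. Port 19998 is the default Alluxio master RPC port and may differ in your deployment. The sketch also assumes the Alluxio client jar is already on the Spark driver and executor classpaths.

```python
# Hypothetical connection details: adjust to your own deployment.
ALLUXIO_MASTER = "alluxio-master"  # assumed host name
ALLUXIO_PORT = 19998               # default Alluxio master RPC port


def alluxio_uri(path, host=ALLUXIO_MASTER, port=ALLUXIO_PORT):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{host}:{port}{path}"


def run_job(spark):
    """Use Alluxio as input and output for an RDD and a DataFrame.

    `spark` is an existing SparkSession. Both writes fail if the
    output path already exists, which is Spark's default behavior.
    """
    sc = spark.sparkContext

    # RDD: read text through Alluxio, transform, and save back.
    lines = sc.textFile(alluxio_uri("/data/input.txt"))
    doubled = lines.map(lambda line: line + line)
    doubled.saveAsTextFile(alluxio_uri("/data/output_rdd"))

    # DataFrame: read the same input, save as Parquet, and load it again.
    df = spark.read.text(alluxio_uri("/data/input.txt"))
    df.write.parquet(alluxio_uri("/data/output_df.parquet"))
    return spark.read.parquet(alluxio_uri("/data/output_df.parquet"))
```

To try it, pass in a session, for example `run_job(SparkSession.builder.getOrCreate())`. Because Spark only sees a file system URI, the same job code works whether the underlying data sits on-prem or in a cloud store mounted into Alluxio.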

Speaker:

Bin Fan is the founding engineer and VP of Open Source at Alluxio, Inc. Prior to Alluxio, he worked for Google to build the next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University on the design and implementation of distributed systems.

Questions? Chat on Slack with the speakers, users, and many other community members!
Join the Alluxio Global Online Meetup Group to attend online meetups like this one!

Video:

Slides: