Burst Presto & Spark workloads to AWS EMR with no data copies

ALLUXIO COMMUNITY OFFICE HOUR

Today’s conventional wisdom holds that network latency between the two ends of a hybrid cloud prevents you from running analytic workloads in the cloud while the data stays on-prem. As a result, most companies copy their data into a cloud environment and maintain that duplicate. All of this makes it challenging to both keep on-prem HDFS data accessible and achieve the desired application performance.

In this talk, we will show you how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud.

In this Office Hour, we will go over:

  • A strategy to embrace the hybrid cloud, including an architecture for running ephemeral compute clusters using on-prem HDFS.
  • An example of running on-demand Presto, Spark, and Hive with Alluxio in the public cloud.
  • An analysis of experiments with TPC-DS to demonstrate the benefits of the given architecture.

Speakers:

Adit Madan is a core maintainer and PMC member of the Alluxio Open Source project. His experience is in distributed systems, storage systems, and large-scale data analytics. He has an M.S. from Carnegie Mellon University and a B.S. from IIT.

Bin Fan is the founding engineer and VP of Open Source at Alluxio, Inc. Prior to Alluxio, he worked for Google to build the next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University on the design and implementation of distributed systems.

Questions? Slack with speakers, users, and many other community members!
Join the Alluxio Global Online Meetup Group to attend more online events.
