Bursting Spark or Presto Jobs to AWS using Alluxio



The hybrid cloud model, where cloud resources run Spark or Presto jobs against data stored on-premises, is an appealing way to reduce resource contention in on-premises environments while also lowering overall costs. One key flaw of the hybrid model is the overhead of transferring data between the two environments. To keep analytics jobs performing as well as they would if the entire workload ran on-premises, the compute application must achieve data and metadata locality.

In this office hour, we demonstrate how a “zero-copy burst” solution speeds up Spark and Presto queries in the public cloud while eliminating the need to manually copy and synchronize data from the on-premises data lake to cloud storage. This approach allows compute frameworks to decouple from on-premises data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.

We will cover:

  • Typical challenges of moving data to the cloud and expanding compute capacity
  • Details of the “zero-copy” hybrid cloud solution for burst computing
  • A demo of running Presto analytic queries on remote on-premises HDFS data with Alluxio deployed in AWS EMR


Lu Qiu has been involved in open source software for many years and is currently a software engineer at Alluxio. Lu develops easier ways to integrate Alluxio in public cloud environments. Lu received an M.S. degree in Data Science from George Washington University.

Bin Fan is the founding engineer and VP of Open Source at Alluxio, Inc. Prior to Alluxio, he worked at Google, building its next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, working on the design and implementation of distributed systems.

Questions? Slack with speakers, users, and many other community members!
Join the Alluxio Global Online Meetup Group to attend more online events.