Getting Started with the Alluxio-Presto Sandbox

The Alluxio-Presto sandbox is a docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

Accelerating analytics on AWS EMR & AWS S3 with Alluxio in a disaggregated data stack

The AWS EMR service has made it easy for enterprises to bring up a full-featured analytical stack in the cloud that elastically scales based on demand. 

The EMR service along with S3 provides a robust yet flexible platform in the cloud with the click of a few buttons, compared to the highly complex and rigid deployment approach required for on-premise Hadoop Data platforms. However, because data on AWS is typically stored in S3, an object store, you lose some of the key benefits of compute frameworks like Apache Spark and Presto that were designed for distributed file systems like HDFS.

In this white paper, we’ll share some of the challenges that arise because of the impedance mismatch between HDFS and S3, the expectations of analytics workloads of the object store, and how Alluxio with EMR addresses them.

Tags: , , ,

Building Fast SQL Analytics with Presto, Alluxio, and S3

Alluxio Community Office Hour *

Learn how to set up Presto with Alluxio such that Presto jobs can seamlessly read from and write to S3.
Compare the performance between Presto on S3 with Presto and Alluxio on S3.

Alluxio on EMR: Fast Storage Access and Sharing for Spark Jobs

Traditionally, if you want to run a single Spark job on EMR, you might follow the steps: launching a cluster, running the job which reads data from storage layer like S3, performing transformations within RDD/Dataframe/Dataset, finally, sending the result back to S3. You end up having something like this.
If we add more Spark jobs across multiple clusters, you could have something like this.

How do you modify location metadata in Hive?

Problem If you have hundreds of external tables defined in Hive, what is the easist way to change those references to point to new locations? That is a fairly normal challenge for those that want to integrate Alluxio into their stack. A typical setup that we will see is that users will have Spark-SQL or … Continued