emr Archives | Alluxio

Bursting on-premise analytic workloads to Amazon EMR using Alluxio

December 13, 2020

Data infrastructure on-premises is increasingly complex and cloud adoption is attractive for business agility. Operating a hybrid environment is an approach to start benefiting from cloud elasticity quickly without abandoning the infrastructure on-premises. In this session I will discuss the benefits of using Alluxio’s Data Orchestration Platform to dynamically burst Apache Spark and Presto workloads to Amazon EMR for best performance and agility.

Tags: amazon, data orchestration, data orchestration summit, emr

Bursting Spark or Presto Jobs to AWS using Alluxio

Community Online Office Hour * June 23, 2020

In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the process of manually copying and synchronizing data from the on-premise data lake to cloud storage. This approach allows compute frameworks to decouple from on-premise data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.

“Zero-Copy” Hybrid Cloud for Data Analytics – Strategy, Architecture and Benchmark Report

April 6, 2020

This whitepaper details how to leverage a public cloud, such as Amazon AWS, Google GCP, or Microsoft Azure to scale analytic workloads directly on data on-premises without copying and synchronizing the data into the cloud. We will show an example of what it might look like to run on-demand Presto and Hive with Alluxio in the public cloud using on-prem HDFS. We will also show how to set up and execute performance benchmarks in two geographically dispersed Amazon EMR clusters along with a summary of our findings.

Tags: aws, azure, data analytics, emr, gcp, hdfs, hive, hybrid cloud, presto, public cloud, zero copy

Running Presto with Alluxio on Amazon EMR

February 12, 2020

Many organizations are leveraging EMR to run big data analytics on public cloud. However, reading and writing data to S3 directly can result in slow and inconsistent performance. Alluxio is a data orchestration layer for the cloud, and in this use case it caches data for S3, ensuring high and predictable performance as well as reduced network traffic.

Tags: aws s3, big data, cloud, compute storage separation, emr, office hour, presto, storage

Running Presto with Alluxio on Amazon EMR

Community Online Office Hour * February 12, 2020

Online Meetup: AWS S3 + Alluxio + Presto = ❤️ The Ryte Use Case

October 10, 2019

This online meetup shows why and how we solve some challenging technical issues, improve the speed, and reduce the costs of our AWS EMR Hadoop & Presto -Backend with Alluxio to an awesome level.

Tags: aws, aws s3, emr, hadoop, presto

Getting Started with EMR Hive on Alluxio in 10 Minutes

October 8, 2019 By Bin Fan

This tutorial describes steps to set up an EMR cluster with Alluxio as a distributed caching layer for Hive, and run sample queries to access data in S3 through Alluxio.

AWS S3 + Alluxio + Presto = ❤️ The Ryte Use Case

Alluxio Open Source Online Meetup * October 9, 2019

In this presentation, Ryte’s Chapter lead engineer, Danny Linden, shows why & how we solve some challenging technical issues, improve the speed, and reduce costs of our AWS EMR Hadoop & Presto -Backend with Alluxio to an awesome level!

Effective Analytical Pipelines on AWS Using EMR, Alluxio, and S3

September 27, 2019 By Yunling Cai

This article describes my lessons from a previous project which moved a data pipeline originally running on a Hadoop cluster managed by my team, to AWS using EMR and S3. The goal was to leverage the elasticity of EMR to offload the operational work, as well as make S3 a data lake where different teams can easily share data across projects.

Tag: emr