aws Archives

“Zero-Copy” Hybrid Bursting with no App Changes

June 28, 2019

This whitepaper details how to leverage any public cloud (AWS, Google Cloud Platform, or Microsoft Azure) to scale analytics workloads directly on on-prem data without copying and synchronizing the data into the cloud. We walk through an example of running on-demand Starburst Presto, Spark, and Hive with Alluxio in the public cloud against on-prem HDFS.

The paper also includes a real-world case study of a leading hedge fund based in New York City, which deployed large clusters of Google Compute Engine VMs with Spark and Alluxio using on-prem HDFS as the underlying storage tier.

Tags: apache hive, apache spark, aws, case study, hybrid cloud, presto

Accelerating analytics on AWS EMR & AWS S3 with Alluxio in a disaggregated data stack

June 28, 2019

The AWS EMR service has made it easy for enterprises to bring up a full-featured analytical stack in the cloud that elastically scales based on demand.

The EMR service, along with S3, provides a robust yet flexible platform in the cloud with a few clicks, compared to the highly complex and rigid deployment approach required for on-premises Hadoop data platforms. However, because data on AWS is typically stored in S3, an object store, you lose some of the key benefits of compute frameworks like Apache Spark and Presto that were designed for distributed file systems like HDFS.

In this white paper, we’ll share some of the challenges that arise from the impedance mismatch between HDFS and S3, the expectations that analytics workloads place on the object store, and how Alluxio with EMR addresses them.

Tags: aws, aws s3, compute storage separation, emr

Tech Talk: Accelerate Spark Workloads on S3

June 28, 2019

While running analytics workloads using EMR Spark on S3 is a common deployment today, many organizations face performance and consistency issues. EMR can be bottlenecked when reading large amounts of data from S3, and sharing data across multiple stages of a pipeline can be difficult because S3 is eventually consistent for read-your-own-write scenarios.

A simple solution is to run Spark on Alluxio as a distributed cache for S3. Alluxio stores data in memory close to Spark, providing high performance, in addition to providing data accessibility and abstraction for deployments in both public and hybrid clouds.

Tags: aws, cloud, compute storage separation, data, data orchestration, emr, hybrid cloud, on-prem object storage, spark, tech talk

Best Practices for Using Alluxio with Apache Spark

June 6, 2017

Spark Summit SF 2017 – We briefly introduce Alluxio and present different ways Alluxio can help Spark jobs, along with best practices. We also discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud.

Tags: alluxio engineering, apache spark, aws, aws s3, cloud, cloud storage, conference, machine learning, spark

Alluxio Bay Area Joint Meetup at Intel 2016

June 15, 2016 by Ziya Ma [Intel], Calvin Jia, Jiri Simsa, Haoyuan Li, & Gene Pang

At this first meetup since the rebranding from Tachyon to Alluxio, we will first present exciting updates and new developments in the community, followed by the many new features and improvements in the Alluxio 1.0 and 1.1 releases.

Tags: alluxio engineering, architecture, aws, aws s3, big data, cloud, cloud storage, compute, compute storage separation, data, meetup, performance, spark
