Accelerating Write-intensive Data Workloads on AWS S3

Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud. Alluxio v2.0 introduced Replicated Async Write to allow users to complete writes to Alluxio file system and return quickly with high application performance, while still providing users with peace of mind that data will be persisted to the chosen under storage like S3 in the background.

Am I bottlenecking on S3, S3 throttling my requests?

When working with big data analytics and AI, you are likely reading and writing terabytes from S3, in some cases even at very high data transfer rates. There are several reasons why you may find S3 is slowing down your data-intensive job.

Efficient Data Engineering with Apache Spark, Hive, and Alluxio on S3

Alluxio Meetup | Austin *

Welcome to the first event of the Cloud, Data, & Orchestration Austin Meetup! This meetup will feature two talks and an opportunity to engage with other data engineers, developers, and Alluxio users. Thanks to Bazaarvoice for hosting!

Getting Started with the Alluxio-Presto Sandbox

The Alluxio-Presto sandbox is a docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

Accelerating analytics on AWS EMR & AWS S3 with Alluxio in a disaggregated data stack

The AWS EMR service has made it easy for enterprises to bring up a full-featured analytical stack in the cloud that elastically scales based on demand. 

The EMR service along with S3 provides a robust yet flexible platform in the cloud with the click of a few buttons, compared to the highly complex and rigid deployment approach required for on-premise Hadoop Data platforms. However, because data on AWS is typically stored in S3, an object store, you lose some of the key benefits of compute frameworks like Apache Spark and Presto that were designed for distributed file systems like HDFS.

In this white paper, we’ll share some of the challenges that arise because of the impedance mismatch between HDFS and S3, the expectations of analytics workloads of the object store, and how Alluxio with EMR addresses them.

Tags: , , ,

Building Fast SQL Analytics with Presto, Alluxio, and S3

Alluxio Community Office Hour *

Learn how to set up Presto with Alluxio such that Presto jobs can seamlessly read from and write to S3.
Compare the performance between Presto on S3 with Presto and Alluxio on S3.