rdd Archives | Alluxio

Alluxio on EMR: Fast Storage Access and Sharing for Spark Jobs

June 11, 2019 By Chengzhi Zhao

Traditionally, to run a single Spark job on EMR, you might follow these steps: launch a cluster; run the job, which reads data from a storage layer such as S3 and performs transformations on RDDs, DataFrames, or Datasets; and finally write the result back to S3. You end up with something like this.
If you add more Spark jobs across multiple clusters, you could have something like this.
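
The single-job flow described above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the author's code: the helper names `s3_path` and `run_job`, the Parquet format, and the `event_type` column are all hypothetical, and it assumes an existing SparkSession on a cluster that already has S3 access configured.

```python
def s3_path(bucket, key):
    """Build an S3 URI from a bucket name and an object key (hypothetical helper)."""
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


def run_job(spark, input_path, output_path):
    """Read from the storage layer, transform, and write the result back.

    `spark` is an existing SparkSession; both paths are S3 URIs.
    """
    # Step 1: read the input data from the storage layer.
    df = spark.read.parquet(input_path)

    # Step 2: transform. The "event_type" column is an assumption
    # made purely for illustration.
    cleaned = df.filter(df["event_type"].isNotNull())

    # Step 3: send the result back to the storage layer.
    cleaned.write.mode("overwrite").parquet(output_path)
    return output_path
```

With a live session, a call might look like `run_job(spark, s3_path("my-bucket", "raw/events"), s3_path("my-bucket", "clean/events"))`, where the bucket and keys are placeholders. Every additional job or cluster repeats the read from S3, which is the pattern the post goes on to address.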

Effective Spark With Alluxio

Spark Summit East * February 8, 2017

In this talk, we briefly introduce Alluxio, present several ways Alluxio can help Spark be more effective, show benchmark results with Spark RDDs and DataFrames, and describe production deployments with both Alluxio and Spark working together. Along the way, we will provide live demos for some of the use cases.

Best Practices for Using Alluxio with Apache Spark

Spark Summit San Francisco 2017 * June 6, 2017

Alluxio, formerly Tachyon, is a memory-speed virtual distributed storage system that leverages memory to store data and accelerate access to data in different storage systems. Many organizations use Alluxio with Apache Spark, and some of their deployments scale out to petabytes of data. Alluxio can enable Spark to be even more effective in both on-premises and public cloud deployments. Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. In this talk, we briefly introduce Alluxio and present different ways Alluxio can help Spark jobs. We discuss best practices for using Alluxio with Spark, including with RDDs and DataFrames, in both on-premises and public cloud deployments.

Best Practices for Using Alluxio with Spark

Strata Data Conference New York 2017 * September 27, 2017

Haoyuan Li and Cheng Chang explain how Alluxio makes Spark more effective in both on-premises and public cloud deployments and share production deployments of Alluxio and Spark working together. Along the way, they discuss best practices for using Alluxio with Spark, including with RDDs and DataFrames.

Effective caching for Spark RDDs with Alluxio

August 24, 2018 By Gene Pang and Pei Sun

Recently, Qunar deployed Alluxio with Spark in production and found that Alluxio enables Spark streaming jobs to run 15x to 300x faster. In their case study, they described how Alluxio improved their system architecture, and mentioned that some existing Spark jobs would slow down or would never finish because they would run out of memory. After using Alluxio, those jobs were able to finish, because the data could be stored in Alluxio, instead of within Spark.
In this blog, we show that by saving RDDs in Alluxio, Spark applications can keep larger data sets in memory for faster processing and can share RDDs across separate Spark applications.
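
As a rough illustration of saving an RDD in Alluxio and reading it back from another application, the sketch below uses PySpark's standard `saveAsTextFile` and `textFile` calls against an `alluxio://` URI. It is a minimal sketch, not the code from the post: the helper names, host name, and path are hypothetical, 19998 is the default Alluxio master port, and it assumes the Alluxio client jar is already on the Spark classpath.

```python
ALLUXIO_DEFAULT_PORT = 19998  # default Alluxio master RPC port


def alluxio_uri(host, path, port=ALLUXIO_DEFAULT_PORT):
    """Build an alluxio:// URI for a path (hypothetical helper)."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def save_rdd(rdd, uri):
    """Write an RDD to Alluxio as text files.

    Fails if the target directory already exists.
    """
    rdd.saveAsTextFile(uri)


def load_rdd(sc, uri):
    """Read the saved data back as an RDD of lines.

    `sc` is a SparkContext and may belong to a different Spark
    application than the one that wrote the data.
    """
    return sc.textFile(uri)
```

One application could call `save_rdd(rdd, alluxio_uri("alluxio-master", "/rdds/words"))`, and a separate application could later call `load_rdd(sc, alluxio_uri("alluxio-master", "/rdds/words"))`. The data is stored in Alluxio rather than inside either Spark application.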

Alluxio at Spark Summit EU 2017

October 26, 2017 by Gene Pang

We briefly introduce Alluxio and present different ways Alluxio can help Spark jobs, along with best practices. We also discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud.

Tags: alluxio engineering, apache spark, architecture, aws s3, cloud, cloud storage, conference, developer tips, hybrid cloud, machine learning, rdd

Alluxio at Spark Summit East 2017

February 9, 2017 by Haoyuan Li & William Callaghan [eSentire]

In this talk, we briefly introduce Alluxio, present several ways Alluxio can help Spark be more effective, show benchmark results with Spark RDDs & DataFrames, and describe production deployments with both Alluxio and Spark working together.

Tags: alluxio engineering, apache spark, architecture, big data, cloud, compute storage separation, conference, data, performance, rdd, spark, storage

Getting Started with Alluxio and Spark

April 5, 2016 By Adit Madan

Alluxio provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage.

Tachyon: A Memory Centric Storage System for Big Data Computing

March 9, 2016

Strata+Hadoop World 2016 – Tachyon, a memory-centric, fault-tolerant distributed storage system: an introduction to its architecture, performance evaluation, and real-world use cases.

Tags: architecture, big data, data, distributed systems, performance, rdd, spark, storage
