ACCELERATING SPARK WORKLOADS WITH ALLUXIO

Spark with Alluxio brings together two open source technologies that originated at UC Berkeley’s AMPLab. Alluxio’s data caching layer improves Spark performance and enables hybrid cloud environments in which Spark jobs run in the cloud while the data stays on-prem. In disaggregated environments, Alluxio restores data locality to Spark’s distributed analytics engine and provides an intelligent, highly available data tier for Spark.

DATA ENGINEERING PROBLEMS FOR SPARK WORKLOADS

Spark provides executor-level caching, but because cached data lives in the executor’s JVM heap by default, it is limited by garbage collection overhead. For larger datasets, the Spark cache approach does not work well. Other problems may include:

  • Network latency to query remote data is very high, making interactive Spark on remote datasets unattainable
  • Many copies of data need to be created across environments, making them hard to manage and track
  • Metadata operations like list and rename can be slow and expensive when running on object storage
  • There is no good approach for sharing large datasets across multiple jobs in a data pipeline

ALLUXIO + SPARK USE CASES

Spark with Alluxio gives you data locality in disaggregated environments and a highly available data tier for Spark.

  • Alluxio provides a multi-tiered caching layer for Spark, with strong consistency for metadata operations and faster performance
  • Alluxio provides fast storage access and data sharing for Spark jobs on EMR, so you don’t have to manage distcp copies manually
  • Alluxio makes the important data local to Spark, so there are no extra copies to manage, which lowers costs
  • Alluxio connects to a variety of storage systems and clouds so Spark can query data stored anywhere

Data sharing between jobs

Inter-process sharing can be slowed down by network I/O. With Alluxio, storage and compute are separated, enabling inter-process sharing to happen at memory speed.
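The idea of two jobs sharing one cached copy can be illustrated with a toy model. This is only a sketch under stated assumptions: `SharedCache`, the two job functions, and the `/warehouse/events` path are hypothetical names made up for illustration, and the class is a plain in-process stand-in, not Alluxio's API or implementation.

```python
class SharedCache:
    """Toy stand-in for a shared caching tier (hypothetical, not Alluxio's API)."""

    def __init__(self, remote_store):
        self.remote_store = remote_store  # dict simulating remote storage
        self.cached = {}                  # data currently held in the shared tier
        self.remote_reads = 0             # number of trips to remote storage

    def read(self, path):
        # Read-through: fetch from remote storage only on a cache miss.
        if path not in self.cached:
            self.remote_reads += 1
            self.cached[path] = self.remote_store[path]
        return self.cached[path]


def job_count_rows(cache, path):
    """First hypothetical job: counts the rows of a dataset."""
    return len(cache.read(path))


def job_sum_values(cache, path):
    """Second hypothetical job: sums the same dataset."""
    return sum(cache.read(path))


remote = {"/warehouse/events": [3, 5, 8]}
cache = SharedCache(remote)

rows = job_count_rows(cache, "/warehouse/events")   # miss: goes to remote storage
total = job_sum_values(cache, "/warehouse/events")  # hit: served from the shared tier
```

In this model the second job never touches remote storage, which is the effect the paragraph above describes: the cached copy outlives the job that first loaded it.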

Data resilience during application crashes

When storage and compute are running in the same process, a process crash requires network I/O to re-read the data. With Alluxio, storage and compute are separated, so the data is still available.


DEPLOYING ALLUXIO WITH SPARK

Depending on your environment, you can deploy Alluxio with Spark in the cloud, in a hybrid environment, or with AWS EMR Spark.

AWS EMR Spark
Alluxio enables users to improve the performance of analytic workloads running on AWS EMR with S3 as the storage, and to reduce their complexity by eliminating the need for a separate HDFS layer. See the tutorial on how to get started.

Single Cloud Caching

With Alluxio, reading cloud data into Spark and sharing it between jobs are automated and transparent. Alluxio can be deployed colocated with Spark and backed by mounted remote storage. Get started with Spark caching and Alluxio in 5 minutes.
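To make "backed by mounted remote storage" concrete, the sketch below models a mount table that maps paths in a single namespace to backing storage locations using longest-prefix matching. It is a conceptual illustration only: `resolve`, the mount points, the bucket name, and the HDFS address are all hypothetical, and the code does not reproduce Alluxio's own mount logic or API.

```python
def resolve(mounts, alluxio_path):
    """Map a namespace path to a backing-storage URI via longest-prefix match.

    `mounts` is a hypothetical mount table: {mount_point: backing_uri}.
    """
    best = None
    for mount_point in mounts:
        prefix = mount_point.rstrip("/")
        matches = alluxio_path == prefix or alluxio_path.startswith(prefix + "/")
        # Keep the most specific (longest) mount point that matches.
        if matches and (best is None or len(prefix) > len(best.rstrip("/"))):
            best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {alluxio_path}")
    suffix = alluxio_path[len(best.rstrip("/")):]
    return mounts[best].rstrip("/") + suffix


# Hypothetical mount table: one cloud bucket and one on-prem HDFS cluster.
mounts = {
    "/mnt/s3": "s3://example-bucket/data",
    "/mnt/hdfs": "hdfs://namenode:8020/warehouse",
}

s3_location = resolve(mounts, "/mnt/s3/2024/events.parquet")
hdfs_location = resolve(mounts, "/mnt/hdfs/users")
```

A Spark job in this picture only ever sees the left-hand paths; which storage system actually holds the bytes is a detail of the mount table.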

Hybrid Cloud

Integrate on-prem data stores like HDFS with Alluxio and Spark to get high performance in your hybrid cloud environment. Burst Spark into the cloud on demand, whenever you need it.

DOCS: GETTING STARTED WITH SPARK AND ALLUXIO

Running Spark with Alluxio

See how to implement Alluxio as a data access layer so Spark applications can transparently access data in many different types and instances of persistent storage services (e.g., AWS S3 buckets, Azure Object Store buckets, remote HDFS deployments).
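As a rough sketch of what transparent access looks like from the application side, a Spark job addresses data through an `alluxio://` URI instead of a storage-specific one. The helper names `alluxio_uri` and `count_lines`, the host `alluxio-master`, and the file path are assumptions made for illustration; 19998 is Alluxio's default master RPC port, and the Spark session is assumed to already have the Alluxio client jar on its classpath.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI (19998 is Alluxio's default master RPC port)."""
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{host}:{port}{path}"


def count_lines(spark, path, host="localhost"):
    """Count the lines of a text file read through Alluxio.

    `spark` is an existing pyspark.sql.SparkSession; it is assumed that the
    Alluxio client jar is already on the driver and executor classpaths.
    """
    return spark.read.text(alluxio_uri(path, host=host)).count()


# Hypothetical host name and file path, for illustration only.
uri = alluxio_uri("/data/events.txt", host="alluxio-master")
```

Given a running cluster, `count_lines(spark, "/data/events.txt", host="alluxio-master")` would read the file through Alluxio; the application code stays the same whether the file is ultimately stored in S3, HDFS, or another mounted system.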

Running EMR Spark with Alluxio

Alluxio can run on EMR to provide functionality beyond what EMRFS currently provides. Aside from the performance benefits of caching, Alluxio also enables users to run Spark jobs against on-premises storage or even a different cloud provider’s storage, e.g., GCS or Azure Blob Storage.

CASE STUDIES

Digital marketing company Bazaarvoice leverages Alluxio for its tiered storage architecture with Apache Spark and AWS S3 to maximize performance and minimize operating costs when running analytics on AWS EC2. With Alluxio, they see execution times reduced by 10-15x.

Tencent

Tencent is a leader in social networking, gaming, e-commerce, mobile, and web portals. Tencent News leverages Alluxio with Apache Spark to create a scalable, robust, and performant architecture that provides the best experience to its more than 100 million monthly active users.