Accelerating Spark Workloads with Alluxio
Spark and Alluxio are two open source technologies that both came out of UC Berkeley’s AMPLab. Used together, Alluxio’s data caching layer improves Spark performance and enables hybrid cloud environments in which Spark jobs run in the cloud while the data stays on-prem. Alluxio restores data locality for Spark’s distributed analytics engine in disaggregated environments and provides an intelligent, highly available data tier for Spark.
Data Engineering Problems for Spark Workloads
Spark provides executor-level caching, but it is limited by garbage collection, so the Spark cache approach does not work for larger datasets. Other problems may include:
Network latency when querying remote data is very high, making interactive Spark queries on remote datasets impractical
Many copies of data need to be created across environments, making them hard to manage and track
Metadata operations like list and rename can be slow and expensive when running on object storage
There is no good approach for sharing large datasets across multiple jobs in a data pipeline
Alluxio + Spark Use Cases
Spark with Alluxio gives you data locality in disaggregated environments and a highly available data tier for Spark.
Data sharing between jobs
Sharing data between processes can be slowed down by network I/O. With Alluxio, storage and compute are separated, so data written by one job can be read by another at memory speed.
Data resilience during application crashes
When storage and compute run in the same process, a process crash loses the cached data, and it must be re-read over the network. With Alluxio, storage and compute are separated, so the data is still available after a crash.
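As a rough illustration of both use cases, the sketch below has one Spark job publish an intermediate dataset to an Alluxio path and a second job read it back. The host name `alluxio-master`, the path `/pipeline/stage1`, and the helper names are assumptions made up for this example; 19998 is Alluxio’s default master RPC port, and the Alluxio client jar is assumed to be on the Spark classpath already.

```python
ALLUXIO_MASTER = "alluxio-master"  # hypothetical host name
ALLUXIO_PORT = 19998               # Alluxio's default master RPC port


def alluxio_uri(path, host=ALLUXIO_MASTER, port=ALLUXIO_PORT):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


# Hypothetical location that both jobs agree on.
SHARED_PATH = alluxio_uri("/pipeline/stage1")


def producer_job(spark, source_uri):
    """First job: read the raw data and publish it for downstream jobs."""
    df = spark.read.parquet(source_uri)
    df.write.mode("overwrite").parquet(SHARED_PATH)


def consumer_job(spark):
    """Second job: read what the producer wrote and return its row count."""
    return spark.read.parquet(SHARED_PATH).count()
```

Each function takes an existing `SparkSession`. Because the dataset lives in Alluxio rather than in an executor’s cache, the consumer can run as a separate application, and the data is still there if the producer’s process exits or crashes after the write completes.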
Deploying Alluxio with Spark
Depending on your environment, you can deploy Alluxio with Spark in the cloud, in a hybrid environment, or with AWS EMR Spark.
AWS EMR Spark
Alluxio enables users to increase the performance and reduce the complexity of analytic workloads running on AWS EMR with S3 as the storage, eliminating the need for a complex HDFS layer. See the tutorial on how to get started.
Single Cloud Caching
With Alluxio, reading cloud data into Spark and sharing it between jobs are automated and transparent. Alluxio can be deployed colocated with Spark and backed by mounted remote storage. Get started with Spark caching and Alluxio in 5 minutes.
Hybrid Cloud
Integrate on-prem data stores like HDFS with Alluxio and Spark and get high performance in your hybrid cloud environment. Burst Spark into the cloud on demand.
Docs: Getting Started with Spark and Alluxio
See how to implement Alluxio as a data access layer so Spark applications can transparently access data in many different types and instances of persistent storage services (e.g., AWS S3 buckets, Azure Object Store buckets, remote HDFS deployments).
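To make the idea of a single access layer concrete, here is a minimal sketch that maps storage-specific URIs onto one Alluxio namespace. The mount table, bucket name, host names, and the `to_alluxio_uri` helper are all hypothetical; in a real deployment the mounts are configured in Alluxio itself, and this code only illustrates the resulting path translation.

```python
# Hypothetical mount table: Alluxio mount point -> under-storage URI prefix.
MOUNTS = {
    "/s3": "s3://example-bucket/data",
    "/hdfs": "hdfs://onprem-namenode:8020/warehouse",
}


def to_alluxio_uri(ufs_uri, master="alluxio-master", port=19998):
    """Translate an under-storage URI into the matching alluxio:// URI."""
    # Check longer prefixes first so nested mounts resolve to the closest one.
    for mount_point, prefix in sorted(MOUNTS.items(), key=lambda kv: -len(kv[1])):
        base = prefix.rstrip("/")
        if ufs_uri == base or ufs_uri.startswith(base + "/"):
            rest = ufs_uri[len(base):]
            return f"alluxio://{master}:{port}{mount_point}{rest}"
    raise ValueError(f"no Alluxio mount covers {ufs_uri}")
```

A Spark application would then pass the translated URI to its usual reader, for example `spark.read.parquet(to_alluxio_uri("s3://example-bucket/data/events"))`, regardless of which storage service holds the data.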
Running EMR Spark with Alluxio
Alluxio can run on EMR to provide functionality beyond what EMRFS currently provides. Aside from the added performance benefits of caching, Alluxio also enables users to run Spark jobs against on-premises storage or even a different cloud provider’s storage, e.g., GCS or Azure Blob Store.
Case Studies
Digital marketing company Bazaarvoice leverages Alluxio for its tiered storage architecture with Apache Spark and AWS S3 to maximize performance and minimize the operating costs of running analytics on AWS EC2. With Alluxio, they see execution time reduced by 10-15x.
Tencent is a leader in social networking, gaming, e-commerce, mobile, and web portals. Tencent News leverages Alluxio with Apache Spark to create a scalable, robust, and performant architecture that provides the best experience to its more than 100 million monthly active users.