This talk describes a stack of open-source projects to serve high-concurrent and low-latency SQL queries using Presto with Alluxio on big data in the cloud. Deploying Alluxio as a data orchestration layer to access cloud storage object storage (e.g., AWS S3), this architecture greatly enhances the data locality of Presto with distributed and cross-query caching, thus avoids reading the same data repeatedly from the cloud storage.
Slides from our latest talks
Google Cloud Dataproc is a widely used fully managed Spark and Hadoop service to run big data analytics and compute workloads in the cloud. Services like Dataproc reduce hardware spend, eliminate the need to overbuy capacity, and provide business agility. Yet users still face challenges for performance sensitive workloads or workloads running on remote data.
Alluxio is an open source cloud data orchestration platform that increases performance of analytic workloads running on Dataproc by intelligently caching data and bringing back lost data locality. Alluxio also enables users to run compute workloads against on-prem storage like Hadoop HDFS without any app changes.
Chris Crosbie and Roderick Yao from the Google Dataproc team and Dipti Borkar of Alluxio demo how to set up Google Cloud Dataproc with Alluxio so jobs can seamlessly read from and write to Cloud Storage. They also show how to run Dataproc Spark against a remote HDFS cluster.
If you’re a MapR user, you might have concerns with your existing data stack. Whether it’s the complexity of Hadoop, financial instability and no future MapR product roadmap, or no flexibility when it comes to co-locating storage and compute, MapR may no longer be working for you.
Alluxio can help you migrate to a modern, disaggregated data stack using any object store with the similar performance of Hadoop plus significant cost savings.
Join us for this tech talk where we’ll discuss how to separate your compute and storage on-prem and architect a new data stack that makes your object store the core. We’ll show you how to offload your MapR/HDFS compute to any object store and how to run all of your existing jobs as-is on Alluxio + object store.
Many Spark users may not be aware of the differences in memory utilization between caching data directly in-memory into the Spark JVM versus storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open up for discussions
Alluxio, an open source data orchestration technology, helping speed up Dataproc workloads by providing a distributed caching layer in the Dataproc Cluster.
This talk describes a stack of open-source projects to serve high-concurrent and low-latency SQL queries using Presto with Alluxio on big data in the cloud. Deploying Alluxio as a data orchestration layer to access cloud storage object storage (e.g., AWS S3), this architecture greatly enhances the data locality of Presto with distributed and cross-query caching, thus avoids reading same data repeatedly from the cloud storage.
JD.com is China’s largest online retailer. It uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component.
The DBS team was tasked to solve their compute capacity problem. They wanted to provide faster insights and analyze data for a range of use cases but didn’t have the ability to scale compute elastically on-prem.
One use case that challenged them was customer call analysis. With the millions of customer calls they get every year, DBS manages over 50TB of customer data and audio files. This data needed to reside on-prem for compliance reasons. With on-prem compute limitations, they looked to the public cloud to analyze this data and selected “zero-copy” bursting as the best approach.
In this talk, HY discussed the key challenges and trends impacting data engineering, and explores the concept of Data Orchestration.