Many Spark users may not be aware of the differences in memory utilization between caching data directly in-memory into the Spark JVM versus storing data off-heap via an in-memory storage service like Alluxio. In this office hour, I will highlight the two approaches with a demo and open up for discussions
Slides from our latest talks
Alluxio, an open source data orchestration technology, helping speed up Dataproc workloads by providing a distributed caching layer in the Dataproc Cluster.
This talk describes a stack of open-source projects to serve high-concurrent and low-latency SQL queries using Presto with Alluxio on big data in the cloud. Deploying Alluxio as a data orchestration layer to access cloud storage object storage (e.g., AWS S3), this architecture greatly enhances the data locality of Presto with distributed and cross-query caching, thus avoids reading same data repeatedly from the cloud storage.
JD.com is China’s largest online retailer. It uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component.
The DBS team was tasked to solve their compute capacity problem. They wanted to provide faster insights and analyze data for a range of use cases but didn’t have the ability to scale compute elastically on-prem.
One use case that challenged them was customer call analysis. With the millions of customer calls they get every year, DBS manages over 50TB of customer data and audio files. This data needed to reside on-prem for compliance reasons. With on-prem compute limitations, they looked to the public cloud to analyze this data and selected “zero-copy” bursting as the best approach.
In this talk, HY discussed the key challenges and trends impacting data engineering, and explores the concept of Data Orchestration.
This session talks about challenges associated with querying diverse data sources at Walmart and how those are tackled using Presto & Alluxio.
In this talk, we share our lessons in building and rebuilding our monitoring systems and data platforms at Electronic Arts (EA).
Best use cases for Presto from the Data Engineer’s perspective. Also hear about recent Presto advancements such as Cost-Based Optimizer, Kubernetes-native deployment and the project roadmap going forward.