INTERACTIVE ANALYTICS WITH PRESTO AND ALLUXIO

Presto with Alluxio brings together two open source technologies to give you better performance and multi-cloud capabilities for interactive analytic workloads. Presto’s open source distributed SQL query engine, coupled with Alluxio, separates storage from compute while preserving data locality, delivers memory-speed response times, and aggregates data from any file or object store.

DATA ENGINEERING PROBLEMS FOR SQL WORKLOADS

While Presto is great for interactive SQL analytics, object stores and remote data can significantly degrade performance:

Network latency when querying remote data is high, which makes interactive Presto queries on remote datasets impractical

Data copies made across environments must be kept in sync, which means more data storage and transfer costs

Slow, inconsistent query performance can occur even on frequently queried data because of slow metadata operations and remote data access

AWS S3, Google Cloud Storage, and other object stores are not designed for analytical workloads

Presto with Alluxio, better together

Presto with Alluxio is a truly separated compute and storage stack, enabling interactive big data analytics on any file or object store.

Alluxio provides a multi-tiered caching layer for Presto, enabling consistently high performance with jobs that run up to 10x faster

Alluxio makes the important data local to Presto, so there are no copies to manage and storage and transfer costs are lower

Alluxio connects to a variety of storage systems and clouds so Presto can query data stored anywhere

Getting started tutorials for Alluxio and Presto

See how Alluxio speeds up Presto queries, even on remote data!

“We use Alluxio to accelerate our ad hoc and real-time analytics with Presto. With Alluxio bringing data locality to Presto, we achieved a 10x performance gain on our queries on data in remote Hadoop clusters. We are excited that the Alluxio and Presto teams will be working closer together to benefit the entire stack.”

– Wensheng Wang, Big Data Platform Architect at JD.com

Accelerating Presto with Alluxio caching

Presto SQL caching on any cloud

With Alluxio, reading cloud data into Presto and sharing that data are automated and transparent. Alluxio can be deployed co-located with Presto and backed by mounted remote storage.

Hybrid cloud analytics

Integrate on-prem data stores like HDFS with Alluxio and Presto and get high performance in your hybrid cloud environment. Burst Presto into the cloud on demand.

Cross datacenter analytics

Access data anywhere it’s located – across regions, sites, or datacenters, in HDFS or object stores – for high performance analytics anywhere.

Want help getting started on Presto caching with Alluxio? Looking for feedback on your project’s architectural design?

Docs: getting started with Presto and Alluxio

Learn how to run Presto to query Alluxio as a distributed cache layer, where the data sources can be AWS S3, Azure Blob Storage, HDFS, or many others. Alluxio helps Presto access data regardless of the data source and transparently caches frequently accessed data (e.g., commonly used tables) in Alluxio distributed storage. Co-locating Alluxio workers with Presto workers improves data locality and reduces I/O latency, especially when the data is remote or the network is slow or congested.
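To make the caching idea concrete, here is a minimal conceptual sketch of a read-through cache sitting between a query engine and a slow remote store. It is a toy model only: the class names `RemoteStore` and `ReadThroughCache` are hypothetical, the least-recently-used eviction policy is an assumption chosen for illustration, and none of this is the Alluxio or Presto API.

```python
from collections import OrderedDict


class RemoteStore:
    """Hypothetical stand-in for a remote object store; counts remote reads."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def read(self, key):
        self.reads += 1  # every call here represents a slow network round trip
        return self._objects[key]


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (eviction policy is an assumption)."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._entries:
            # Cache hit: serve locally and mark the entry as recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: fetch from the remote store, then keep a local copy.
        self.misses += 1
        value = self._store.read(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value


if __name__ == "__main__":
    store = RemoteStore({"orders": b"...", "users": b"..."})
    cache = ReadThroughCache(store, capacity=1)
    for table in ["orders", "orders", "users", "orders"]:
        cache.read(table)
    print(f"remote reads={store.reads} hits={cache.hits} misses={cache.misses}")
```

Repeated reads of the same key are served from the local copy, which is the effect the paragraph above describes for frequently accessed tables; a real deployment is configured through Presto and Alluxio settings rather than application code.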

See the docs >

Case studies

JD.com

Leading online retailer JD.com built an ad hoc SQL query engine to support 400,000 jobs (15+ PB) daily, on a system with more than 15,000 cluster nodes and a total capacity of 210 PB. The two challenges they faced were Presto workers reading remotely from HDFS datanodes and high variance in query performance. With Alluxio and Presto together, JD.com has seen a 10x performance improvement, along with enhanced syncing for better consistency between Alluxio/Presto and HDFS.
See the slides >

NetEase

Online gaming company NetEase, the operator of popular titles like “World of Warcraft” and “Hearthstone”, needed a data platform to handle 30 TB of raw data collected daily. ETL jobs process that raw data in ODS tables, which produces an even larger amount of data. To support high-performance ad hoc queries, they turned to Presto and Alluxio to speed up query response times on their massive datasets.
See the benchmarks >