INTERACTIVE ANALYTICS WITH PRESTO AND ALLUXIO
Presto with Alluxio brings together two open source technologies to give you better performance and multi-cloud capabilities for interactive analytic workloads. Presto's open source distributed SQL query engine, coupled with Alluxio, enables true separation of storage and compute while preserving data locality. Together they provide memory-speed response times and can aggregate data from any file or object store.
DATA ENGINEERING PROBLEMS FOR SQL WORKLOADS
While Presto is great for interactive SQL analytics, object stores and remote data can significantly affect performance:
Network latency to query remote data is very high, making interactive Presto on remote datasets unattainable
Data copies made across environments must be kept in sync, which means more data storage and transfer costs
Slow, inconsistent query performance can occur even on frequently queried data due to slow metadata operations and remote data access
AWS S3, Google Cloud Storage, and other object stores are not designed for analytical workloads
PRESTO WITH ALLUXIO, BETTER TOGETHER
Presto with Alluxio is a truly separated compute and storage stack, enabling interactive big data analytics on any file or object store.
Alluxio provides a multi-tiered caching layer for Presto, enabling consistently high performance with jobs that run up to 10x faster
Alluxio makes the important data local to Presto, so there are no copies to manage (and lower costs)
Alluxio connects to a variety of storage systems and clouds so Presto can query data stored anywhere
“We use Alluxio to accelerate our ad hoc and real-time analytics with Presto. With Alluxio bringing data locality to Presto, we achieved a 10x performance gain on our queries on data in remote Hadoop clusters. We are excited that the Alluxio and Presto teams will be working closer together to benefit the entire stack.”
– Wensheng Wang, Big Data Platform Architect at JD.com
ACCELERATING PRESTO WITH ALLUXIO CACHING
Presto SQL caching on any cloud
Reading cloud data into Presto and enabling data sharing is automated and transparent with Alluxio. Alluxio can be deployed co-located with Presto and backed by mounted remote storage.
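To make the idea of mounted remote storage concrete, here is a minimal sketch of how a single logical namespace could map path prefixes to different backing stores. It is an illustration only, not Alluxio's implementation or API: the `MountTable` class, its methods, and the example URIs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Toy namespace that maps logical path prefixes to backing-store URIs.

    Hypothetical illustration; not Alluxio's actual mount mechanism.
    """

    mounts: dict[str, str] = field(default_factory=dict)

    def mount(self, logical_prefix: str, backing_uri: str) -> None:
        # Normalize so "/s3/" and "/s3" refer to the same mount point.
        key = logical_prefix.rstrip("/") or "/"
        self.mounts[key] = backing_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Collect every mount point that contains the path, then use the
        # longest one so nested mounts take precedence over the root.
        candidates = [
            prefix
            for prefix in self.mounts
            if logical_path == prefix
            or logical_path.startswith(prefix.rstrip("/") + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {logical_path}")
        best = max(candidates, key=len)
        suffix = logical_path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{suffix}" if suffix else base


# Example URIs below are placeholders, not real endpoints.
table = MountTable()
table.mount("/", "hdfs://namenode:8020/warehouse")
table.mount("/s3", "s3://example-bucket/data/")
```

With these two mounts, a query engine could address `/s3/orders/part-0.parquet` and `/logs/a.orc` through the same namespace while the data stays in S3 and HDFS respectively.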
Hybrid cloud analytics
Integrate on-prem data stores like HDFS with Alluxio and Presto and get high performance in your hybrid cloud environment. Burst Presto into the cloud on demand.
Cross datacenter analytics
Access data wherever it is located, whether across regions, sites, or datacenters, in HDFS or object stores, for high-performance analytics.
Want help getting started on Presto caching with Alluxio? Looking for feedback on your project’s architectural design?
DOCS: GETTING STARTED WITH PRESTO AND ALLUXIO
Learn how to run Presto to query Alluxio as a distributed cache layer, where the data sources can be AWS S3, Azure Blob Storage, HDFS, or many others. Alluxio helps Presto access data regardless of the data source and transparently caches frequently accessed data (e.g., commonly used tables) in Alluxio distributed storage. Co-locating Alluxio workers with Presto workers improves data locality and reduces I/O latency, especially when the data is remote or the network is slow or congested.
See the docs >
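The transparent caching of frequently accessed data described above can be pictured as a read-through cache: the first read of a file goes to the remote store, and later reads are served locally until the entry is evicted. The sketch below is a simplified illustration under stated assumptions, not how Alluxio is implemented. The `ReadThroughCache` class, the `fetch_remote` callback, and the least-recently-used eviction policy are all assumptions chosen for brevity.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Toy read-through cache with least-recently-used (LRU) eviction.

    Hypothetical illustration; real cache layers use tiering and
    configurable eviction policies that are not modeled here.
    """

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity_bytes: int) -> None:
        self._fetch_remote = fetch_remote  # stands in for a remote store read
        self._capacity = capacity_bytes
        self._used = 0
        self._blocks = OrderedDict()  # path -> bytes, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._blocks:
            # Cache hit: serve locally and mark the entry as recently used.
            self.hits += 1
            self._blocks.move_to_end(path)
            return self._blocks[path]
        # Cache miss: go to the remote store, then keep a local copy.
        self.misses += 1
        data = self._fetch_remote(path)
        self._admit(path, data)
        return data

    def _admit(self, path: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # too large to cache; it is still returned to the caller
        # Evict least-recently-used entries until the new data fits.
        while self._used + len(data) > self._capacity:
            _, evicted = self._blocks.popitem(last=False)
            self._used -= len(evicted)
        self._blocks[path] = data
        self._used += len(data)
```

In this model the caller never decides what to cache. It only calls `read`, which is the sense in which the caching is transparent to the query engine.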
Leading online retailer JD.com built an ad-hoc SQL query engine to support 400,000 jobs (15+ PB) daily, on a system with more than 15,000 cluster nodes and a total capacity of 210 PB. Two challenges they faced were around Presto workers reading remotely from HDFS datanodes and a large query variance. With Alluxio and Presto together, JD.com has seen a 10x performance improvement, including enhanced syncing for better consistency between Alluxio/Presto and HDFS.
See the slides >
Online gaming company NetEase, the operator of popular titles like “World of Warcraft” and “Hearthstone”, needed a data platform to handle 30 TB of raw data collected daily. ETL jobs process that raw data into ODS tables, which makes the total data volume even larger. To support high-performance ad hoc queries, they turned to Presto and Alluxio to speed up query response times on their massive datasets.
See the benchmarks >