Building Fast SQL Analytics on Anything with Presto, Alluxio
This talk describes a stack that combines Presto, Alluxio, and cloud object storage systems (e.g., AWS S3) to serve high-concurrency, low-latency SQL queries on big data in the cloud. Presto, an open-source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Alluxio is an open-source data orchestration system that brings data closer to compute and provides a unified data access layer at in-memory speeds. Presto can use Alluxio as a distributed caching tier on top of S3 for hot data, avoiding repeated reads of the same data from the cloud.
This talk covers:
- The architecture of Presto, its separation of compute and storage, its cloud-readiness, and recent advancements in the project such as the Cost-Based Optimizer and Kubernetes support.
- An overview of Alluxio’s key concepts, architecture, and data flow.
- Presto and Alluxio production use cases running on hundreds of nodes, including ING Bank, JD.com, and NetEase Games.
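The caching-tier idea in this abstract can be illustrated with a minimal sketch. It is not Presto or Alluxio code: `FakeObjectStore` and `ReadThroughCache` are hypothetical names invented for illustration, the "remote" store is an in-memory dictionary standing in for S3, and LRU eviction is an assumed policy chosen only to keep the example short. The point it shows is that repeated reads of hot data are served from the cache instead of going back to the object store.

```python
from collections import OrderedDict
from typing import Callable, Dict


class FakeObjectStore:
    """Hypothetical stand-in for a remote object store such as S3."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.remote_reads = 0  # counts how often the "remote" store is hit

    def get(self, key: str) -> bytes:
        self.remote_reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (an assumed policy)."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch
        self._capacity = capacity
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._entries:
            # Hot data: serve locally and mark as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cold data: fetch from the backing store, then keep a local copy.
        data = self._fetch(key)
        self.misses += 1
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return data


if __name__ == "__main__":
    store = FakeObjectStore({"orders.parquet": b"o", "users.parquet": b"u"})
    cache = ReadThroughCache(store.get, capacity=1)
    for _ in range(3):
        cache.read("orders.parquet")  # only the first read goes remote
    print(store.remote_reads, cache.hits, cache.misses)
```

In a real deployment the cache is distributed across worker nodes and populated by the query engine's reads; the sketch only captures the hit/miss behavior on a single process.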
Query Anything, Anywhere with Kubernetes
Building Cloud Native Analytical Pipelines on AWS
With the ease and flexibility that the cloud brings, many data platform teams are building their data pipelines on AWS, leveraging many of the services it provides. For frameworks like Apache Spark and Hive, Amazon EMR, which includes the Hadoop stack, greatly simplifies and speeds up the installation and configuration of clusters. Amazon S3 also provides a cost-effective and easy way to store large amounts of data. However, data engineers still face challenges with workloads that are latency-sensitive, need data sharing across pipelines, or need constant synchronization with S3.
In this talk, Irene shares her experience building data pipelines on AWS and shows how Alluxio, a data orchestration layer, can help address these challenges while eliminating problems caused by S3 throttling or slowdowns.