HOW TO USE Presto on AWS EMR
Big data requires scalable, resilient storage, but it’s only valuable if you can derive insights from it. Amazon EMR provides scalable compute in the cloud, including interactive queries with Presto, for big data in S3 storage.
What is Presto?
Presto is an open-source, distributed SQL query engine, optimized for running interactive queries on large data sets and across multiple sources. Designed to run on a computing cluster, Presto leverages MPP (Massively Parallel Programming) principles, splitting queries into tasks distributed across multiple worker nodes, and operates in memory to enable fast query execution. This high performance enables users to interact with data in real time.
Although originally built to work with Hadoop, Presto includes connectors for a wide range relational and non-relational data sources, including PostgreSQL, MySQL, SQL Server, Cassandra, Redis and MongoDB as well as columnar file formats such as Parquet and Avro. Presto fully supports ANSI SQL, making it ideal for data analysts and developers.
What is the history of Presto?
Presto was developed by Facebook in 2012 to improve performance for interactive queries on their 300-petabyte data warehouse. Facebook had previously built Apache Hive to provide a SQL-like interface for querying data stored in Hadoop. While queries with Hive were easier to write than Java MapReduce jobs, performance was too slow for interactive queries, which require near-real-time updates. By operating in memory rather than writing intermediate steps to disk, Presto significantly improved performance for interactive queries.
Presto was rolled out within Facebook in spring 2013 and was open sourced under the Apache license in November that year. A number of other companies started using Presto and contributing to its development. In September 2019, Facebook announced that it was handing over governance of PrestoDB to the newly created Presto Foundation, hosted under the Linux Foundation. The Presto Foundation’s purpose is to create a neutral, open community with transparent technical leadership and direction. Under this model, Facebook is now one contributor among many.
Who uses Presto?
Facebook employees continue to use Presto to run over 3000 queries and process over a petabyte of data each day. Other notable users of the query engine include Netflix, Uber, Twitter, Atlassian, Alibaba and Nasdaq.
What is EMR in AWS?
Elastic MapReduce (EMR) from Amazon Web Services provides on demand, scalable Hadoop clusters for processing large data sets. EMR uses Amazon EC2 instances for fast provisioning, scalability and high availability of compute power. With EMR, users can spin up Hadoop clusters and start processing data in minutes, without having to manage the configuration and tuning of each cluster node required for an on-premise Hadoop installation. Once the analysis is complete, clusters can be terminated instantly, saving on the cost of compute resources.
As a Hadoop distribution, EMR incorporates various Hadoop tools, including Presto, Spark and Hive, so that users can query and analyze their data. With EMR, data can be accessed directly from AWS S3 storage using EMRFS (Elastic MapReduce File System) or copied into HDFS (Hadoop Distributed File System) on each cluster instance for the lifetime of the cluster. In order to persist data stored in HDFS, it must be manually copied to S3 before the cluster is terminated.
Using Presto on EMR
With Presto and EMR, users can run interactive queries on large data sets with minimal setup time. EMR handles the provisioning, configuration and tuning of Hadoop clusters. Providing you launch a cluster with Amazon EMR 5.0.0 or later, Presto is included automatically as part of the cluster software. Earlier versions of EMR include Presto as a sandbox application.
AWS EMR and Presto Configurations
As a query engine, Presto does not manage storage of the data to be processed; it simply connects to the relevant data source in order to run interactive queries. In EMR, data is either copied to HDFS on each cluster instance or read from S3. With EMR 5.12.0 onwards, by default Presto uses EMRFS to connect to Amazon S3. EMRFS extends the HDFS API to S3, giving Hadoop applications, like Presto, access to data stored in S3 without additional configuration or copying of the data. For earlier versions of EMR, data in S3 can be accessed using Presto’s Hive connector.
While S3 provides scalable, resilient, cost-effective cloud storage which interfaces with EMR using EMRFS, processing data stored in S3 with EMR and Presto carries a performance penalty. With EMRFS, data is not cached locally and must be fetched from S3 servers. These reads may be further impacted by throttling on S3. Because the data in S3 is not cached locally, metadata stored on the EMR clusters is only eventually consistent with the data in S3. The resulting sync jobs further degrade the performance of Presto on EMR clusters. Adding Alluxio’s data orchestration platform to your EMR clusters not only provides a significant performance improvement, but also extends the scope of your analytics stack.
By adding a multi-tier cache, Alluxio minimizes slower reads to S3 storage without requiring manually copying of data to each cluster instance. Because both the data and metadata are available locally, Alluxio ensures strong data consistency, optimizing the environment for Hadoop tools. If outputs need to be written back to S3 storage, you can specify whether Alluxio does so synchronously or asynchronously, according to the level of resilience required.
Alluxio also adds an abstraction layer to EMR, by providing a global namespace, so you can connect your EMR clusters to multiple S3 buckets from different AWS accounts and different regions. Alluxio can be included in EMR clusters by subscribing to the Alluxio AMI (Amazon Machine Image) and using the bootstrap action to configure Alluxio with Presto.
Amazon EMR provides a highly scalable analytics stack for big data with industry-leading data processing tools, including Presto. Combining EMR and Presto with Alluxio enables interactive queries on large data sets stored in multiple Amazon S3 buckets without sacrificing performance.