Introduction to Presto and commonly asked questions


Presto with Alluxio brings together two open source technologies to give you better performance and multi-cloud capabilities for interactive analytic workloads. Presto’s open source distributed SQL query engine, coupled with Alluxio, enables true separation of storage and compute while preserving data locality, provides memory-speed response times, and aggregates data from any file or object store.

What is Presto and what are its use cases?

Presto was originally designed at Facebook, which needed to run interactive queries against large data warehouses in Hadoop. It was built specifically to run fast queries against data warehouses storing petabytes of data.

Presto is a SQL-based query engine that uses a massively parallel processing (MPP) architecture to scale out. Because it is a query engine only, it separates compute from storage and relies on connectors to integrate with the data sources it queries. In this capacity, it stands out from other technologies in the space by being able to query:

  • Traditional Databases
    • MySQL
    • PostgreSQL
    • SQL Server
  • Non-relational Databases
    • MongoDB
    • Redis
    • Cassandra
  • File formats like ORC, Parquet and Avro, stored on:
    • Amazon S3
    • Google Cloud Storage
    • Azure Blob Storage
    • HDFS
    • Clustered file systems
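
Each connector is exposed to Presto as a catalog, and tables are addressed as catalog.schema.table, so a single statement can join data that lives in different systems. The sketch below only builds such a query string. The catalog names (hive, mysql), schemas, tables, and columns are hypothetical and would depend on how your own catalogs are configured.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query() -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    # Hypothetical tables: click logs stored as files behind a Hive catalog,
    # and a customer table that lives in a MySQL database.
    clicks = qualified("hive", "web", "clicks")
    customers = qualified("mysql", "crm", "customers")
    return (
        "SELECT c.country, count(*) AS clicks\n"
        f"FROM {clicks} k\n"
        f"JOIN {customers} c ON k.customer_id = c.id\n"
        "GROUP BY c.country"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The point of the example is the naming scheme: the query itself does not care whether a table is backed by files on S3 or by rows in a relational database.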

You can see how to deploy Presto in the documentation.

What is a Presto Database?

Presto is a distributed system built on a classic massively parallel processing (MPP) database architecture (you might hear some people call it PrestoDB). It is often deployed alongside Hadoop, although it does not require it. It has one coordinator node (master) working in sync with multiple worker nodes. After users submit a SQL query through a client to the Presto coordinator, the coordinator parses the query, plans it, and schedules a distributed query plan across all its worker nodes. Presto is built with a familiar SQL query interface that allows you to easily run interactive SQL on Hadoop. It supports standard ANSI SQL semantics, including complex queries, aggregations, and joins.
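
The client-to-coordinator exchange described above can be sketched with Presto's HTTP client protocol: the statement is POSTed to /v1/statement on the coordinator, and the client then follows the nextUri field of each response until it disappears. This is a minimal sketch, not a production client. The coordinator address and user name are placeholders, the injectable fetch parameter is an assumption made for illustration and testing, and there is no retry or cancellation handling.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional


def http_fetch(url: str, body: Optional[bytes], headers: dict) -> dict:
    """Send one HTTP request (POST if a body is given, else GET) and decode JSON."""
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_query(
    sql: str,
    coordinator: str,
    user: str,
    fetch: Callable[[str, Optional[bytes], dict], dict] = http_fetch,
) -> Iterator[list]:
    """Submit sql to the coordinator and yield result rows as they arrive."""
    headers = {"X-Presto-User": user}
    # The statement goes to the coordinator, which parses and plans it.
    page = fetch(f"{coordinator}/v1/statement", sql.encode("utf-8"), headers)
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        # A response page may carry a batch of rows produced by the workers.
        yield from page.get("data", [])
        next_uri = page.get("nextUri")
        if next_uri is None:  # no nextUri means the query has finished
            return
        page = fetch(next_uri, None, headers)
```

Assuming a coordinator is reachable at a placeholder address such as http://localhost:8080, you would iterate over run_query("SELECT 1", "http://localhost:8080", "analyst") to receive rows.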

How does Presto cache and store data?

Presto keeps intermediate data in its buffer cache while tasks are running. However, it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed to be a query execution engine that lets you query disparate data sources.

This is where a technology like Alluxio might help. Alluxio provides a multi-tiered layer for Presto caching and connects to a variety of storage systems and clouds so Presto can query data stored anywhere.

What is Apache Presto?

Presto, or PrestoDB, is a distributed SQL query engine that is best suited to running interactive analytic workloads in your big data environment. Although it is often called Apache Presto, it is released under the Apache License and is not an Apache Software Foundation project. Presto allows you to query many different data sources, whether it's HDFS, MySQL, Cassandra, or Hive. Presto is written in Java and can also integrate with other third-party data sources or infrastructure components.

The Presto query execution model is organized around a few concepts: Statement, Query, Stage, Task, and Split. This article will briefly discuss each to explain what Presto is and what it is not. After you issue a SQL statement to the query engine, it parses the statement and converts it to a query. Presto executes the query by breaking it up into multiple stages. Stages are then split up into tasks that run across the Presto workers. Think of tasks as the units that actually do the work and processing. Each task operates on splits, which are sections of the larger data set. Tasks use an Exchange to share data with one another, passing the output of one stage to the next.
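
To make the stage, task, and split vocabulary concrete, here is a toy model of a two-stage SUM query. It is only an illustration of the idea, not how Presto schedules work internally: the round-robin assignment, the worker names, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def assign_splits(splits: List[List[int]], workers: List[str]) -> Dict[str, List[List[int]]]:
    """Hand splits to one task per worker, round-robin (a stand-in for real scheduling)."""
    tasks = defaultdict(list)
    for i, split in enumerate(splits):
        tasks[workers[i % len(workers)]].append(split)
    return dict(tasks)


def run_leaf_task(task_splits: List[List[int]]) -> int:
    """Leaf-stage task: compute a partial sum over the splits it was given."""
    return sum(sum(split) for split in task_splits)


def execute_sum(splits: List[List[int]], workers: List[str]) -> int:
    """Run a toy two-stage plan for SELECT sum(x)."""
    # Leaf stage: each worker's task scans its own splits in isolation.
    tasks = assign_splits(splits, workers)
    partials = [run_leaf_task(task_splits) for task_splits in tasks.values()]
    # "Exchange": the partial results flow to the next stage.
    # Final stage: a single task combines them into the answer.
    return sum(partials)


if __name__ == "__main__":
    data_splits = [[1, 2], [3], [4, 5, 6], [7]]
    print(execute_sum(data_splits, ["worker-1", "worker-2"]))
```

Each split here is just a small list of numbers. In a real deployment a split would be a chunk of a file or table served by a connector.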

Is Presto in-memory?

The memory Presto uses is the memory of its JVMs. Depending on query sizes and the complexity of tasks, you can allocate more or less memory to the JVMs. Presto itself, however, doesn’t use this memory to cache any data.

Does Presto use MapReduce?

No. Some users are used to Hive’s execution model, which breaks a query down into MapReduce jobs that work on the constituent data in HDFS. Presto uses its own mechanism to break down and fan out the work of a given query, and it does not rely on MapReduce to do so.

What is Presto in Big Data?

Big data is a broad term. Generally, it refers to ways to analyze, extract information from, and otherwise deal with data sets that are too large or complex for traditional data processing application software. Challenges in this space include:

  • Capturing data
  • Storing data
  • Analysis
  • Search
  • Sharing
  • Transfer
  • Visualization
  • Querying
  • Updating

Presto, including distributions such as Starburst Presto, falls into the querying vertical of big data. Other technologies in the space include Hive, Pig, HBase, Druid, Dremio, Impala, and Spark SQL.

Many of the technologies in the querying vertical of big data are designed within or to work directly against the Hadoop ecosystem.

What is the difference between Presto, PrestoSQL and Starburst Presto?

Presto originated at Facebook and was built specifically for Facebook’s needs. PrestoSQL is backed by the Presto Software Foundation, which broadened it for wider adoption. The Presto distribution from Starburst adds further optimizations and enterprise features such as the cost-based optimizer.

What is Presto Hive?

Presto Hive typically means Presto with the Hive connector. The connector allows querying of data that is stored in a Hive data warehouse. Hive is a combination of data files and metadata. The data files themselves can be of different formats and are typically stored in HDFS or an S3-type system. The metadata is information about the data files and how they are mapped to schemas and tables. This metadata is stored in a database such as MySQL and accessed via the Hive metastore service. Through the Hive connector, Presto is able to access both of these components.
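
The two-part lookup described above (metadata first, then data files) can be pictured with a small toy model. Nothing here is the Hive metastore API or the connector's real code: the dictionary, the bucket name, the paths, and the function name are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """What a metastore records about a table: columns, file format, location."""
    columns: Tuple[str, ...]
    file_format: str
    location: str


# Hypothetical metastore contents, keyed by (schema, table).
METASTORE: Dict[Tuple[str, str], TableMetadata] = {
    ("web", "clicks"): TableMetadata(
        columns=("user_id", "url"),
        file_format="ORC",
        location="s3://example-bucket/web/clicks/",
    ),
}

# Hypothetical object listing standing in for HDFS or S3.
STORAGE: List[str] = [
    "s3://example-bucket/web/clicks/part-0000.orc",
    "s3://example-bucket/web/clicks/part-0001.orc",
    "s3://example-bucket/web/orders/part-0000.orc",
]


def plan_scan(schema: str, table: str) -> Tuple[TableMetadata, List[str]]:
    """Resolve a table name to its metadata, then to the data files to read."""
    meta = METASTORE[(schema, table)]  # step 1: ask the metastore
    files = [path for path in STORAGE if path.startswith(meta.location)]  # step 2: list files
    return meta, files


if __name__ == "__main__":
    metadata, data_files = plan_scan("web", "clicks")
    print(metadata.file_format, data_files)
```

The separation matters because the same files and metadata can be read by more than one engine, which is why a Hive-defined table can be queried from Presto at all.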

One thing to note is that Hive also has its own query execution engine. Thus, there’s a difference between running a Presto query against a Hive-defined table and running the same query directly through the Hive CLI.

Does Presto use Spark?

Presto and Spark are two different query engines. You can read more details about the differences in this Quora thread, but at a high level Spark supports complex, long-running queries while Presto is better suited to short interactive queries.

Does Presto cache data?

Out of the box, no. This is where a technology like Alluxio comes in. Alluxio provides a read/write block-level caching engine that connects to a variety of storage systems including S3 and HDFS. You can read more in this blog.
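
To show what block-level caching means in general terms, here is a minimal read-through cache with least-recently-used eviction. It is a sketch of the concept only and does not reflect Alluxio's implementation or API. The class name, the read_block callback, and the capacity setting are assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class BlockCache:
    """Read-through LRU cache of blocks, sitting in front of a slower store."""

    def __init__(self, read_block: Callable[[str, int], bytes], capacity: int):
        self._read_block = read_block  # fetches (path, block_index) from the store
        self._capacity = capacity      # maximum number of cached blocks
        self._blocks = OrderedDict()   # (path, block_index) -> block bytes
        self.hits = 0
        self.misses = 0

    def get(self, path: str, index: int) -> bytes:
        """Return one block, reading from the underlying store only on a miss."""
        key = (path, index)
        if key in self._blocks:
            self.hits += 1
            self._blocks.move_to_end(key)  # mark as most recently used
            return self._blocks[key]
        self.misses += 1
        data = self._read_block(path, index)  # slow path: go to S3, HDFS, etc.
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data
```

A query engine that reads the same hot blocks repeatedly would pay the remote-storage cost once per block and then be served from the cache until the block is evicted.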

What is Teradata Presto?

Teradata distributes open-source Presto and works closely with Starburst Data, which also provides an enterprise distribution of Presto along with support. Both companies support open-source Presto and its community.

Does Presto use YARN?

Presto is not dependent on YARN as a resource manager. Instead, it uses a similar architecture of its own, with dedicated coordinator and worker nodes that do not need Hadoop infrastructure to run.

Presto + Alluxio

Presto with Alluxio is a truly separated compute and storage stack, enabling interactive big data analytics on any file or object store.

Alluxio provides a multi-tiered layer for Presto caching, enabling consistent high performance with jobs that run up to 10x faster.

Alluxio also makes the important data local to Presto, so there are no copies to manage, which also lowers costs.

Finally, Alluxio connects to a variety of storage systems and clouds so Presto can query data stored anywhere.

You can read more about Alluxio and Presto together at https://www.alluxio.io/presto/