TUTORIAL: GETTING STARTED WITH Starburst Presto & Alluxio on AWS using Cloud Formation Template


5 min TutoriaL

This tutorial walks you through the steps to create a Starburst Presto and Alluxio cluster using a combined AWS Cloud Formation Template CloudFormation template. The CFT lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and experience the performance benefits of using Alluxio in a big data software stack. It will also show how Alluxio can improve Presto’s query performance by reading through Alluxio to access locally cached data, that is stored in an Amazon S3 data lake.


Outline

  • Prerequisites
  • Deploying Alluxio and Presto joint cluster
  • Access the Alluxio Presto cluster
  • Explore Alluxio
  • Run queries with Presto on Alluxio
  • Presto on Alluxio advantages

PREREQUISITES

Familiarity with EC2 and CloudFormation is helpful. The tutorial deploys a Presto and Alluxio joint cluster using CloudFormation Template through the web console step by step.

Note that the launched EC2 instances inside the cluster does not qualify for free usage tier because the instances need to have sufficient resources to execute the workload. The instance type we will be using is r4.2xlarge.


Deploying A Presto & Alluxio CLUSTER using CFT

The detailed explanation of selecting template and specifying cluster details is documented in the Presto with caching CFT deployment docs.
You can also find the CloudFormation template on the AWS Marketplace.

Follow the deployment instructions to set up your cluster.

For this tutorial, it’s suggested to use the following parameters to create a cluster with enough resources to run the example queries:

  • SecurityGroup: have a security group with the following ports opened
    • Port 22 for SSH
    • Port 19999 and 30000 for accessing the Alluxio web UI
    • Port 8080 and 8088 for accessing Presto web UI
  • CoordinatorInstanceType: r4.2xlarge
  • WorkersInstanceType: r4.2xlarge
  • WorkersCount:1
  • HACoordinatorsCount: 1

Access the Presto ALLUXIO cluster

In order to access the Presto Alluxio cluster, you must obtain the address of the Presto coordinator. Alluxio CLIs can be run in any of the cluster nodes, but Presto CLI can only be run on the active Presto coordinator node.

  • Navigate to the CloudFormation Console, select your “Stack Name” and click the associated tab labeled “Outputs”. Find the private IP address of the Presto coordinator. 
  • Navigate to the EC2 console, select “Instances”, paste the private IP address of the Presto coordinator, and find the corresponding public DNS name of the Presto coordinator.

  • Open a terminal window and establish a connection
$ssh -i /path/to/mykeypair.pem ec2-user@<master_public_DNS>

For more details, see Accessing the Presto Cluster.

Explore Alluxio 

We’ll use a combination of Alluxio web UI and the Alluxio CLI to explore the Alluxio cluster status.

Inside one of the cluster instances, the output of alluxio fsadmin report will show the private DNS of the Alluxio leader master. 

The Alluxio leader master can be located on the same node as the Presto leader coordinator. If not, go to EC2 console and find the corresponding public DNS.

Make sure your chosen security group open the port 19999 for Alluxio master web UI and open the web UI.

You can see the current Alluxio mounts by running alluxio fs mount from within the instance.

$ alluxio fs mounts3://alluxio-presto-bucket/dir  on / (s3, capacity=-1B, used=-1B, not read-only, not shared, properties={})

You can see the current Alluxio files from the Alluxio web UI or by running alluxio fs ls / from within the instance.

$ alluxio fs ls /15928084       PERSISTED 10-31-2019 20:54:48:000   0% /data.gz

The root directory contains the remote data that is located in the S3 address you chose as the AlluxioS3RootMount. 

Run queries with Presto on Alluxio

In this section we’re going to use Presto and Alluxio to show how Alluxio can massively decrease query times by reading cached data.

This guide focuses on using presto-cli; however, you can also use the Presto UI at the http://EC2_PUBLIC_DNS:8080 to view the status of your queries.

Step1: Create THE table IN alluxiO

From within the instance, launch the Presto CLI with Alluxio catalog:

presto-cli –catalog alluxio –debugpresto> 

Tip: You can exit at any time by typing exit;

The joint cluster is launched with two sample catalogs using hive metastore, one called hive, the other called alluxio. The two catalogs share the same underlying hive metastore but the alluxio catalog is preconfigured to use the Alluxio transparent Hive integration feature. This feature seamless integrated Alluxio with Presto and no prefix changes or path changes need to be made to the table location in the Hive Metastore. All the operations in the sample alluxio catalog will go through Alluxio instead of directly connecting to the remote data source. Those are the two main catalogs we are comparing in this tutorial.

presto> show catalogs; Catalog——— alluxio hive jmx system tpcds tpch(6 rows)

Create “ontime” schema inside alluxio catalog

presto> create schema if not exists ontime;use alluxio:ontime;

Create the table with an existing flights dataset in S3 bucket:


presto> CREATE TABLE IF NOT EXISTS flights_orc (                                                              
    year integer,                                                                                    
    quarter integer,                                                                                 
    month integer,                                                                                   
    dayofmonth integer,                                                                              
    dayofweek integer,                                                                               
    flightdate varchar,                                                                              
    uniquecarrier varchar,                                                                           
    airlineid integer,                                                                               
    carrier varchar,                                                                                 
    tailnum varchar,                                                                                 
    flightnum integer,                                                                               
    originairportid integer,                                                                         
    originairportseqid integer,                                                                      
    origincitymarketid integer,                                                                      
    origin varchar,                                                                                  
    origincityname varchar,                                                                          
    originstate varchar,                                                                             
    originstatefips integer,                                                                         
    originstatename varchar,                                                                         
    originwac integer,                                                                               
    destairportid integer,                                                                           
    destairportseqid integer,                                                                        
    destcitymarketid integer,                                                                        
    dest varchar,                                                                                    
    destcityname varchar,                                                                            
    deststate varchar,                                                                               
    deststatefips integer,                                                                           
    deststatename varchar,                                                                           
    destwac integer,                                                                                 
    crsdeptime integer,                                                                              
    deptime integer,                                                                                 
    depdelay integer,                                                                                
    depdelayminutes integer,                                                                         
    depdel15 integer,                                                                                
    departuredelaygroups integer,                                                                    
    deptimeblk varchar,                                                                              
    taxiout integer,                                                                                 
    wheelsoff integer,                                                                               
    wheelson integer,                                                                                
    taxiin integer,                                                                                  
    crsarrtime integer,                                                                              
    arrtime integer,                                                                                 
    arrdelay integer,                                                                                
    arrdelayminutes integer,                                                                         
    arrdel15 integer,                                                                                
    arrivaldelaygroups integer,                                                                      
    arrtimeblk varchar,                                                                              
    cancelled tinyint,                                                                               
    cancellationcode varchar,                                                                        
    diverted tinyint,                                                                                
    crselapsedtime integer,                                                                          
    actualelapsedtime integer,                                                                       
    airtime integer,                                                                                 
    flights integer,                                                                                 
    distance integer,                                                                                
    distancegroup integer,                                                                           
    carrierdelay integer,                                                                            
    weatherdelay integer,                                                                            
    nasdelay integer,                                                                                
    securitydelay integer,                                                                           
    lateaircraftdelay integer,                                                                       
    firstdeptime integer,                                                                            
    totaladdgtime integer,                                                                           
    longestaddgtime integer,                                                                         
    divairportlandings integer,                                                                      
    divreacheddest integer,                                                                          
    divactualelapsedtime integer,                                                                    
    divarrdelay integer,                                                                             
    divdistance integer,                                                                             
    div1airport varchar,                                                                             
    div1airportid integer,                                                                           
    div1airportseqid integer,                                                                        
    div1wheelson integer,                                                                            
    div1totalgtime integer,                                                                          
    div1longestgtime integer,                                                                        
    div1wheelsoff integer,                                                                           
    div1tailnum varchar,                                                                             
    div2airport varchar,                                                                             
    div2airportid integer,                                                                           
    div2airportseqid integer,                                                                        
    div2wheelson integer,                                                                            
    div2totalgtime integer,                                                                          
    div2longestgtime integer,                                                                        
    div2wheelsoff integer,                                                                           
    div2tailnum varchar,                                                                             
    div3airport varchar,                                                                             
    div3airportid integer,                                                                           
    div3airportseqid integer,                                                                        
    div3wheelson integer,                                                                            
    div3totalgtime integer,                                                                          
    div3longestgtime integer,                                                                        
    div3wheelsoff integer,                                                                           
    div3tailnum varchar,                                                                             
    div4airport varchar,                                                                             
    div4airportid integer,                                                                           
    div4airportseqid integer,                                                                        
    div4wheelson integer,                                                                            
    div4totalgtime integer,                                                                          
    div4longestgtime integer,                                                                        
    div4wheelsoff integer,                                                                           
    div4tailnum varchar,                                                                             
    div5airport varchar,                                                                             
    div5airportid integer,                                                                           
    div5airportseqid integer,                                                                        
    div5wheelson integer,                                                                            
    div5totalgtime integer,                                                                          
    div5longestgtime integer,                                                                        
    div5wheelsoff integer,                                                                           
    div5tailnum varchar                                                                              
 )                                                                                                   
 WITH (                                                                                              
    external_location = ‘s3://alluxio-public-http-ufs/flights’, 
    format = ‘ORC’                                                                                   
 );

When the Presto query is executed for the first time, it will try to access data from the remote data source and Alluxio will check if the data source has been mounted. If not yet mounted, Alluxio will mount the data source automatically, using the instance’s default credentials to access the remote S3 bucket. Please refer to auto-mount documentation [add the auto-mount docs link] for more information.

$ alluxio fs mounts3://alluxio-public-http-ufs/   on /auto-mount/s3/alluxio-public-http-ufs  (s3, capacity=-1B, used=-1B, read-only, not shared, properties={})s3://alluxio-presto-bucket/dir  on /             (s3, capacity=-1B, used=-1B, not read-only, not shared, properties={})

The s3 bucket in the create table statement is mounted to Alluxio under /auto-mount/s3/<bucket_name>. 

Step 2: Run THE query ON ALLUXIO VIA THE ALLUXIO SAMPLE catalog

Now we try to dig out information from the flights dataset!

LAX is one of the biggest airports in United States and it’s one of the most popular airports for transferring flights. So what’s the worst carrier to fly with while having a change in LAX?

presto>
use alluxio.ontime;

with avg_arr_delays as (select avg(arrdelay) as arr_delay, carrier from flights_orc where dest = ‘LAX’ group by carrier),
    avg_dept_delays as (select avg(depdelay) as dept_delay, carrier from flights_orc where origin = ‘LAX’ group by carrier),
    time_frames as (select dept_delay – arr_delay as time_frame, a.carrier from avg_arr_delays a, avg_dept_delays d where a.carrier = d.carrier group by (dept_delay – arr_delay), a.carrier),
   min_max as (select min(time_frame) min_time_to_change, carrier from time_frames group by carrier)
select t.* from time_frames t inner join min_max m on t.time_frame = m.min_time_to_change order by t.time_frame asc;

This query will give you the answer!

An output table is shown when done;  Looks like EV and RU are the most delayed carrier.


     time_frame      | carrier
———————+———
 -5.3930822327413495 | EV
  -4.012651864264767 | RU
   -3.69879877585484 | TW
 -3.6405191322747203 | NW
 -2.9354615442895993 | FL
  -2.299247202941876 | TZ
 -1.7008222854388864 | YV
 -1.6668243560859004 | HP
  -1.336331893501165 | CO

Press q to leave the query results.

After quitting the results, the query summary should resemble the following:


Query 20191031_212755_00006_dw66g, FINISHED, 1 node
Splits: 718 total, 718 done (100.00%)
0:56 [606M rows, 15.6GB] [10.8M rows/s, 286MB/s]

Because this is the first time we’re reading the data, it is pulled from S3 and is then returned through Alluxio. At the same time, the Alluxio worker will be caching the data in memory so that the next time the data is accessed it can be read at memory speed.

Running the query again should be faster since the data is now cached in Alluxio, unlike the first query which read its data from the S3 bucket. Let’s run it again!


Query 20191031_212905_00007_dw66g, FINISHED, 1 node
Splits: 718 total, 718 done (100.00%)
0:14 [606M rows, 15.6GB] [42.9M rows/s, 1.11GB/s]

Step 3: Run query with The Sample hive catalog

Now we will change to the hive catalog and run the same query again. When using the hive catalog, Presto will access the remote data directly, instead of interacting with Alluxio.



presto> with avg_arr_delays as (select avg(arrdelay) as arr_delay, carrier from flights_orc where dest = ‘LAX’ group by carrier),
    avg_dept_delays as (select avg(depdelay) as dept_delay, carrier from flights_orc where origin = ‘LAX’ group by carrier),
    time_frames as (select dept_delay – arr_delay as time_frame, a.carrier from avg_arr_delays a, avg_dept_delays d where a.carrier = d.carrier group by (dept_delay – arr_delay), a.carrier),
   min_max as (select min(time_frame) min_time_to_change, carrier from time_frames group by carrier)
select t.* from time_frames t inner join min_max m on t.time_frame = m.min_time_to_change order by t.time_frame asc;

The result will be similar however the data in this case is not served through Alluxio and the query will take longer to run.


Query 20191031_213512_00010_dw66g, FINISHED, 1 node
Splits: 718 total, 718 done (100.00%)
0:42 [606M rows, 15.6GB] [14.4M rows/s, 381MB/s]

Presto on Alluxio advantages

Running Presto on Alluxio has never been easier, using features newly added in Alluxio 2.1.0. Most importantly, the amount of configuration is reduced; the above workflow was possible out of the box with only a few configuration entries in the existing catalog/hive.properties.

Alluxio’s seamless integration with transparent Hive integration makes it very easy to use Alluxio as a caching layer for Presto. Presto queries can run significantly faster with Alluxio caching frequently data locally to provide memory speed data analytics for Presto.