How do you access AWS S3 data when running Presto in an on-premise environment?

How do we access AWS S3 data when running Presto in an on-premise environment, and how can we do it efficiently to reduce both egress costs and runtimes?

Alluxio as a local cache for Presto queries against remote AWS S3 data sources

As we move toward more and more decoupled environments, one thing becomes apparent: the need to solve slow data access while also limiting some of the extremely high costs of cloud environments.

When running compute workloads in an on-prem environment against remote object stores, you can run into performance degradation from constantly re-pulling the same data on-prem. Depending on network latency, a significant share of a job's runtime can be spent just on this pull into the local environment.

Furthermore, the economics of cloud storage tell us that pulling data out of a cloud object store can be expensive, especially when doing so on a recurring basis.
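To make the recurring-egress point concrete, here is a back-of-the-envelope sketch in shell. The $0.09/GB price, 500 GB dataset size, and 20 re-reads are illustrative assumptions, not figures from this setup:

```shell
#!/bin/sh
# Illustrative egress math: re-reading a dataset from S3 on every job
# versus pulling it once into a local cache. All numbers are assumptions.
DATASET_GB=500          # assumed dataset size
RUNS=20                 # assumed number of jobs that read it
PRICE_CENTS_PER_GB=9    # assumed egress price: $0.09/GB, in cents

direct_cents=$((DATASET_GB * RUNS * PRICE_CENTS_PER_GB))   # pay egress every run
cached_cents=$((DATASET_GB * PRICE_CENTS_PER_GB))          # pay egress once

echo "Egress without a cache: \$$((direct_cents / 100))"
echo "Egress with a cache:    \$$((cached_cents / 100))"
```

Under these assumptions the difference is $900 versus $45: the cached setup pays egress once, no matter how many jobs re-read the data.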

This article aims to guide end users in leveraging Alluxio as a local cache for Presto against remote S3 data sources. By creating external tables in Hive backed by cloud object storage, you can reduce egress costs by pulling the data once, and speed up iterative Presto jobs with accelerated storage.

# Example

## Setup Alluxio

First, deploy an Alluxio cluster across 2 nodes (1 master, 1 worker) on EC2 instances. Please refer to this documentation to install an Alluxio cluster, or use our Sandbox to deploy Alluxio on AWS EC2 with one click for free.

Alluxio has one mount: S3 as the root mount.
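For reference, the root mount is typically set in `conf/alluxio-site.properties` before the cluster starts. A minimal sketch, with a placeholder bucket name and credentials; the exact property names vary between Alluxio versions:

```properties
# Hypothetical root-mount configuration (conf/alluxio-site.properties).
# Bucket name and credentials below are placeholders.
alluxio.underfs.address=s3a://my-tpcds-bucket/
aws.accessKeyId=<ACCESS_KEY_ID>
aws.secretKey=<SECRET_KEY>
```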

```
ec2-user@HivePrestoSandbox-masters-1:/tmp/alluxio$ ./bin/alluxio fs ls /
-rwx------ madan          madan                        0       PERSISTED 10-06-2018 00:59:50:000 100% /_SUCCESS
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:403   0% /u_user
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:403  DIR /catalog_page
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:404  DIR /catalog_returns
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:404  DIR /catalog_sales
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:405  DIR /customer
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:405  DIR /customer_address
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:406  DIR /customer_demographics
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:406  DIR /date_dim
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:407  DIR /dbgen_version
drwxr-xr-x ec2-user       ec2-user                    12       PERSISTED 04-16-2019 14:44:53:389  DIR /default_tests_files
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:407  DIR /household_demographics
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:407  DIR /income_band
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:408  DIR /inventory
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:408  DIR /item
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:412  DIR /time_dim
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:412  DIR /warehouse
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:413  DIR /web_page
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:413  DIR /web_returns
-rwx------ madan          madan               1511421369       
```

Above we see that we have mounted the S3 bucket as the root mount point in Alluxio. The bucket contains TPC-DS sample data that is not yet in the Alluxio cache. We can confirm this by drilling down into the directories and seeing that 0% of the data is in the Alluxio cache.

## Create Hive Tables

Run the following Hive DDL statement to create an external Hive table `u_user`:

```sql
CREATE EXTERNAL TABLE u_user (
  userid INT,
  age INT,
  gender CHAR(1),
  occupation STRING,
  zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'alluxio://107.23.117.15:19998/u_user/';
```

We start by creating the `u_user` table against the Alluxio mount of the S3 bucket.
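If the table already exists with an `s3a://` location, you can get the same effect by repointing it at Alluxio rather than recreating it; a sketch, reusing the master address from the DDL above:

```sql
-- Repoint an existing external table at the Alluxio mount
-- (same master address as the CREATE TABLE above).
ALTER TABLE u_user
SET LOCATION 'alluxio://107.23.117.15:19998/u_user/';
```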

Now, via Presto, we can query the same Hive table, which is now backed by Alluxio:

```
$ ./bin/presto --server localhost:8082 --catalog hive --schema default

presto:default> select * from u_user limit 10;
 userid | age | gender |  occupation   | zipcode
--------+-----+--------+---------------+---------
      1 |  24 | M      | technician    | 85711
      2 |  53 | F      | other         | 94043
      3 |  23 | M      | writer        | 32067
      4 |  24 | M      | technician    | 43537
      5 |  33 | F      | other         | 15213
      6 |  42 | M      | executive     | 98101
      7 |  57 | M      | administrator | 91344
      8 |  36 | M      | administrator | 05201
      9 |  29 | M      | student       | 01002
     10 |  53 | M      | lawyer        | 90703
(10 rows)

Query 20190418_195736_00004_x89yg, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [943 rows, 22.1KB] [2.44K rows/s, 57.2KB/s]
```

When we check the Alluxio file system, the data corresponding to the table is now in the Alluxio cache:

```
ec2-user@HivePrestoSandbox-masters-1:/tmp/alluxio$ ./bin/alluxio fs ls /
-rwx------ madan          madan                        0       PERSISTED 10-06-2018 00:59:50:000 100% /_SUCCESS
drwx------ madan          madan                        0       PERSISTED 04-16-2019 14:45:02:403 100% /u_user
drwx------ madan          madan                        0       
```

We see that, via Alluxio, Presto is able to bring remote S3 data into the local environment and store it in the Alluxio cache. In this example we let the application fetch the data on first query, but we can also preload the data when we need to guarantee tight SLAs for certain application workflows.
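To preload rather than cache lazily, Alluxio's `fs load` command reads a path through Alluxio so it lands in the cache before any query runs. A minimal sketch; the `/opt/alluxio` install path is an assumption, and the actual load is left commented out since it needs a live cluster:

```shell
#!/bin/sh
# Sketch: warm the Alluxio cache for the u_user table ahead of Presto queries.
# ALLUXIO_HOME below is an assumed install location.
ALLUXIO_HOME="${ALLUXIO_HOME:-/opt/alluxio}"
PRELOAD_CMD="$ALLUXIO_HOME/bin/alluxio fs load /u_user"

echo "Would preload with: $PRELOAD_CMD"
# On a live cluster, run it directly:
# $PRELOAD_CMD
```

Running this after such a preload, `alluxio fs ls /` should report `/u_user` at 100% cached before the first Presto query arrives.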