How do we access AWS S3 data when running Presto in an on-premise environment, and how can we do so efficiently to reduce both egress costs and query runtimes?
Alluxio as a local cache for Presto queries against remote AWS S3 data sources
As we move toward increasingly decoupled environments, one thing becomes apparent: we need to solve slow data access while also limiting some of the extremely high costs of cloud environments.
When running compute workloads in an on-prem environment against remote object stores, you can hit performance degradation from repeatedly re-pulling the same data into the on-prem environment. Depending on network latency, a significant portion of a job's runtime can be spent on this transfer alone.
Furthermore, the economics of cloud storage tell us that pulling data out of a cloud object store is expensive, especially when done on a recurring basis.
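To see why recurring egress matters, here is a back-of-the-envelope sketch. The per-GB price, dataset size, and job count below are illustrative assumptions, not quotes; actual AWS egress pricing is tiered and changes over time:

```python
# Rough egress-cost comparison: re-pulling a dataset from S3 for every
# job run versus pulling it once into an on-prem cache such as Alluxio.
EGRESS_PRICE_PER_GB = 0.09   # assumed illustrative rate, not an AWS quote
dataset_gb = 500             # hypothetical dataset size
runs_per_month = 100         # iterative Presto jobs over the same data

# Without a cache, every run pays egress for the full dataset.
no_cache_cost = dataset_gb * runs_per_month * EGRESS_PRICE_PER_GB
# With a cache, egress is paid once; subsequent reads are served locally.
with_cache_cost = dataset_gb * 1 * EGRESS_PRICE_PER_GB

print(f"without cache: ${no_cache_cost:,.2f}/month")  # $4,500.00/month
print(f"with cache:    ${with_cache_cost:,.2f}")      # $45.00
```

Even with these rough numbers, caching turns a recurring per-run cost into a one-time cost.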
This article guides end users through leveraging Alluxio as a local cache for Presto against remote S3 data sources. By creating external Hive tables backed by cloud object storage, you can pull the data once to reduce egress costs, and speed up iterative Presto jobs with accelerated Presto storage.
Example
Setup Alluxio
First, deploy an Alluxio cluster across 2 nodes (1 master, 1 worker) on EC2 instances. Refer to this documentation to install an Alluxio cluster, or use our Sandbox to deploy Alluxio on AWS EC2 with one click for free.
Alluxio has one mount: the S3 bucket as its root mount.
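A root mount is typically configured in `alluxio-site.properties` before starting the cluster. A minimal sketch is below; the bucket name and credentials are placeholders, and the exact property names vary by Alluxio version (older releases use `alluxio.underfs.address` instead):

```
# alluxio-site.properties -- mount an S3 bucket as the Alluxio root UFS
alluxio.master.mount.table.root.ufs=s3://<your-tpcds-bucket>/
aws.accessKeyId=<AWS_ACCESS_KEY_ID>
aws.secretKey=<AWS_SECRET_ACCESS_KEY>
```

With this in place, paths under `alluxio://<master>:19998/` map directly onto objects in the bucket.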
```
ec2-user@HivePrestoSandbox-masters-1:/tmp/alluxio$ ./bin/alluxio fs ls /
-rwx------ madan madan 0 PERSISTED 10-06-2018 00:59:50:000 100% /_SUCCESS
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:403 0% /u_user
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:403 DIR /catalog_page
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:404 DIR /catalog_returns
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:404 DIR /catalog_sales
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:405 DIR /customer
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:405 DIR /customer_address
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:406 DIR /customer_demographics
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:406 DIR /date_dim
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:407 DIR /dbgen_version
drwxr-xr-x ec2-user ec2-user 12 PERSISTED 04-16-2019 14:44:53:389 DIR /default_tests_files
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:407 DIR /household_demographics
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:407 DIR /income_band
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:408 DIR /inventory
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:408 DIR /item
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:412 DIR /time_dim
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:412 DIR /warehouse
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:413 DIR /web_page
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:413 DIR /web_returns
-rwx------ madan madan 1511421369
```
Above we see that the S3 bucket is mounted as the root mount point in Alluxio. This bucket contains TPC-DS sample data that is not yet in the Alluxio cache; drilling down into the directories confirms that 0% of the data is in the Alluxio cache.
Create Hive Tables
Run the following Hive DDL statement to create an external Hive table `u_user`:
```sql
CREATE EXTERNAL TABLE u_user (
  userid INT,
  age INT,
  gender CHAR(1),
  occupation STRING,
  zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'alluxio://107.23.117.15:19998/u_user/';
```
We start off by creating the `u_user` table, whose data lives in the mounted S3 bucket and is addressed through Alluxio.
Now, via Presto, we can query the same Hive table, which is now backed by Alluxio:
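For Presto to resolve the `alluxio://` scheme, each Presto node needs the Alluxio client jar on the Hive connector's classpath and the Alluxio filesystem implementation registered in the Hadoop configuration that the connector reads. A minimal sketch follows; the jar name and file paths are illustrative and depend on your Alluxio and Presto versions:

```xml
<!-- core-site.xml referenced by the Hive connector's hive.config.resources -->
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>
```

The Alluxio client jar (e.g. `alluxio-<version>-client.jar`) is then copied into Presto's `plugin/hive-hadoop2/` directory on every node, followed by a restart of the Presto servers.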
```
./bin/presto --server localhost:8082 --catalog hive --schema default
presto:default> select * from u_user limit 10;
 userid | age | gender |  occupation   | zipcode
--------+-----+--------+---------------+---------
      1 |  24 | M      | technician    | 85711
      2 |  53 | F      | other         | 94043
      3 |  23 | M      | writer        | 32067
      4 |  24 | M      | technician    | 43537
      5 |  33 | F      | other         | 15213
      6 |  42 | M      | executive     | 98101
      7 |  57 | M      | administrator | 91344
      8 |  36 | M      | administrator | 05201
      9 |  29 | M      | student       | 01002
     10 |  53 | M      | lawyer        | 90703
(10 rows)

Query 20190418_195736_00004_x89yg, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [943 rows, 22.1KB] [2.44K rows/s, 57.2KB/s]
```
When we check the Alluxio file system, the file corresponding to the table is now in the Alluxio cache:
```
ec2-user@HivePrestoSandbox-masters-1:/tmp/alluxio$ ./bin/alluxio fs ls /
-rwx------ madan madan 0 PERSISTED 10-06-2018 00:59:50:000 100% /_SUCCESS
drwx------ madan madan 0 PERSISTED 04-16-2019 14:45:02:403 100% /u_user
drwx------ madan madan 0
```
We see that, via Alluxio, Presto is able to bring remote S3 data into a local environment and store it in the Alluxio cache. In this example we let the application fetch the data when we queried, but we can also preload the data when we need to guarantee high SLAs for certain application workflows. This solves the slow data access problem.
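Preloading can be sketched with the Alluxio CLI. The commands below assume the table path from the example above and are run from the Alluxio home directory; exact command names vary by Alluxio version:

```
# Warm the cache ahead of queries instead of on first read
./bin/alluxio fs load /u_user

# Confirm the path now shows 100% in Alluxio
./bin/alluxio fs ls /
```

Running the load before a scheduled Presto job means the first query of the day pays no S3 latency.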
Learn more about Presto at the upcoming PrestoCon in the Bay Area on March 24, where PrestoDB users and developers from Facebook, Uber, Twitter, and more will share use cases and best practices. Register at https://events.linuxfoundation.org/prestocon/register/