How do you partition a Hive table across storage systems using Alluxio?

Today, when creating a Hive table, it is a common technique to partition the table across different values and ranges to improve query performance and reduce maintenance cost. However, Hive cannot directly access a single table with a single query when that table's data is spread across different storage media and different clusters. This becomes a need when the data volume grows too large to fit in a single storage medium or cluster, and also when users need to take the following into consideration:

  • Storage cost where some partitions are less important than others and can be stored on cheaper storage tiers.
  • Regional compliance where data cannot leave a region and cannot be persisted outside the region

This is where Alluxio comes in: it interfaces with applications like Hive as a distributed virtual file system, making it possible to create a single table whose partitions live in different storage systems. For example, to reduce storage cost while keeping the same high performance in a multi-cluster environment, you can use Hive with Alluxio to create tables partitioned by date ranges, allowing the most frequently used datasets to reside on higher tiers of storage (MEM) while older or less frequently accessed data stays in other storage services like HDFS on lower-cost storage (SSD, HDD) or even remote cloud storage (like S3). In this setup, data always resides in the under storage system as the source of truth and may reside temporarily in the Alluxio file system.

This article guides end users through leveraging Alluxio to create external tables in Hive backed by different file locations, with each location representing a subset of the overall data partitioned by a value of choice.

Example

Set up Alluxio

First, deploy an Alluxio cluster across 4 nodes (1 Master, 3 Workers) on EC2 instances. Please refer to this documentation to install an Alluxio cluster, or use our Sandbox to deploy Alluxio on AWS EC2 with one click for free.
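If you are installing manually, the cluster can be brought up with Alluxio's standard scripts. A minimal sketch, assuming Alluxio is extracted to the same path on every node, conf/workers lists the three worker hostnames, and passwordless ssh is configured from the master:

# Run on the master node after editing conf/alluxio-site.properties
./bin/alluxio format                     # format the journal and worker storage
./bin/alluxio-start.sh all SudoMount     # start the master and all workers, mounting the ramdisk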

Alluxio has 2 mounts: HDFS from the HDP cluster as the root mount point, and an S3 bucket as a nested mount.

HDP is also deployed on 4 nodes. For this example, it is co-located on the Alluxio nodes, but it could be a separate cluster.
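For reference, here is a minimal sketch of how these two mounts could be set up; the NameNode host, bucket name, and credentials below are placeholders, and the root UFS property assumes Alluxio 2.x (older releases use alluxio.underfs.address instead):

# Root mount: set in conf/alluxio-site.properties before starting the cluster
alluxio.master.mount.table.root.ufs=hdfs://<namenode-host>:8020/

# Nested S3 mount: issued against the running cluster
./bin/alluxio fs mount --option aws.accessKeyId=<ACCESS_KEY> \
  --option aws.secretKey=<SECRET_KEY> \
  /s3 s3://<your-bucket>/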

[centos@ip-172-31-8-93 alluxio]$ ./bin/alluxio fs ls /
drwxr-xr-x  centos         centos                      1       PERSISTED 02-20-2019 00:31:53:234  DIR /hdfs
drwxr-xr-x  centos         centos                       1   NOT_PERSISTED 02-19-2019 22:21:46:005  DIR /local
drwxrwxrwx  madan          madan                        1       PERSISTED 02-19-2019 22:14:29:532  DIR /s3


Create Hive Tables

Run the following Hive DDL statement to create an external Hive table call_center_s3:

create external table call_center_s3(
      cc_call_center_sk         bigint               
,     cc_call_center_id         string              
,     cc_rec_end_date          string                         
,     cc_closed_date_sk         bigint                       
,     cc_open_date_sk           bigint                       
,     cc_name                   string                   
,     cc_class                  string                   
,     cc_employees              int                       
,     cc_sq_ft                  int                       
,     cc_hours                  string                      
,     cc_manager                string                   
,     cc_mkt_id                 int                       
,     cc_mkt_class              string                      
,     cc_mkt_desc               string                  
,     cc_market_manager         string                   
,     cc_division               int                       
,     cc_division_name          string                   
,     cc_company                int                       
,     cc_company_name           string                      
,     cc_street_number          string                      
,     cc_street_name            string                   
,     cc_street_type            string                      
,     cc_suite_number           string                      
,     cc_city                   string                   
,     cc_county                 string                   
,     cc_state                  string                       
,     cc_zip                    string                      
,     cc_country                string                   
,     cc_gmt_offset             double                  
,     cc_tax_percentage         double
)
partitioned by (cc_rec_start_date string)
row format delimited fields terminated by '|' 
location 'alluxio://172.31.9.93:19998/s3/';

We start off by creating the original call_center table against the S3 bucket. Notice that when creating this external table, we partition it by the cc_rec_start_date column. The table is then altered by date, with the location modified accordingly. We alter the table with 3 values as follows:

ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='1998-01-01')
location 'alluxio://172.31.9.93:19998/s3/';

We have altered the table and added the partition for the year 1998. We allow this partition to remain in the Alluxio S3 location, as it represents colder data on a remote storage device.

ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='2001-01-01')
location 'alluxio://172.31.9.93:19998/s3/';

We altered the table with the date range of 2001. This too we allow to remain on remote S3 storage, as it is less frequently accessed; more recent data we can store in an HDFS cluster on disk (SSD).

ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='2002-01-01')
location 'alluxio://172.31.9.93:19998/hdfs/';

Finally, for the most current data, we can add a partition stored against the Alluxio HDFS location, which is backed by SSDs. This location is on disk across the Alluxio nodes.
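To confirm that each partition points at the intended location, you can inspect the table from the Hive CLI (the exact output layout depends on your Hive version):

hive> show partitions call_center_s3;
hive> describe formatted call_center_s3 partition (cc_rec_start_date='2002-01-01');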

[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x  centos         centos                      9356       PERSISTED 04-19-2019 18:30:59:822  0% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x  centos         centos                       9356   PERSISTED 04-19-2019 19:04:45:107  0% /s3/data-m-00099

In the above output, we see that for both the HDFS and S3 mount points the data is still only PERSISTED in the under file system and is not yet in Alluxio memory. This is evident from the percentage shown for each file, 0%, indicating that the file is not yet in Alluxio MEM.

Now, after we query the Hive table, we can verify that it indeed reaches out to the respective under storage locations by checking once again whether the data is now in Alluxio MEM (cache_promote is the default read type policy); if so, the file shows 100%. Also note that the data in the S3 partition does not get pulled into Alluxio, as that partition is eliminated by the Hive runtime based on the query predicate.
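The read type policy controls this caching behavior. A minimal sketch of setting it explicitly, assuming the Hive nodes pick up an alluxio-site.properties from the Alluxio client classpath (CACHE_PROMOTE is already the default, so this is usually not needed):

# conf/alluxio-site.properties on the client
alluxio.user.file.readtype.default=CACHE_PROMOTE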

hive>  select * from call_center_s3 where cc_rec_start_date='2002-01-01';

Note that this is the date range for the partition residing on HDFS.

[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x  centos         centos                      9356       PERSISTED 04-19-2019 18:30:59:822  100% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x  centos         centos                       9356   PERSISTED 04-19-2019 19:04:45:107  0% /s3/data-m-00099

Verifying through the Alluxio namespace, we can confirm that the query brought data up from HDFS, not from another remote directory, and loaded it into Alluxio MEM.

hive>  select * from call_center_s3 where cc_rec_start_date='1998-01-01';

Similarly, for S3 we query the table with the date value set to 1998.

[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x  centos         centos                      9356       PERSISTED 04-19-2019 18:30:59:822  100% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x  centos         centos                       9356   PERSISTED 04-19-2019 19:04:45:107  100% /s3/data-m-00099

Verify again through the Alluxio command line that the data was brought up from S3 and into the Alluxio namespace.

Conclusion

Many users leverage partitioning to speed up Hive performance and simplify maintenance. This blog discussed an interesting way to combine Hive with Alluxio to store a single table while leveraging multiple different storage resources for higher efficiency.