Today when we create a Hive table, it is a common technique to partition the table across different values and ranges to improve query performance and reduce maintenance cost. However, Hive can not access a single table directly using a single query with the data of this Hive table across different mediums of storage and different clusters. This becomes a need when the data volume grows too large to fit a single mediums of storage or cluster, and also when the users need to take into the following considerations:
- Storage cost where some partitions are less important than others and can be stored on cheaper storage tiers.
- Regional compliance where data cannot leave a region and cannot be persisted outside the region (link)
This is where Alluxio comes in and interfaces with applications like Hive as a distributed virtual file system to create tables with multiple partitionings in a different storage system. For example, to reduce the storage cost with the same high performance in a multi-cluster environment, you can use Hive with Alluxio to create tables partitioned by date ranges to allow most frequently used datasets to reside on higher tiers of storage(MEM) and have older/less frequently accessed data in other storage service like HDFS with lower cost storage(SSD, HDD) or even remote cloud storage (like S3). In this regard data will always reside in the under-storage system as the source of truth and can be residing temporarily in the Alluxio file system.
This article aims to guide end users to leverage Alluxio to create external tables in Hive backed by different file locations, each location however representing a subset of the overall data partitioned by the value of choice.
Example
Setup Alluxio
First, deploy an Alluxio cluster across 4 nodes(1 Master, 3 Workers) on EC2 instances. Please refer to this documentation to install Alluxio cluster or use our Sandbox to deploy Alluxio on AWS EC2 with one click for free.
Alluxio has 2 mounts – HDP as root mount point, S3 bucket as nested mount
HDP is also deployed on 4 nodes. For this example, it is co-located on the Alluxio nodes, but could be a separate cluster.
[centos@ip-172-31-8-93 alluxio]$ ./bin/alluxio fs ls /
drwxr-xr-x centos centos 1 PERSISTED 02-20-2019 00:31:53:234 DIR /hdfs
drwxr-xr-x centos centos 1 NOT_PERSISTED 02-19-2019 22:21:46:005 DIR /local
drwxrwxrwx madan madan 1 PERSISTED 02-19-2019 22:14:29:532 DIR /s3
Create Hive Tables
Run the following Hive DDL Statements to create an external Hive table call_center_s3
:
create external table call_center_s3(
cc_call_center_sk bigint
, cc_call_center_id string
, cc_rec_end_date string
, cc_closed_date_sk bigint
, cc_open_date_sk bigint
, cc_name string
, cc_class string
, cc_employees int
, cc_sq_ft int
, cc_hours string
, cc_manager string
, cc_mkt_id int
, cc_mkt_class string
, cc_mkt_desc string
, cc_market_manager string
, cc_division int
, cc_division_name string
, cc_company int
, cc_company_name string
, cc_street_number string
, cc_street_name string
, cc_street_type string
, cc_suite_number string
, cc_city string
, cc_county string
, cc_state string
, cc_zip string
, cc_country string
, cc_gmt_offset double
, cc_tax_percentage double
)
partitioned by (cc_rec_start_date string)
row format delimited fields terminated by '|'
location 'alluxio://172.31.9.93:19998/s3/';
We start off by creating the original call_center
table against the S3 bucket, notice here that when creating this external table that we do so with the cc_rec_start_date
column partitioned in the table. This table is then altered by date and the location is modified accordingly. We alter the table with 3 values as follows:
ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='1998-01-01')
location 'alluxio://172.31.9.93:19998/s3/';
We have altered table and added the partition for the year 1998, we then allow this to remain in the Alluxio S3 file location as it represents colder data in a remote storage device.
ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='2001-01-01')
location 'alluxio://172.31.9.93:19998/s3/';
We altered the table with the date range of 2001, this too we will allow to remain on remote S3 storage as it is less frequently accessed and more recent data we can store in an HDFS cluster on disk(SSD).
ALTER TABLE call_center_s3
ADD PARTITION (cc_rec_start_date='2002-01-01')
location 'alluxio://172.31.9.93:19998/hdfs/';
Finally for data that is most current we can add a partition and have it stored against the Alluxio HDFS file location which are on SSDs. This location is on disk across the Alluxio nodes.
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x centos centos 9356 PERSISTED 04-19-2019 18:30:59:822 0% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x centos centos 9356 PERSISTED 04-19-2019 19:04:45:107 0% /s3/data-m-00099
In the above screenshot, we see that for the HDFS and S3 mount points it shows the data is still only being PERSISTED
in the under file system and is not yet in Alluxio memory. This is evident by the fact that the percentage for the file in this case is 0% which shows that the file is not yet in Alluxio MEM.
Now after we query against the Hive table we can verify that it is indeed reaching out to the respective understorage file locations by checking once again to see if that data is now in Alluxio MEM(cache_promote is set as default read type policy) and the file shows 100%. Also note, that the data that is in the S3 partition does not get pulled into Alluxio as that partition was eliminated based on the predicate by Hive runtime.
hive> select * from call_center_s3 where cc_rec_start_date='2002-01-01';
Note this is the date range for table residing against HDFS
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x centos centos 9356 PERSISTED 04-19-2019 18:30:59:822 100% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x centos centos 9356 PERSISTED 04-19-2019 19:04:45:107 0% /s3/data-m-00099
Verifying through the Alluxio namespace that the query did bring data up from HDFS and not from another remote directory and persisted it in Alluxio MEM.
hive> select * from call_center_s3 where cc_rec_start_date='1998-01-01';
Similarly for S3 we query against the table with the date value set to 1998.
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /hdfs
drwxr-xr-x centos centos 9356 PERSISTED 04-19-2019 18:30:59:822 100% /hdfs/data-m-0099
[centos@ip-172-31-8-218 alluxio]$ ./bin/alluxio fs ls /s3
drwxr-xr-x centos centos 9356 PERSISTED 04-19-2019 19:04:45:107 100% /s3/data-m-00099
Verify again through Alluxio command line and see that the data was brought up from S3 and into the Alluxio namespace.
Conclusion
Many users leverage partitioning to speed up Hive performance and improve maintenance. This blog talks about an interesting way to combine Hive with Alluxio to store a single table but leveraging multiple different storage resources for higher efficiency.