How do you modify location metadata in Hive?

Problem

If you have hundreds of external tables defined in Hive, what is the easiest way to change those references to point to new locations?

This is a fairly common challenge for those who want to integrate Alluxio into their stack. A typical setup we see is users running Spark SQL or Presto as their query framework. These engines tie into Hive, which provides the metadata that points them to the correct location of the Parquet or ORC files living in HDFS or an object store.

Using Alluxio typically requires changing the URI scheme as well as making a slight change to the path.

For example, S3 and HDFS locations such as

s3://alluxio-test/ufs/tpc-ds-test-data/parquet/scale100/warehouse/
hdfs://34.243.133.102:8020/tpc-ds-test-data/parquet/scale100/warehouse/

become

alluxio://alluxio-master1/scale100/warehouse/
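
The shorter Alluxio path works because the original S3 or HDFS location gets mounted into the Alluxio namespace. As a rough sketch (the mount point below is purely illustrative, not taken from my setup), the mapping could be created with the Alluxio CLI:

# Mount the S3 prefix so that alluxio:///scale100/warehouse/ resolves
# to the original bucket path (illustrative mount point)
alluxio fs mount /scale100 s3://alluxio-test/ufs/tpc-ds-test-data/parquet/scale100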

So what's the best way to do this when there may be hundreds of tables involved?

Solution

Recently I had to run through this myself for some internal testing. This quick article walks through some of the things I did to modify Hive.

First, I looked at my Hive table definitions to see what needed to be changed. My setup runs on AWS EMR, and I used AWS Glue and a crawler to import my Parquet files into Hive. I noticed the crawler makes a mistake with casting: in this example, my table is partitioned on “cs_sold_date_sk”, which should be an int, but the crawler casts it as a string in the PARTITIONED BY clause. I will need to fix this as well as change the location of the files to an Alluxio URI.

hive> show create table catalog_sales;
CREATE TABLE `catalog_sales`(
  `cs_sold_time_sk` int,
  `cs_ship_date_sk` int,
  `cs_bill_customer_sk` int,
  `cs_bill_cdemo_sk` int,
  `cs_bill_hdemo_sk` int,
    .........
PARTITIONED BY (
  `cs_sold_date_sk` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://34.243.133.102:8020/tpcds/catalog_sales'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='nstest1',
  'averageRecordSize'='115',
  'classification'='parquet',
  'compressionType'='none',
  'last_modified_by'='hadoop',
  'last_modified_time'='1558657957',
  'objectCount'='1838',
  'recordCount'='144052377',
  'sizeKey'='12691728701',
  'typeOfData'='file')
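
For reference, after the fixes the relevant parts of the definition should look roughly like this (the exact Alluxio path depends on how the under store is mounted into the Alluxio namespace; the root mount below is just my assumption):

PARTITIONED BY (
  `cs_sold_date_sk` int)
  .........
LOCATION
  'alluxio:///tpcds/catalog_sales'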

The first step here is to dump all of my table definitions. I used the script below to simplify things. The output of this script is a .ddl file for every table, containing that table's CREATE TABLE statement.

#!/bin/sh
# Dump the CREATE TABLE statement for every table in the given database
# into a <table>.ddl file.

if [ $# -ne 1 ]
then
    echo "Usage: $0 <databasename>"
    exit 1
fi
DB_NAME="${1}"
HIVE="hive -S"

# List all tables in the database, then dump each table's DDL
tables="`${HIVE} --database ${DB_NAME} -e 'show tables;'`"
for table in ${tables}
do
    ${HIVE} --database ${DB_NAME} -e "show create table ${table};" > ${table}.ddl
done
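
Assuming the script is saved as dump_ddls.sh (the name is my own), running it against the source database produces one DDL file per table:

./dump_ddls.sh nstest
ls *.ddl
catalog_sales.ddl  ...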

From there, I look at the things I want to fix in Hive.

  1. Fix the partition key type
  2. Replace the file locations with the correct Alluxio URI
  3. Inform Hive that there are new partitions

I wrote another short script to do the work for me. In this case, I create a new database so that I can work with the dataset either in its original definition or through Alluxio.

#!/bin/sh
# Re-create every table from its .ddl file in a new database,
# fixing the partition key type and pointing the location at Alluxio.

if [ $# -ne 4 ]
then
    echo "Usage: $0 <databasename> <newdatabasename> <s3filepath> <alluxioufspath>"
    echo "example: $0 nstest nstest1 s3://alluxio-test/ufs/tpc-ds-test-data/parquet/scale100 alluxio:///"
    exit 1
fi
DB_NAME="${1}"
NEW_DB_NAME="${2}"
S3_FILE_PATH="${3}"
ALLUXIO_UFS_PATH="${4}"
HIVE="hive -S"
tables="`ls *.ddl`"

# Create new database
hive -e "create database ${NEW_DB_NAME}"

for table in ${tables}
do
    echo "Working on $table"
    TABLE_NAME="`basename $table .ddl`"

    # Fix the partition key: change the type on the line after
    # PARTITIONED BY from string to int
    sed -i '
        :1
        /PARTITIONED/ {
            n
            s/string/int/
            b1
        }' ${table}

    # Change the file path to the Alluxio URI
    sed -i s^${S3_FILE_PATH}^${ALLUXIO_UFS_PATH}^ ${table}

    # Create the table definition in the new database
    hive --database ${NEW_DB_NAME} -f ${table}

    # Repair partitions so Hive discovers them under the new location
    hive --database ${NEW_DB_NAME} -e "MSCK REPAIR TABLE ${TABLE_NAME}"
done
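
With the DDL files from the previous step in the current directory, the invocation matches the usage line, and the result can be spot-checked against one of the new table definitions (the script name is again my own):

./convert_tables.sh nstest nstest1 s3://alluxio-test/ufs/tpc-ds-test-data/parquet/scale100 alluxio:///
hive --database nstest1 -e "show create table catalog_sales;" | grep -A1 LOCATION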

If you are working with a more standard Hive setup and only need to change the table locations in the current database instance, you can simplify things with the ALTER TABLE statement.
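
For a single table, the pair of statements looks like this (the Alluxio path here assumes the same root mount as before):

ALTER TABLE catalog_sales SET LOCATION 'alluxio:///tpcds/catalog_sales';
MSCK REPAIR TABLE catalog_sales;

The script below generates the same pair of statements for every .ddl file and runs them against the database in one pass.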

#!/bin/sh
# Generate ALTER TABLE / MSCK REPAIR statements for every .ddl file
# and run them against the existing database in one pass.

if [ $# -ne 3 ]
then
    echo "Usage: $0 <databasename> <s3filepath> <alluxioufspath>"
    echo "example: $0 db1 s3://alluxio-test/ufs/tpc-ds-test-data/parquet/scale100 alluxio:///"
    exit 1
fi
DB_NAME="${1}"
S3_FILE_PATH="${2}"
ALLUXIO_UFS_PATH="${3}"
HIVE="hive -S"
tables="`ls *.ddl`"

# File that collects the generated statements (the name is arbitrary)
FILE="alter_${DB_NAME}.hql"
> ${FILE}

for table in ${tables}
do
    echo "Working on $table"
    TABLE_NAME="`basename $table .ddl`"

    # Swap the file path for the Alluxio URI and pull the new location
    # out of the LOCATION clause, stripping quotes and whitespace
    TABLE_LOCATION="$(sed "s^${S3_FILE_PATH}^${ALLUXIO_UFS_PATH}^" ${table} | grep alluxio | tr -d "'" | xargs)"

    echo "ALTER TABLE ${TABLE_NAME} SET LOCATION '${TABLE_LOCATION}';" >> ${FILE}
    echo "MSCK REPAIR TABLE ${TABLE_NAME};" >> ${FILE}
done

hive --database ${DB_NAME} -f ${FILE}
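
This variant takes three arguments since no new database is created; the invocation looks like this (again, assuming a script name of alter_locations.sh):

./alter_locations.sh nstest s3://alluxio-test/ufs/tpc-ds-test-data/parquet/scale100 alluxio:///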