Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

Originally published on AWS blog: https://aws.amazon.com/blogs/apn/bursting-your-on-premises-data-lake-analytics-and-ai-workloads-on-aws/

Data is being generated from a myriad of sources. That data is crunched by analytics, machine learning (ML), and artificial intelligence (AI) models to detect patterns in behavior and gather insights.

Developing and maintaining an on-premises data lake to make sense of the ingested data is a complex undertaking. To maximize the value of data and use it as the basis for critical decisions, the data platform must be flexible and cost-effective.

In this post, I will outline a solution for building a hybrid data lake with Alluxio to leverage analytics and AI on Amazon Web Services (AWS) alongside a multi-petabyte on-premises data lake. Alluxio’s solution is called “zero-copy” hybrid cloud, indicating a cloud migration approach without first copying data to Amazon Simple Storage Service (Amazon S3).

The hybrid data lake approach detailed in this post allows for complex data pipelines on-premises to coexist with a modern, flexible, and secure computing paradigm on AWS.

Alluxio is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency that enables incremental migration of a data lake to AWS.

Solution Overview

Data platforms are being built with decoupled storage and compute to scale capacity independently. Computing on AWS is attractive to reduce infrastructure expenses by making use of elastic compute resources for bursty workloads.

At the same time, operating expenses are low on AWS, with fully managed options for SQL and machine learning that reduce the expertise required to run a data platform on-premises. Data has gravity, though, and making data generated on-premises accessible in the cloud can be challenging.

One recommended way to architect your data platform is to “burst” workloads to AWS while the data sources remain on-premises. Alluxio is a data orchestration platform that enables the “zero-copy” hybrid cloud burst solution by removing the complexities of data movement. Workloads can be migrated to AWS without first moving data there; instead, data is brought to applications on demand.

This approach allows incremental migration of the data pipeline without managing multiple data copies. Applications accessing the same data sets coexist on-premises and on AWS to optimize infrastructure spending by getting the best of both worlds.

Alluxio provides an abstraction layer to the data sources spread across regions and data centers, along with policies to manage data movement across environments.
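
As a concrete illustration of this abstraction layer, the sketch below mounts an on-premises HDFS path into the Alluxio namespace so applications address it through a single Alluxio URI. The NameNode address and paths are hypothetical placeholders, not values used later in this tutorial:

# Mount an on-premises HDFS directory under /datalake in the Alluxio namespace
# (namenode.onprem.example.com is a hypothetical address)
$ alluxio fs mount /datalake hdfs://namenode.onprem.example.com:8020/warehouse

# Applications now read and list data through the Alluxio path instead of the HDFS URI
$ alluxio fs ls /datalake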

Burst Compute Without Data Migration

A data pipeline has various lifecycle stages: collection, ingestion, preparation, computation, and presentation. The outlined solution allows customers to migrate the data pipeline incrementally from the on-premises environment to AWS with a hybrid data platform.

A customer’s cloud migration journey may begin with the primary data source on-premises and the use of compute on AWS. “Zero-copy” hybrid cloud bursting allows customers who are not ready for data migration to start leveraging AWS cloud-native computing.

With the Amazon EMR data platform, compute engines such as Apache Spark and Presto are available on AWS to run multi-petabyte analysis directly on data residing on-premises.

Alluxio acts as the access layer, pulling data only when requested. The data orchestration system also allows for definition of pre-fetch policies and pinning of loaded data into a multi-tier file system.
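
As a minimal sketch of those capabilities, and assuming a data set has already been mounted at a hypothetical Alluxio path such as /datalake/hot_table, pre-fetching and pinning look like this:

# Pre-fetch the data set into Alluxio managed storage across the workers
$ alluxio fs distributedLoad /datalake/hot_table

# Pin the path so it is not evicted from the multi-tier cache
$ alluxio fs pin /datalake/hot_table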

Figure 1 – Typical architecture with analytics on Amazon EMR and Alluxio to access data on-premises.

Live Data Migration to Cloud Storage

Moving data from one storage system to another, while keeping computation jobs running without interruption, is a challenge. Traditionally, this is done by manually copying the data to the destination, asking users to update their applications and queries to use the new Uniform Resource Identifier (URI), waiting until all of the updates are complete, and then finally removing the original copy.

With Alluxio, this can be done seamlessly by using data migration policies. Applications and data catalogs remain unchanged even when hot or archival data is selectively moved from storage on-premises to Amazon S3.

With data in S3, it becomes available to other AWS analytics and machine learning services.

Getting Started

In the following tutorial, I will look at how to use Alluxio to bridge the gap between on-premises data and compute engines on AWS. Using this reference, you can start creating a hybrid data lake to leverage cloud compute. Users of all levels are exposed to a suite of AI, ML, and analytical tools available in this new environment.

This tutorial focuses on a few specific examples, but they are meant to illustrate what can be done with this environment and technology. If your technology stack looks different, much of this will still apply.

I will use Amazon EMR and Terraform to quickly get two sample clusters up and running. One of these clusters will act as the existing data lake on Hadoop; the other will be running Presto and Alluxio.

Objectives

  • Use Amazon EMR and Terraform to create two clusters:
    • A cluster that simulates an existing data lake, with Hadoop, Hive, and HDFS installed. Let’s call this cluster on-prem-cluster.
    • A compute cluster with Presto and Alluxio installed. We’ll call this cluster alluxio-compute-cluster.
  • Configure a hybrid data lake with Alluxio to interface with HDFS and Amazon S3.
  • Prepare sample data to use.
  • Execute queries with Presto on AWS accessing HDFS on-premises.
  • Define a policy in Alluxio to copy data to S3.

Before You Begin

Prerequisites to execute this tutorial:

  • An AWS account with permissions to create VPCs, VPC peering connections, and Amazon EMR clusters in two regions.
  • Terraform installed on your local machine.
  • An OpenSSH key pair, with the public key available at ~/.ssh/id_rsa.pub, for SSH access to the clusters.

By default, this tutorial uses:

  • 1 EMR on-prem-cluster in us-west-1:
    • 1 master: r4.4xlarge on-demand instance (16 vCPU, 122 GiB memory)
    • 3 workers: r4.4xlarge spot instances (16 vCPU, 122 GiB memory)
  • 1 EMR alluxio-compute-cluster in us-east-1:
    • 1 master: r4.4xlarge on-demand instance (16 vCPU, 122 GiB memory)
    • 3 workers: r5d.4xlarge spot instances (16 vCPU, 128 GiB memory, 600 GB NVMe SSD)

Launching Clusters Using Terraform

For this tutorial, I will use Terraform to execute the following:

  • Network: Create two virtual private clouds (VPCs) in different regions with VPC peering to connect them.
  • Amazon EMR: Spin up two EMR clusters.

Create the clusters by running the following commands locally. First, download the Terraform example:

$ wget https://alluxio-public.s3.amazonaws.com/enterprise-terraform/stable/aws_hybrid_emr_simple.tar.gz 
$ tar -zxf aws_hybrid_emr_simple.tar.gz 
$ cd aws_hybrid_emr_simple

Initialize the Terraform working directory to download the necessary plugins to execute. You only need to run this once for the working directory:

$ terraform init

Create the networking resources and launch the two EMR clusters:

$ terraform apply

Type yes to confirm resource creation. This step will take about 15 minutes to provision the clusters.

After the clusters are launched, the public DNS names of the on-premises cluster master and compute cluster master will be displayed on the console:

Apply complete! Resources: 40 added, 0 changed, 0 destroyed.

Outputs:
alluxio_compute_master_public_dns = <>
on_prem_master_public_dns = <>

Keep this terminal open; you will use it to destroy the resources once you are done with the tutorial, as shown below.
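
When you are finished, the same working directory can tear everything down. This is standard Terraform usage rather than anything specific to this example:

# Remove the EMR clusters, VPC peering connection, and VPCs created by this example
$ terraform destroy

Type yes to confirm the deletion.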

Access the Clusters

Amazon EMR clusters, by default, will use your OpenSSH public key stored at ~/.ssh/id_rsa.pub to generate temporary AWS key pairs for SSH access. Replace the DNS names below with the values shown in the output of terraform apply.
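
If you prefer to reference the DNS names as shell variables in the commands that follow, one option is to read them from the Terraform outputs. The -raw flag requires Terraform 0.15 or later; on older versions, omit it and strip the surrounding quotes:

# Run these from the Terraform working directory on your local machine
$ on_prem_master_public_dns=$(terraform output -raw on_prem_master_public_dns)
$ alluxio_compute_master_public_dns=$(terraform output -raw alluxio_compute_master_public_dns)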

SSH into the on-premises cluster:

$ ssh hadoop@${on_prem_master_public_dns}

SSH into the Alluxio compute cluster:

$ ssh hadoop@${alluxio_compute_master_public_dns}

Prepare Data

If you wish to evaluate the setup, copy a dataset to the on-premises cluster and run queries from the compute cluster. Changes to the on-premises cluster, such as updating tables in Hive, will seamlessly propagate to Alluxio using Active Sync, without any user intervention.

If your on-premises data lake does not use the Hadoop Distributed File System (HDFS) as storage, refer to other suggested synchronization mechanisms.

SSH into the Hadoop cluster on-prem-cluster to prepare TPC-DS data for query execution:

on-prem-cluster$ hdfs dfs -mkdir /tmp/tpcds/
on-prem-cluster$ s3-dist-cp --src s3a://autobots-tpcds-pregenerated-data/spark/unpart_sf100_10k/store_sales/ --dest hdfs:///tmp/tpcds/store_sales/
on-prem-cluster$ s3-dist-cp --src s3a://autobots-tpcds-pregenerated-data/spark/unpart_sf100_10k/item/ --dest hdfs:///tmp/tpcds/item/
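
Before creating the tables, you can optionally confirm that the copy completed by listing the target directories. This is a plain HDFS check and not part of the original walkthrough:

# Verify that both data sets landed in HDFS and check their sizes
on-prem-cluster$ hdfs dfs -ls /tmp/tpcds/
on-prem-cluster$ hdfs dfs -du -h /tmp/tpcds/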

Once data is copied into HDFS, create the table metadata in Hive:

on-prem-cluster$ wget https://alluxio-public.s3.amazonaws.com/hybrid-quickstart/create-table.sql
on-prem-cluster$ hive -f create-table.sql
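
To sanity-check the metadata, you can list what create-table.sql registered. The table name store_sales below is an assumption based on the data staged above, not something guaranteed by the script:

# List the tables created in the default Hive database
on-prem-cluster$ hive -e "show tables;"

# Inspect one table's schema and its HDFS location (assumes a table named store_sales)
on-prem-cluster$ hive -e "describe formatted store_sales;"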

Step 1: Run a Query

An Amazon EMR cluster with Alluxio pre-configures Presto to access data from remote data sources via Alluxio. Simply SSH into the compute cluster alluxio-compute-cluster and run a query.

Data is loaded on-demand into Alluxio managed storage on the EMR cluster:

alluxio-compute-cluster$ presto-cli --catalog onprem --schema default --execute "select * from students;"
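
Because the TPC-DS store_sales and item data was staged earlier, you can also run heavier queries against those tables, assuming create-table.sql registered them under these names:

alluxio-compute-cluster$ presto-cli --catalog onprem --schema default --execute "select count(*) from store_sales;"
alluxio-compute-cluster$ presto-cli --catalog onprem --schema default --execute "select count(*) from item;"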

Use Alluxio Admin CLI commands, such as report, to look at data being loaded from HDFS into Alluxio managed storage.

Data is loaded into Alluxio on access, and applications continue to work unchanged as if the HDFS storage cluster is local to compute in AWS:

alluxio-compute-cluster$ alluxio fsadmin report
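
You can also list paths in the Alluxio namespace; the output of ls includes, for each file, the percentage that currently resides in Alluxio managed storage, which makes it easy to see what the queries above pulled in:

# List the root of the Alluxio namespace and inspect the in-Alluxio percentage per file
alluxio-compute-cluster$ alluxio fs ls /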

Step 2: Use Policies for Data Migration

Next, we will see how to:

  • Mount HDFS and Amazon S3 together under an Alluxio union-UFS.
  • Create a migration policy to move data from HDFS to S3.

Alluxio enforces the policy and moves the data automatically, while users keep using the same Alluxio path to access their data, without worrying that their computation jobs will fail after the data is moved.

Replace ${my_bucket} with an accessible bucket name:

alluxio-compute-cluster$ alluxio fs mount \
  --option alluxio-union.hdfs.uri=hdfs:///tmp/tpcds \
  --option alluxio-union.hdfs.option.alluxio.underfs.version=hadoop-2.8 \
  --option alluxio-union.s3.uri=s3://${my_bucket}/ \
  --option alluxio-union.priority.read=s3,hdfs \
  --option alluxio-union.collection.create=s3 \
  /union union://union_ufs/

alluxio-compute-cluster$ alluxio fs policy add /union/store_sales/ "ufsMigrate(olderThan(1m), UFS[hdfs]:REMOVE, UFS[s3]:STORE)"

Once the policy is created, you should see data being moved in the background by inspecting the HDFS and S3 locations. The Alluxio policy CLI can also be used to check the status.
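
A few ways to watch the migration progress are sketched below. The policy subcommand for listing policies may vary by Alluxio version, the S3 check assumes the AWS CLI is configured, and ${my_bucket} is the bucket mounted into the union-UFS above:

# List registered policies (subcommand name may differ across Alluxio versions)
alluxio-compute-cluster$ alluxio fs policy list

# Watch files disappear from the on-premises HDFS location...
on-prem-cluster$ hdfs dfs -ls /tmp/tpcds/store_sales/

# ...and appear in the S3 bucket
$ aws s3 ls s3://${my_bucket}/ --recursive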

Conclusion

AWS meets compute challenges with on-demand provisioning and fully managed, elastic infrastructure. Alluxio bridges the gap for data access by orchestrating data movement between cloud compute and a data lake on-premises.

Incremental migration is an appealing way to reduce compute resource contention on-premises while accelerating AWS adoption for agility and cost efficiency.

To get started with “zero-copy” hybrid cloud bursting to migrate your on-premises workloads to AWS, visit alluxio.io, where you will find a how-to guide and more details on the free trial.