TUTORIAL: GETTING STARTED WITH AWS EMR AND ALLUXIO AMI


5 min Tutorial

AWS EMR provides on-demand clusters for handling compute workloads. It manages the deployment of various Hadoop services and provides hooks into these services for customization. Alluxio can run on EMR to provide functionality beyond what EMRFS currently offers. Aside from the added performance benefits of caching, Alluxio also enables users to run compute workloads against on-premises storage or even a different cloud provider’s storage, e.g. GCS or Azure Blob Store.



Prerequisites

  • Account with AWS
  • IAM Account with the default EMR Roles
  • Key Pair for EC2
  • An S3 Bucket
  • AWS CLI: Make sure that the AWS CLI is also set up and ready with the required AWS Access/Secret key

Most of the prerequisites are covered in the AWS EMR Getting Started guide. An S3 bucket is needed as Alluxio’s Root Under File System (UFS). The bootstrap script and the cluster configuration files are available in a public S3 bucket. If required, the root UFS can be reconfigured to be HDFS.
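The prerequisites above can be sanity-checked from a terminal before launching anything. The following is a minimal sketch, not part of the Alluxio or AWS tooling: check_prereqs is a hypothetical helper name, and the bucket is whichever one you plan to use as the root UFS. It relies on two AWS CLI calls, aws sts get-caller-identity and aws s3api head-bucket.

```shell
# Hypothetical helper: verify the AWS CLI, credentials, and the UFS bucket.
# Usage: check_prereqs <bucket-name>
check_prereqs() (
  bucket="$1"
  if [ -z "$bucket" ]; then
    echo "usage: check_prereqs <bucket-name>" >&2
    exit 2
  fi
  # The AWS CLI must be installed and on the PATH.
  command -v aws >/dev/null 2>&1 || { echo "aws CLI not found" >&2; exit 1; }
  # The configured access/secret key must resolve to an identity.
  aws sts get-caller-identity >/dev/null 2>&1 || { echo "AWS credentials not configured" >&2; exit 1; }
  # The bucket must exist and be reachable with these credentials.
  aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1 || { echo "cannot access s3://$bucket" >&2; exit 1; }
  echo "prerequisites look OK for s3://$bucket"
)
```

For example, check_prereqs my-test-bucket prints a confirmation line when all three checks pass and returns a non-zero status otherwise. It does not check the EMR roles or the EC2 key pair.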


BASIC SETUP

Alluxio Enterprise Edition can be integrated with EMR using an Alluxio AMI from the AWS Marketplace. The first step is to subscribe to the Marketplace AMI. The first 7 days are free; after that, you pay as you go.

Go to the Alluxio AMI page on the AWS Marketplace

Click on “Continue to Subscribe”.

Review pricing and the terms. Once done, “Accept Terms”.

The Alluxio AMI is now associated with your account. The subscription includes a 7 day free trial.

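If the default EMR roles do not exist yet, create them:

$ aws emr create-default-roles

The create-cluster command takes the following parameters: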
  • Release Label is the version of EMR that should be installed
  • The Alluxio Marketplace Enterprise Edition AMI ID “ami-0a53794238d399ab6” which will be used as the base AMI for the EMR cluster
  • Instance count and type are the number of nodes and type of instances for the EMR cluster. Make sure you pick an instance that the Alluxio marketplace AMI supports
  • Name is the EMR cluster name for tracking
  • The root-ufs-uri. This should be an s3:// URI designating the root mount of the Alluxio file system. This property is mandatory and has the form s3://<bucket-name-for-ufs>/<mount-point>/. The mount point should be a folder; to create a folder in AWS, follow the instructions in this link
  • Extra Alluxio options. These are passed with -p as a delimited list of key-value pairs in the format <key>=<value>; the delimiter is set with -s (the commands below use ‘|’). For example, alluxio.user.file.writetype.default=CACHE_THROUGH tells Alluxio to write files synchronously to the storage system underneath. Learn more about the write type option
aws emr create-cluster \
--release-label <release-version> \
--custom-ami-id <Alluxio-Enterprise-AMI-ID> \
--instance-count <instance-count> \
--instance-type <instance-type> \
--applications Name=Spark Name=Presto Name=Hive \
--name <Cluster-name> \
--bootstrap-actions \
Path=s3://alluxio-public/emr/2.0.1/alluxio-emr.sh,\
Args=[s3://<bucket-name-for-ufs>/<mount-point>,\
-p,"alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=CACHE_THROUGH",\
-s,"|"] \
--configurations https://alluxio-public.s3.amazonaws.com/emr/2.0.1/alluxio-emr.json \
--ec2-attributes KeyName=<ec2-keypair-name>

Make sure you check your instance limits to ensure you can create the cluster for the given instance type.
You can check your EC2 instance limits here.
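Since the root-ufs-uri argument is mandatory and must have the form s3://<bucket-name-for-ufs>/<mount-point>/, it can be worth checking the value before launching a cluster. The function below is only a sketch: is_valid_root_ufs is a hypothetical name, and it checks the shape of an S3 URI only (it does not contact AWS, and it would reject an HDFS root UFS).

```shell
# Hypothetical helper: check that a root UFS URI looks like
# s3://<bucket-name>/<mount-point>/ (scheme, bucket, and a non-empty folder).
is_valid_root_ufs() {
  case "$1" in
    s3://*/*)
      rest="${1#s3://}"        # strip the scheme
      bucket="${rest%%/*}"     # text before the first slash
      mount="${rest#*/}"       # text after the first slash
      mount="${mount%/}"       # ignore one trailing slash
      [ -n "$bucket" ] && [ -n "$mount" ]
      ;;
    *) return 1 ;;
  esac
}
```

The function returns 0 for a value such as s3://my-test-bucket/mount/ and non-zero for a bare bucket (s3://my-test-bucket/) or another scheme.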

Here’s an example of the AWS EMR command to create a cluster using a bootstrap script and using the Alluxio Marketplace AMI.
Other parameters can be added as required.

aws emr create-cluster \
--release-label emr-5.23.0 \
--custom-ami-id ami-0a53794238d399ab6 \
--instance-count 3 \
--instance-type r4.2xlarge \
--applications Name=Spark Name=Presto Name=Hive \
--name try-Alluxio \
--bootstrap-actions \
Path=s3://alluxio-public/emr/2.0.1/alluxio-emr.sh,\
Args=[s3://my-test-bucket/mount/,\
-p,"alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=ASYNC_THROUGH",\
-s,"|"] \
--configurations https://alluxio-public.s3.amazonaws.com/emr/2.0.1/alluxio-emr.json \
--ec2-attributes KeyName=admin-key
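The value passed after -p in the command above is a list of <key>=<value> pairs joined by the delimiter given with -s. When the list grows, assembling it by hand is error-prone, so a small helper can build it. This is a sketch under that assumption only; join_props is a hypothetical name and is not part of the bootstrap script.

```shell
# Hypothetical helper: join key=value pairs with a delimiter for the -p argument.
# Usage: join_props <delimiter> <key=value>...
join_props() (
  delim="$1"
  shift
  out=""
  for kv in "$@"; do
    case "$kv" in
      # A pair that contains the delimiter would be split incorrectly.
      *"$delim"*) echo "property contains the delimiter: $kv" >&2; exit 1 ;;
      *=*) ;;
      *) echo "not a key=value pair: $kv" >&2; exit 1 ;;
    esac
    if [ -z "$out" ]; then out="$kv"; else out="$out$delim$kv"; fi
  done
  printf '%s\n' "$out"
)
```

The result can be stored in a variable, for instance PROPS=$(join_props '|' alluxio.user.file.writetype.default=ASYNC_THROUGH), and then used in the Args list as -p,"$PROPS",-s,"|".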

Log into the EMR Console

Once the cluster is in the ‘Waiting’ stage, click on the cluster details to get the ‘Master public DNS’ if available or click on the “Hardware” tab to see the master and worker details. Clicking on the master instance group will show you the Public DNS.
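The same information is available from the AWS CLI, which may be convenient in scripts. The sketch below assumes you kept the cluster id that create-cluster printed; master_dns is a hypothetical helper name, while aws emr wait cluster-running and aws emr describe-cluster are standard AWS CLI commands.

```shell
# Hypothetical helper: wait for the cluster, then print the master public DNS.
# Usage: master_dns <cluster-id>   (the id printed by create-cluster)
master_dns() (
  cluster_id="$1"
  [ -n "$cluster_id" ] || { echo "usage: master_dns <cluster-id>" >&2; exit 2; }
  # Blocks until the cluster reaches the RUNNING or WAITING state.
  aws emr wait cluster-running --cluster-id "$cluster_id" || exit 1
  # Extract only the master public DNS name from the cluster description.
  aws emr describe-cluster --cluster-id "$cluster_id" \
    --query 'Cluster.MasterPublicDnsName' --output text
)
```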

SSH into the master instance using the keypair provided in the previous command. If a security group isn’t specified via CLI, the default EMR security group will not allow inbound SSH. To SSH into the machine, a new rule will need to be added.
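The inbound SSH rule can be added in the EC2 console or with the AWS CLI. The following sketch uses aws ec2 authorize-security-group-ingress; allow_ssh is a hypothetical helper name, and the security group id and source address are values you supply (the id of the master's security group is shown in the cluster details).

```shell
# Hypothetical helper: allow inbound SSH (TCP 22) from one IPv4 address.
# Usage: allow_ssh <security-group-id> <ipv4-address>
allow_ssh() (
  group_id="$1"
  ip="$2"
  if [ -z "$group_id" ] || [ -z "$ip" ]; then
    echo "usage: allow_ssh <security-group-id> <ipv4-address>" >&2
    exit 2
  fi
  # A /32 CIDR restricts the rule to exactly one address.
  aws ec2 authorize-security-group-ingress \
    --group-id "$group_id" --protocol tcp --port 22 --cidr "$ip/32"
)
```

Restricting the rule to a single address rather than opening port 22 to all sources keeps the master node's exposure small.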

ssh -i ~/admin-key.pem hadoop@PUBLICDNS.com
sudo runuser -l alluxio -c "/opt/alluxio/bin/alluxio runTests"

Sample output:

runTest BASIC CACHE_PROMOTE MUST_CACHE
2019-08-30 21:27:23,798 INFO  BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 427 ms.
2019-08-30 21:27:23,895 INFO  BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 97 ms.
Passed the test!
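runTests prints one "Passed the test!" line per passing test, as in the sample above, so a saved copy of the output can be summarized with grep. This is a small sketch; count_passed is a hypothetical helper name and the log file is one you create yourself, for example by piping the runTests command through tee.

```shell
# Hypothetical helper: count "Passed the test!" lines in saved runTests output.
# Usage: count_passed <log-file>
count_passed() {
  grep -c 'Passed the test!' "$1"
}
```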

Using this bootstrap script, Alluxio is installed in /opt/alluxio/ by default. Hive and Presto are already configured to connect to Alluxio. The cluster also uses AWS Glue as the default metastore for both Presto and Hive. This will allow you to maintain table definitions between multiple runs of the Alluxio cluster.

See the sample command above for reference.

Note: The default Alluxio Worker memory is set to one third of the instance's total memory. If the instance type has less than 20GB of memory, make your own copy of the alluxio-emr.sh script and change the value there.
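To see what one third of total memory amounts to on a given instance, the arithmetic can be reproduced on the node itself. This is an illustrative sketch only and is not the logic from alluxio-emr.sh, which is not reproduced here; both function names are hypothetical, and the result is rounded down to whole MB.

```shell
# Hypothetical helper: one third of total memory, in whole MB.
# Usage: third_of_memory_mb <total-memory-in-kB>
third_of_memory_mb() {
  echo $(( $1 / 1024 / 3 ))
}

# Hypothetical helper: read MemTotal (in kB) from a /proc/meminfo style file.
# Usage: mem_total_kb [file]   (defaults to /proc/meminfo)
mem_total_kb() {
  awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

On a Linux node, third_of_memory_mb "$(mem_total_kb)" prints the value for that machine. For a MemTotal of 62914560 kB (60 GiB), the result is 20480 MB.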


CREATING A TABLE

The simplest way to start using EMR with Alluxio is to create a table on Alluxio and query it using Presto/Hive.

SSH into the master node, then switch to the ‘alluxio’ user and run the following commands.

# Create the table directory as the alluxio user and hand it to hadoop
$ sudo su alluxio
$ /opt/alluxio/bin/alluxio fs mkdir /testTable
$ /opt/alluxio/bin/alluxio fs chown hadoop:hadoop /testTable
# Return to the login user (hadoop, as in the ssh command above)
$ exit
$ hive
CREATE DATABASE glue;
USE glue;
CREATE EXTERNAL TABLE test1 (userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 'alluxio:///testTable';
quit;
# Create Presto temp directory
$ sudo runuser -l alluxio -c "/opt/alluxio/bin/alluxio fs mkdir /tmp"
$ sudo runuser -l alluxio -c "/opt/alluxio/bin/alluxio fs chmod 777 /tmp"
$ presto-cli --catalog hive
USE glue;
INSERT INTO test1 VALUES (1, 24, 'F', 'Developer', '12345');
SELECT * FROM test1;

CUSTOMIZATION

Tuning of Alluxio properties can be done in a few different locations. Depending on which service needs tuning, EMR offers different ways of modifying the service settings/environment variables.

Any server-side configuration changes must be made in the alluxio-emr.sh bootstrap script. In the section that generates alluxio-site.properties, add a line that appends the required configuration to the bottom of the file. Options can also be passed as the 3rd argument to the bootstrap script (the value following -p in the create-cluster commands above), delimited by ‘;’ unless a different delimiter is given with -s.

Generic client-side properties can also be edited via the bootstrap script as mentioned above. This is mostly for the native client (CLI). Property changes for a specific service like Presto/Hive should be made in the respective configuration file, e.g. core-site.xml, hive.catalog.
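As an illustration of the kind of line-level change described above, the sketch below sets a single key in a properties file, replacing any earlier line for the same key. It is not taken from alluxio-emr.sh; set_property is a hypothetical helper, and the file path in the usage note assumes the default /opt/alluxio install location with the configuration under conf/.

```shell
# Hypothetical helper: set key=value in a properties file, replacing any
# existing line for the same key so the file keeps one value per key.
# Usage: set_property <file> <key> <value>
set_property() (
  file="$1"; key="$2"; value="$3"
  tmp="$file.tmp.$$"
  touch "$file"
  # Keep every line that does not start with "<key>=".
  awk -v k="$key=" 'index($0, k) != 1' "$file" > "$tmp"
  # Append the new value at the bottom of the file.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
)
```

For example, set_property /opt/alluxio/conf/alluxio-site.properties alluxio.user.file.writetype.default CACHE_THROUGH would leave exactly one line for that key. Changes made this way on a running node apply to that node only and are not carried over to new clusters, which is why the bootstrap script remains the place for permanent settings.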

Network considerations for remote data

Alluxio can be used to pull remote data from private data centers or remote regions into the cluster. Here are some recommendations and best practices to consider when connected to remote data on premises.

AWS Direct Connect – AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to AWS. Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.

AWS VPN – AWS Virtual Private Network (AWS VPN) lets you establish a secure and private encrypted tunnel from your network or device to the AWS global network. AWS VPN is comprised of two services: AWS Site-to-Site VPN and AWS Client VPN.

AWS Direct Connect Resiliency Recommendations – Amazon Web Services (AWS) offers customers the ability to achieve highly resilient network connections between Amazon Virtual Private Cloud (Amazon VPC) and their on-premises infrastructure. This capability extends customer access to AWS resources in a reliable, scalable, and cost-effective way. This document explains AWS best practices for ensuring high resiliency with AWS Direct Connect.

Multi data center HA network connectivity – This document includes best practices on how to make network connections highly available and how to best leverage redundant connections, especially when these connections support remote networks that are geographically dispersed.