TUTORIAL: GETTING STARTED WITH AWS EMR AND ALLUXIO AMI
5 min tutorial
AWS EMR provides flexible options for running clusters on demand to handle compute workloads. It manages the deployment of various Hadoop services and offers hooks into these services for customization. Alluxio can run on EMR to provide functionality beyond what EMRFS currently offers. Aside from the performance benefits of caching, Alluxio also enables users to run compute workloads against on-premises storage or even a different cloud provider’s storage, e.g. GCS or Azure Blob Store.
Prerequisites
- Account with AWS
- IAM Account with the default EMR Roles
- Key Pair for EC2
- An S3 Bucket
- AWS CLI: Make sure that the AWS CLI is installed and configured with the required AWS access key and secret key
Most of the prerequisites are covered in the AWS EMR Getting Started guide. An S3 bucket is needed to serve as Alluxio’s root Under File System (UFS). The bootstrap script and the cluster configuration files are available in a public S3 bucket. If required, the root UFS can be reconfigured to be HDFS.
BASIC SETUP
Integrating Alluxio Enterprise Edition with EMR is straightforward using the Alluxio AMI from the AWS Marketplace. The first step is to subscribe to the Marketplace AMI. The first 7 days are free; after that, billing is pay-as-you-go.
The bootstrap script installs Alluxio in /opt/alluxio/ by default. Hive and Presto are already configured to connect to Alluxio. The cluster also uses AWS Glue as the default metastore for both Presto and Hive, which lets you keep table definitions across multiple runs of the Alluxio cluster.
See the sample command below for reference.
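As a rough illustration of what such a command can look like, the sketch below assembles an `aws emr create-cluster` invocation as a Python argument list and prints it without running it. Everything specific is a placeholder or an assumption rather than a value from this tutorial: the bucket paths, key pair name, AMI ID, release label, instance settings, and the idea that the bootstrap script takes the root UFS URI as its argument. Substitute the values from your own account and from the Alluxio documentation.

```python
import shlex


def build_create_cluster_command(name, release_label, instance_type,
                                 instance_count, key_name, bootstrap_script,
                                 bootstrap_args, configurations,
                                 custom_ami_id=None):
    """Return the argv list for `aws emr create-cluster` (nothing is executed)."""
    cmd = [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--applications", "Name=Presto", "Name=Hive",
        "--use-default-roles",
        "--ec2-attributes", f"KeyName={key_name}",
        # Shorthand syntax: Path=<script>,Args=[arg1,arg2,...]
        "--bootstrap-actions",
        f"Path={bootstrap_script},Args=[{','.join(bootstrap_args)}]",
        "--configurations", configurations,
    ]
    if custom_ami_id:
        # Only needed when launching from a custom (e.g. Marketplace) AMI.
        cmd += ["--custom-ami-id", custom_ami_id]
    return cmd


# All values below are hypothetical placeholders.
command = build_create_cluster_command(
    name="alluxio-emr-demo",
    release_label="emr-5.25.0",
    instance_type="r4.4xlarge",
    instance_count=3,
    key_name="my-key-pair",
    bootstrap_script="s3://my-bucket/alluxio-emr.sh",
    bootstrap_args=["s3a://my-bucket/alluxio-root/"],  # assumed: root UFS URI
    configurations="file://./alluxio-emr.json",
    custom_ami_id="ami-0123456789abcdef0",
)

# Print a copy-pasteable shell command with each argument safely quoted.
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the command as a list keeps each argument separate, so it can later be passed to `subprocess.run(command, check=True)` without shell quoting problems once the placeholders are replaced.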
Note: The default Alluxio worker memory is set to one third of the instance's total memory. If the instance type has less than 20 GB of memory, make your own copy of the alluxio-emr.sh script and change the value there.
CREATING A TABLE
The simplest way to start using EMR with Alluxio is to create a table on Alluxio and query it using Presto or Hive.
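A minimal sketch of this step, assuming a pipe-delimited text file already sits under an Alluxio path. The helper below only generates the HiveQL text; the table name, column list, and the `/testTable` path are hypothetical examples, and the `alluxio:///` URI form assumes the cluster's Alluxio client configuration already points at the master.

```python
def create_table_statement(table, columns, alluxio_path, delimiter="|"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement backed by an Alluxio path.

    columns is a list of (name, hive_type) pairs; alluxio_path is an absolute
    path inside the Alluxio namespace, such as "/testTable".
    """
    column_sql = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_sql}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"LOCATION 'alluxio://{alluxio_path}';"
    )


# Hypothetical table definition, used for illustration only.
statement = create_table_statement(
    table="users",
    columns=[("userid", "INT"), ("age", "INT"), ("zipcode", "STRING")],
    alluxio_path="/testTable",
)
print(statement)
```

The printed statement can be pasted into a Hive session on the master node. Because Presto and Hive share the same metastore, the table can then be queried from `presto-cli --catalog hive --schema default` with an ordinary `SELECT`.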
CUSTOMIZATION
Alluxio properties can be tuned in several places. Depending on which service needs tuning, EMR offers different ways of modifying that service's settings and environment variables.
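One of the mechanisms EMR offers for this is a configurations JSON file, a list of objects that each name a classification (a service config file) and the properties to set in it. The sketch below generates such a file in Python. The two entries are illustrative assumptions, not the contents of the configuration file shipped with this tutorial: the first points Hive at the AWS Glue Data Catalog, and the second registers the Alluxio Hadoop file system class in `core-site`.

```python
import json


def classification(name, properties):
    """Return one EMR configuration object for the given classification."""
    return {"Classification": name, "Properties": dict(properties)}


# Illustrative entries only; adjust to the services you actually need to tune.
configurations = [
    classification("hive-site", {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory",
    }),
    classification("core-site", {
        "fs.alluxio.impl": "alluxio.hadoop.FileSystem",
    }),
]

# EMR expects a JSON array of classification objects.
config_json = json.dumps(configurations, indent=2)
print(config_json)
```

Saving the output to a file and passing it to `aws emr create-cluster` with `--configurations file://<path>` applies the settings when the cluster starts.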
Network considerations for remote data
Alluxio can be used to pull remote data from private data centers or remote regions into the cluster. Here are some recommendations and best practices to consider when connecting to remote on-premises data.
AWS Direct Connect – AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to AWS. Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.
AWS VPN – AWS Virtual Private Network (AWS VPN) lets you establish a secure and private encrypted tunnel from your network or device to the AWS global network. AWS VPN is comprised of two services: AWS Site-to-Site VPN and AWS Client VPN.
AWS Direct Connect Resiliency Recommendations – Amazon Web Services (AWS) offers customers the ability to achieve highly resilient network connections between Amazon Virtual Private Cloud (Amazon VPC) and their on-premises infrastructure. This capability extends customer access to AWS resources in a reliable, scalable, and cost-effective way. This document explains AWS best practices for ensuring high resiliency with AWS Direct Connect.
Multi data center HA network connectivity – This document includes best practices on how to make network connections highly available and how to best leverage redundant connections, especially when these connections support remote networks that are geographically dispersed.