TUTORIAL: GETTING STARTED WITH AWS EMR AND ALLUXIO AMI
5 min Tutorial
AWS EMR makes it easy to run on-demand clusters for compute workloads. It manages the deployment of various Hadoop services and provides hooks into these services for customization. Alluxio can run on EMR to provide functionality beyond what EMRFS currently offers. In addition to the performance benefits of caching, Alluxio lets users run compute workloads against on-premises storage or even a different cloud provider’s storage, e.g. GCS or Azure Blob Store.
PREREQUISITES
- An AWS account
- An IAM account with the default EMR roles
- A key pair for EC2
- An S3 bucket
- The AWS CLI, set up with the required AWS access key and secret key
Most of the prerequisites are covered in the AWS EMR Getting Started guide. An S3 bucket is needed as Alluxio’s root Under File System (UFS). The bootstrap script and the cluster configuration files are available in a public S3 bucket. If required, the root UFS can be reconfigured to be HDFS.
Alluxio Enterprise Edition can be integrated with EMR using an Alluxio AMI from the AWS Marketplace. The first step is to subscribe to the Marketplace AMI. The first 7 days are free; after that, you pay as you go.
The bootstrap script installs Alluxio in /opt/alluxio/ by default. Hive and Presto are already configured to connect to Alluxio. The cluster also uses AWS Glue as the default metastore for both Presto and Hive, which lets you keep table definitions across multiple runs of the Alluxio cluster.
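For readers who want to see what a Glue-backed metastore setup looks like in EMR terms, here is a minimal sketch. It assumes the generic EMR configuration classifications for Hive and Presto; the configuration file shipped with this tutorial may set these differently or include additional entries. The helper name render_configurations is hypothetical.

```python
import json

# Assumption: generic EMR classifications that point Hive and Presto at the
# AWS Glue Data Catalog. The tutorial's own configuration file may differ.
GLUE_METASTORE_CONFIG = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "presto-connector-hive",
        "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
    },
]


def render_configurations(configs):
    """Serialize a list of EMR configuration objects to JSON text."""
    return json.dumps(configs, indent=2)


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and passed to EMR.
    print(render_configurations(GLUE_METASTORE_CONFIG))
```

Saving the printed JSON to a file gives a configuration document in the shape EMR expects at cluster creation time.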
See the below sample command for reference.
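As a rough sketch, a cluster creation command along these lines can be assembled as follows. Everything passed in is a placeholder: the bootstrap script location, root UFS URI, configuration file reference, AMI ID, release label, instance type, and instance count are assumptions to be replaced with your own values. The sketch also assumes the bootstrap script takes the root UFS URI as its first argument. The function name build_create_cluster_command and the cluster name are hypothetical.

```python
import shlex


def build_create_cluster_command(
    key_name,
    bootstrap_script_uri,
    root_ufs_uri,
    configurations_ref,
    custom_ami_id=None,
    release_label="emr-5.25.0",   # placeholder, pick your own release
    instance_type="r4.4xlarge",   # placeholder, pick your own type
    instance_count=3,             # placeholder
):
    """Return the argument list for an `aws emr create-cluster` call."""
    command = [
        "aws", "emr", "create-cluster",
        "--name", "alluxio-emr-tutorial",  # hypothetical cluster name
        "--release-label", release_label,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--applications", "Name=Presto", "Name=Hive",
        "--use-default-roles",
        "--ec2-attributes", f"KeyName={key_name}",
        # Assumption: the script receives the root UFS URI as its argument.
        "--bootstrap-actions",
        f"Path={bootstrap_script_uri},Args=[{root_ufs_uri}]",
        "--configurations", configurations_ref,
    ]
    if custom_ami_id:
        # Only needed when launching from a Marketplace or custom AMI.
        command += ["--custom-ami-id", custom_ami_id]
    return command


if __name__ == "__main__":
    cmd = build_create_cluster_command(
        key_name="<your-key-pair>",
        bootstrap_script_uri="s3://<bucket>/<path>/alluxio-emr.sh",
        root_ufs_uri="s3://<your-bucket>/<mount-path>/",
        configurations_ref="file://<path>/configurations.json",
        custom_ami_id="<alluxio-ami-id>",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, which avoids quoting mistakes around the bracketed bootstrap arguments.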
Note: The default Alluxio worker memory is set to one third of the instance's total memory. If the instance type has less than 20 GB of memory, make your own copy of the alluxio-emr.sh script and change the value there.
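The sizing rule above is simple arithmetic, sketched here for convenience. The function names are hypothetical, and the 20 GB threshold is taken directly from the note; how the script itself computes or stores the value is not shown in this tutorial.

```python
def default_worker_memory_gb(total_memory_gb):
    """Default Alluxio worker memory: one third of the instance's memory."""
    return total_memory_gb / 3


def needs_custom_script(total_memory_gb, threshold_gb=20):
    """True when the instance is small enough that the note suggests
    editing your own copy of the bootstrap script."""
    return total_memory_gb < threshold_gb


if __name__ == "__main__":
    for total in (16, 30, 122):
        print(total, default_worker_memory_gb(total), needs_custom_script(total))
```

For example, an instance with 30 GB of memory would get 10 GB of worker memory by default, while a 16 GB instance falls under the threshold and calls for a customized script.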
CREATING A TABLE
The simplest way to start using EMR with Alluxio is to create a table on Alluxio and query it using Presto or Hive.
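A minimal sketch of such a table definition is shown below as a small helper that renders Hive DDL. The table name, columns, delimiter, and Alluxio path are all hypothetical; the only essential part is that the table's LOCATION uses the alluxio:// scheme so that Hive and Presto read the data through Alluxio.

```python
def create_table_ddl(table, columns, alluxio_path, delimiter="|"):
    """Render a Hive CREATE EXTERNAL TABLE statement located on Alluxio.

    columns is a list of (name, hive_type) pairs; alluxio_path is an
    absolute path in the Alluxio namespace, such as "/testTable".
    """
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION 'alluxio://{alluxio_path}';"
    )


if __name__ == "__main__":
    # Hypothetical table; run the printed statement in the Hive CLI.
    print(create_table_ddl(
        "test1",
        [("userid", "INT"), ("age", "INT"), ("zipcode", "STRING")],
        "/testTable",
    ))
```

Once the statement has been run in Hive, the same table can be queried from Presto through the shared metastore, for example with a plain SELECT against the hypothetical test1 table.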
Alluxio properties can be tuned in a few different places. Depending on which service needs tuning, EMR offers different ways to modify that service's settings and environment variables.