Alluxio AWS EMR Bootstrap Integration Details


AWS EMR bootstrap actions provide an easy and flexible way to integrate Alluxio with various frameworks. Alluxio improves data locality and accessibility for major compute frameworks such as Spark, Hive, and Presto running on S3. You can use a bootstrap action to install Alluxio and customize the configuration of cluster instances. Bootstrap actions are scripts that run on the cluster after Amazon EMR launches each instance using the Amazon Linux Amazon Machine Image (AMI). They run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data. If you add nodes to a running cluster, bootstrap actions also run on those nodes in the same way. For details on AWS EMR and bootstrapping, refer to the AWS documentation.

The bootstrap script for Alluxio is available in the Alluxio public S3 bucket at s3://alluxio-public/emr/2.0.1/alluxio-emr.sh, along with the configuration file for Spark, Hive, and Presto at s3://alluxio-public/emr/2.0.1/alluxio-emr.json.


Bootstrap options

A bootstrap script can include a range of options that get passed to EMR as a part of the create-cluster command.

#> aws emr create-cluster --release-label emr-5.23.0 --bootstrap-actions Path="s3://mybucket/filename",Args=[arg1,arg2]...

The Path argument points to the location of the bootstrap script, and Args lists the arguments passed to that script.

The bootstrap script that Alluxio provides accepts the following options.

Usage:

alluxio-emr.sh <root-ufs-uri>
  [-b <backup_uri>]
  [-d <alluxio-download-uri>]
  [-f <file_uri>]
  [-i <journal_backup_uri>]
  [-p <delimited_properties>]
  [-s <property_delimiter>]

By default, if the environment in which this script runs does not already contain an Alluxio installation at /opt/alluxio, the script downloads, untars, and configures Alluxio at /opt/alluxio. If an Alluxio installation already exists at /opt/alluxio, nothing is installed over it, even if -d is specified.

If a different Alluxio version is desired, see the -d option.
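
The install-or-skip behavior described above can be sketched in shell as follows. This is a minimal illustration and not the actual contents of alluxio-emr.sh; the function name install_decision is hypothetical, and only the /opt/alluxio path comes from the text.

```shell
#!/bin/sh
# Hypothetical sketch: decide whether an install step is needed.
# /opt/alluxio is the install path named in the text above.
ALLUXIO_HOME="${ALLUXIO_HOME:-/opt/alluxio}"

# install_decision <dir>
# Prints "skip" when <dir> already exists, otherwise "install".
install_decision() {
  if [ -d "$1" ]; then
    echo "skip"      # an existing install is never overwritten, even with -d
  else
    echo "install"   # download, untar, and configure would happen here
  fi
}

install_decision "$ALLUXIO_HOME"
```

The check depends only on whether the directory exists, which mirrors the statement that -d has no effect once /opt/alluxio is present.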

Option
    Description

<root-ufs-uri>
    (Required) The URI of the root UFS in the Alluxio namespace.

-b
    An s3:// URI that the Alluxio master writes a backup to upon shutdown of the EMR cluster. The backup and upload MUST finish within 60 seconds. If the backup cannot finish within 60 seconds, an incomplete journal may be uploaded.

-d
    An s3:// or http(s):// URI that points to an Alluxio tarball. The script downloads and untars the tarball and installs Alluxio at /opt/alluxio if an Alluxio installation does not already exist at that location.

-f
    An s3:// or http(s):// URI to any remote file. This option can be specified multiple times. Each file specified through this option is downloaded and stored under the same name in /opt/alluxio/conf/. This can be used to download and copy configuration files for Alluxio.

-i
    An s3:// or http(s):// URI of a previous Alluxio journal backup. If supplied, the backup is downloaded, and upon Alluxio startup the Alluxio master reads and restores it.

-p
    A string containing a delimited set of properties to add to the ${ALLUXIO_HOME}/conf/alluxio-site.properties file. The default delimiter is a semicolon (";"). To use a different delimiter, pass the -s argument.

-s
    A string containing a single character that is used as the delimiter to split the Alluxio properties provided in the -p argument.
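
To illustrate how -p and -s interact, the sketch below splits a delimited property string into one property per line, the form in which the properties would appear in alluxio-site.properties. It shows the general technique only and is not the script's actual implementation; split_props is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical sketch of splitting a -p value on the -s delimiter.
# split_props <delimited_properties> [delimiter]
# The delimiter defaults to ";", matching the documented default.
split_props() {
  old_ifs="$IFS"
  IFS="${2:-;}"     # field separator used for the word splitting below
  set -f            # disable globbing so property values are left untouched
  for prop in $1; do
    printf '%s\n' "$prop"
  done
  set +f
  IFS="$old_ifs"
}

split_props "alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=CACHE_THROUGH" "|"
# prints:
# alluxio.user.block.size.bytes.default=122M
# alluxio.user.file.writetype.default=CACHE_THROUGH
```

Appending this output to a properties file would then be a matter of redirecting it, for example with >> "${ALLUXIO_HOME}/conf/alluxio-site.properties".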

EXAMPLE

The following example command shows how these options can be used:

aws emr create-cluster --release-label emr-5.23.0 \
--instance-type r4.4xlarge \
--instance-count 3 \
--applications Name=Spark Name=Presto Name=Hive \
--name Test-Alluxio \
--bootstrap-actions \
Path=s3://alluxio-public/emr/2.0.1/alluxio-emr.sh,\
Args=[s3://sb-test/emr/mount/,\
-d,https://downloads.alluxio.io/downloads/files/2.0.0/alluxio-2.0.0-bin.tar.gz,\
-f,s3://sb-test/alluxio-site.properties,\
-b,s3://sb-test/my-journal-location,\
-i,s3://sb-test/my-old-journal-location/alluxio-backup-2019-08-27-1566939401342.gz,\
-p,"alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=CACHE_THROUGH",\
-s,"|"] \
--configurations https://alluxio-public.s3.amazonaws.com/emr/2.0.1/alluxio-emr.json \
--ec2-attributes KeyName=admin-key \
--log-uri s3://sb-test/emr/bootstrap-logs

Parameter
    Description

Path=s3://alluxio-public/emr/2.0.1/alluxio-emr.sh
    Path to the Alluxio bootstrap script.

s3://sb-test/emr/mount/
    The mandatory parameter that points to the root UFS URI for Alluxio.

-d,https://downloads.alluxio.io/downloads/files/2.0.0/alluxio-2.0.0-bin.tar.gz
    Download location for Alluxio, specified with -d. If no -d parameter is provided and the AMI does not have Alluxio installed, Alluxio CE 2.0.1 is downloaded and installed.

-f,s3://sb-test/alluxio-site.properties
    The -f option copies the file s3://sb-test/alluxio-site.properties to ${ALLUXIO_HOME}/conf/alluxio-site.properties. The same can be done for other configuration files.

-b,s3://sb-test/my-journal-location
    Saves a backup of the Alluxio journal in tar.gz format to this location when the EMR cluster is shut down cleanly.

-i,s3://sb-test/my-old-journal-location/alluxio-backup-2019-08-27-1566939401342.gz
    alluxio-backup-2019-08-27-1566939401342.gz is a backup that was taken with the -b option. It is used to restore Alluxio metadata when the Alluxio cluster is instantiated.

-p,"alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=CACHE_THROUGH"
    With the -p option, the properties alluxio.user.block.size.bytes.default=122M and alluxio.user.file.writetype.default=CACHE_THROUGH are added to the ${ALLUXIO_HOME}/conf/alluxio-site.properties file. This allows properties to be changed from their defaults.

-s,"|"
    Changes the separator used to define multiple properties with the -p option. This example uses "|". The default is ";".

Note: The AWS CLI does not correctly escape quoted commas, so do not use "," as a separator.
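
Because a comma cannot be used as the separator, a wrapper that assembles the Args value may want to reject it up front. The following is a sketch under that assumption; build_args is a hypothetical helper, not part of the AWS CLI or the Alluxio script, and it covers only the root UFS URI, -p, and -s.

```shell
#!/bin/sh
# Hypothetical helper that assembles the Args=[...] value for --bootstrap-actions.
# build_args <root-ufs-uri> <delimited_properties> <separator>
build_args() {
  root_ufs="$1"
  props="$2"
  sep="$3"
  # Commas already separate the Args entries, so refuse a comma separator.
  if [ "$sep" = "," ]; then
    echo "error: ',' cannot be used as the property separator" >&2
    return 1
  fi
  printf 'Args=[%s,-p,"%s",-s,"%s"]\n' "$root_ufs" "$props" "$sep"
}

build_args "s3://sb-test/emr/mount/" \
  "alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=CACHE_THROUGH" \
  "|"
```

The printed string has the same shape as the Args value in the example command and could be pasted after the Path= entry.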