TUTORIAL: Configuring Alluxio in the cloud with on-prem HDFS


Outline

  • Installing and setting up Alluxio
  • Connecting app frameworks
  • Mounting remote data stores

Step 1: Installation and setup of Alluxio

Prerequisites

  • A single master node and one or more worker nodes
  • Passwordless SSH login to all nodes. You can add the master's public SSH key to ~/.ssh/authorized_keys on each host (a minimal sketch follows this list).
  • A shared storage system to mount to Alluxio (accessible by all Alluxio nodes). For this example, HDFS.
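
For reference, one common way to set up passwordless SSH from the master to each worker is sketched below; <WORKER_HOSTNAME> is a placeholder for each worker's hostname, and the default key location is assumed.

# generate a key pair without a passphrase (skip if one already exists)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# append the public key to ~/.ssh/authorized_keys on the worker
ssh-copy-id <WORKER_HOSTNAME>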

SETUP

The following sections describe how to install and configure Alluxio with a single master in a cluster.

To deploy Alluxio in a cluster, first download the Alluxio tar file, and copy it to every node (master node, worker nodes). Extract the tarball to the same path on every node.

On the master node of the installation, create the conf/alluxio-site.properties configuration file from the template.

cp conf/alluxio-site.properties.template conf/alluxio-site.properties

This configuration file (conf/alluxio-site.properties) is read by all Alluxio masters and Alluxio workers. The configuration parameters which must be set are listed below (a combined example follows the list):

  • alluxio.master.hostname=<MASTER_HOSTNAME>
    • This is set to the hostname of the master node.
  • alluxio.underfs.address=<STORAGE_URI>
    • This is set to the URI of the shared storage system to mount to the Alluxio root. This shared storage system must be accessible by the master node and all worker nodes.
    • Examples: alluxio.underfs.address=hdfs://1.2.3.4:9000/alluxio/root/
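
Putting these together, a minimal conf/alluxio-site.properties might look like the following sketch; the hostname and URI are placeholders for this example.

# hostname of the Alluxio master node
alluxio.master.hostname=master-node.example.com
# HDFS path mounted as the Alluxio root
alluxio.underfs.address=hdfs://1.2.3.4:9000/alluxio/root/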

This is the minimal configuration to start Alluxio; additional configuration may be added. Since this configuration file is shared by the master and all the workers, it must be copied to every other Alluxio node. The simplest way to do this is to use the copyDir shell command on the master node. To do so, add the IP addresses or hostnames of all the worker nodes to the conf/workers file, then run:

./bin/alluxio copyDir conf/

This will copy the conf/ directory to all the workers specified in the conf/workers file. Once this command succeeds, all the Alluxio nodes will be correctly configured.
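
For reference, conf/workers is a plain list of worker hostnames or IP addresses, one per line. A hypothetical example:

worker1.example.com
worker2.example.com
worker3.example.com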

Before Alluxio can be started for the first time, the journal must be formatted. Formatting the journal will delete all metadata from Alluxio. However, the data in under storage will be untouched. On the master node, format Alluxio with the following command:

./bin/alluxio format

Before starting the Alluxio cluster, make sure the conf/workers file on the master node lists the hostnames of all the workers.

On the master node, start the Alluxio cluster with the following command:

./bin/alluxio-start.sh all SudoMount

This will start the master on the node you are running it on, and start all the workers on all the nodes specified in the conf/workers file.

To verify that Alluxio is running, visit http://<MASTER_HOSTNAME>:19999 to see the status page of the Alluxio master. Alluxio also comes with a simple program that writes and reads sample files in Alluxio. Run the sample program with:

./bin/alluxio runTests
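
If the tests pass, you can confirm the sample files were written by listing the Alluxio namespace with the Alluxio shell:

./bin/alluxio fs ls /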


Step 2: Connect app frameworks

Spark applications (version 1.1 or later) can access an Alluxio cluster through its HDFS-compatible interface out of the box. Using Alluxio as the data access layer, Spark applications can transparently access data in many different types and instances of persistent storage services (e.g., AWS S3 buckets, Azure object stores, remote HDFS deployments). Data can be actively fetched or transparently cached into Alluxio to speed up I/O, especially when the Spark deployment is remote from the data. In addition, Alluxio can help simplify the architecture by decoupling compute from physical storage. When the real data path in the persistent under storage is hidden from Spark, changes to the under storage can be made independently of application logic; meanwhile, as a near-compute cache, Alluxio can still provide data locality to compute frameworks like Spark.

PREREQUISITES

  • Install Java 8 Update 60 or higher (8u60+), 64-bit.
  • Make sure the Alluxio client jar is available. This client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar in the tarball downloaded from the Alluxio download page. Alternatively, advanced users can compile the client jar from the source code.

Distribute the Alluxio client jar across the nodes where Spark drivers or executors are running. Specifically, put the client jar on the same local path (e.g., /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar) on each node; one way to do this is sketched below.
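
For example, assuming the conf/workers file from Step 1 contains a plain list of hostnames and that <PATH_TO_ALLUXIO> is identical on every node, a simple scp loop distributes the jar:

# copy the client jar to the same path on every node listed in conf/workers
for host in $(cat conf/workers); do
  scp /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar ${host}:/<PATH_TO_ALLUXIO>/client/
done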

Add the Alluxio client jar to the classpath of Spark drivers and executors so that Spark applications can use it to read and write files in Alluxio. Specifically, add the following lines to spark/conf/spark-defaults.conf on every node running Spark.

spark.driver.extraClassPath   /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
spark.executor.extraClassPath /<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar
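
Alternatively, instead of editing spark-defaults.conf, the same settings can be passed per job on the spark-submit command line; <MAIN_CLASS> and <APP_JAR> are placeholders for your application:

spark-submit \
  --conf "spark.driver.extraClassPath=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar" \
  --conf "spark.executor.extraClassPath=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar" \
  --class <MAIN_CLASS> <APP_JAR>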

This section shows how to use Alluxio as input and output sources for your Spark applications.

Copy local data to the Alluxio file system. Put the file LICENSE into Alluxio, assuming you are in the Alluxio project directory:

bin/alluxio fs copyFromLocal LICENSE /Input

Run the following commands from spark-shell, assuming Alluxio Master is running on localhost:

> val s = sc.textFile("alluxio://<ALLUXIO_MASTER>:19998/Input")
> val double = s.map(line => line + line)
> double.saveAsTextFile("alluxio://<ALLUXIO_MASTER>:19998/Output")

Open your browser and check http://<ALLUXIO_MASTER>:19999/browse. There should be an output directory /Output which contains the doubled content of the input file Input.
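
You can also verify the result from the command line by listing the output directory and printing one of the part files Spark wrote (the part file name below is illustrative):

bin/alluxio fs ls /Output
bin/alluxio fs cat /Output/part-00000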


Step 3: Mount remote data stores

To configure Alluxio to use HDFS as under storage, you will need to modify the configuration file conf/alluxio-site.properties.
Additionally, ensure that Alluxio is able to remotely connect to HDFS over the necessary ports.

Edit the conf/alluxio-site.properties file to set the under storage address to the HDFS namenode address and the HDFS directory you want to mount to Alluxio. For example, the under storage address can be hdfs://localhost:9000 if you are running the HDFS namenode locally on the default port and mapping the HDFS root directory to Alluxio, or hdfs://localhost:9000/alluxio/data if only the HDFS directory /alluxio/data is mapped to Alluxio.

alluxio.underfs.address=hdfs://<NAMENODE_IP>:<PORT>

To configure Alluxio to work with HDFS namenodes in HA mode, you need to configure the Alluxio servers to access HDFS with the proper configuration files. Note that once this is set, applications using the Alluxio client do not need any special configuration.

There are two possible approaches:

  • Copy hdfs-site.xml and core-site.xml from your Hadoop installation into ${ALLUXIO_HOME}/conf, or make symbolic links to them there. Make sure this is set up on all servers running Alluxio.
  • Alternatively, you can set the property alluxio.underfs.hdfs.configuration in conf/alluxio-site.properties to point to your hdfs-site.xml and core-site.xml. Make sure this configuration is set on all servers running Alluxio.

alluxio.underfs.hdfs.configuration=/path/to/hdfs/conf/core-site.xml:/path/to/hdfs/conf/hdfs-site.xml

Set the under storage address to hdfs://nameservice/ (nameservice is the name of the HDFS service already configured in core-site.xml). To mount an HDFS subdirectory to Alluxio instead of the whole HDFS namespace, change the under storage address to something like hdfs://nameservice/alluxio/data.

alluxio.underfs.address=hdfs://nameservice/
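
For context, the nameservice in this URI must match the logical name defined in your Hadoop HA configuration. An abbreviated hdfs-site.xml fragment (the value is a placeholder) looks like:

<property>
  <name>dfs.nameservices</name>
  <value>nameservice</value>
</property>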

Alluxio supports POSIX-like file system user and permission checking. To keep the permission information of files and directories (user, group, and mode) consistent between Alluxio and HDFS (e.g., a file created in Alluxio by user Foo is persisted to HDFS with owner Foo as well), the user that starts the Alluxio master and worker processes must be either:

  1. The HDFS superuser. Namely, use the same user that starts the HDFS namenode process to also start the Alluxio master and worker processes.
  2. A member of the HDFS superuser group. Edit the HDFS configuration file hdfs-site.xml and check the value of the property dfs.permissions.superusergroup. If this property is set to a group (e.g., “hdfs”), add the user that starts the Alluxio processes (e.g., “alluxio”) to this group; if the property is not set, set it to a group that the Alluxio user belongs to (see the sketch after this list).
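
For example, assuming the Alluxio processes run as a user named alluxio and the superuser group is hdfs (both names are illustrative), the check and the group change might look like:

# print the configured HDFS superuser group
hdfs getconf -confKey dfs.permissions.superusergroup
# add the alluxio user to that group
sudo usermod -a -G hdfs alluxio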

The user set above is only the identity that starts the Alluxio master and worker processes. Once the Alluxio servers are started, there is no need to run your Alluxio client applications as this user.