
TUTORIAL: Configuring Alluxio in the cloud with on-prem HDFS
Outline
- Installing and setting up Alluxio
- Connecting app frameworks
- Mounting remote data stores
Step 1: Install and set up Alluxio
PREREQUISITES
- A single master node and one or more worker nodes.
- Passwordless SSH login to all nodes. You can add a public SSH key for the host into ~/.ssh/authorized_keys. See this tutorial for more details.
- A shared storage system to mount to Alluxio, accessible by all Alluxio nodes. This example uses HDFS.
SETUP
The following sections describe how to install and configure Alluxio with a single master in a cluster.
Step 2: Connect app frameworks
Applications running on Spark 1.1 or later can access an Alluxio cluster through its HDFS-compatible interface out of the box. With Alluxio as the data access layer, Spark applications can transparently access data in many different types and instances of persistent storage services (e.g., AWS S3 buckets, Azure Object Store buckets, remote HDFS deployments, etc.). Data can be actively fetched or transparently cached into Alluxio to speed up I/O, especially when the Spark deployment is remote from the data. In addition, Alluxio can help simplify the architecture by decoupling compute from physical storage. Because the real data path in the persistent under storage is hidden from Spark, a change to the under storage can be made independently of application logic; meanwhile, as a near-compute cache, Alluxio can still provide data locality to compute frameworks such as Spark.
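As a minimal sketch of the HDFS-compatible interface described above, the following PySpark snippet reads a text file through an alluxio:// URI. The master host, the file path, and both helper names are assumptions made for illustration; 19998 is the default Alluxio master RPC port. The sketch presumes that pyspark is installed and that the Alluxio client jar is already on Spark's classpath.

```python
def alluxio_uri(path, master_host="localhost", master_port=19998):
    """Build an alluxio:// URI for a path in the Alluxio namespace.

    master_host is a placeholder; 19998 is Alluxio's default master RPC port.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{master_host}:{master_port}{path}"


def count_lines(path, master_host="localhost", master_port=19998):
    """Count the lines of a text file read through Alluxio (hypothetical helper)."""
    # Imported lazily so alluxio_uri() stays usable without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("AlluxioRead").getOrCreate()
    try:
        # Spark treats the alluxio:// scheme like any other Hadoop-compatible file system.
        return spark.read.text(alluxio_uri(path, master_host, master_port)).count()
    finally:
        spark.stop()
```

For example, count_lines("/data/input.txt") would read alluxio://localhost:19998/data/input.txt, where /data/input.txt is a made-up path. The application code only sees the Alluxio path, so the under storage behind it can change without touching this code.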
PREREQUISITES
- Install 64-bit Java 8 Update 60 or higher (8u60+).
- Make sure that the Alluxio client jar is available. The client jar can be found at client/alluxio-1.8.1-client.jar under the directory extracted from the tarball downloaded from the Alluxio download page. Alternatively, advanced users can compile this client jar from the source code by following the instructions.
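One common way to make the client jar available to Spark is to add it to the driver and executor classpaths. The sketch below generates the two lines that would go into Spark's conf/spark-defaults.conf. The install location /opt/alluxio and the function name are assumptions for illustration; spark.driver.extraClassPath and spark.executor.extraClassPath are standard Spark properties.

```python
def spark_classpath_conf(client_jar):
    """Return spark-defaults.conf lines that put the Alluxio client jar on the classpath."""
    return [
        f"spark.driver.extraClassPath {client_jar}",
        f"spark.executor.extraClassPath {client_jar}",
    ]


if __name__ == "__main__":
    # /opt/alluxio is a hypothetical install directory; substitute your own.
    for line in spark_classpath_conf("/opt/alluxio/client/alluxio-1.8.1-client.jar"):
        print(line)
```

The jar must exist at the same path on every node that runs a Spark driver or executor.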
Step 3: Mount remote data stores
To configure Alluxio to use HDFS as under storage, you will need to modify the configuration file conf/alluxio-site.properties.
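As a rough sketch of what that change might look like, the helper below renders the two entries typically needed in conf/alluxio-site.properties. The property names alluxio.master.hostname and alluxio.underfs.address follow the Alluxio 1.x configuration and should be checked against the version you run. The host names, the NameNode port 8020, and the function name are assumptions for illustration.

```python
def render_site_properties(master_hostname, hdfs_namenode, hdfs_port=8020, ufs_path="/"):
    """Render alluxio-site.properties entries that point the under storage at HDFS.

    hdfs_port=8020 is a common NameNode RPC port, but it is an assumption here;
    use the port from fs.defaultFS in your cluster's core-site.xml.
    """
    props = {
        "alluxio.master.hostname": master_hostname,
        "alluxio.underfs.address": f"hdfs://{hdfs_namenode}:{hdfs_port}{ufs_path}",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Both host names are placeholders.
    print(render_site_properties("alluxio-master.example.com", "namenode.example.com"), end="")
```

The same file needs to be present on every Alluxio master and worker node so that all of them resolve the same under storage address.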
Additionally, ensure that every Alluxio master and worker node can reach the remote HDFS NameNode and DataNodes over the required ports.
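One simple way to verify this is to attempt a TCP connection from each Alluxio node to the HDFS hosts. The sketch below uses only the Python standard library. The function names are hypothetical, and the host name and port numbers in the usage note are placeholders; the actual ports depend on your Hadoop version and configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether host accepts TCP connections on it."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

For example, check_ports("namenode.example.com", [8020]) run from an Alluxio worker reports whether that node can reach a NameNode listening on port 8020. A successful TCP connection only shows that the port is open through firewalls and security groups; it does not confirm that HDFS authentication or permissions are set up correctly.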