Alluxio Sandbox


Now that you have downloaded and tried Alluxio on a single node, test drive it for 24 hours in a real-world cluster environment as part of a standard data stack, with Apache Spark as the compute framework and Amazon S3 as the data storage layer.


Prerequisites

You should be comfortable using a terminal and running bash commands to interact with the sandbox. A PEM key file and SSH instructions for accessing the sandbox will be emailed to you; these instructions assume you are working on a UNIX-based system. Please take a look at the README to learn about the sandbox.



Sandbox host

After you successfully request a sandbox, you will receive an email with instructions to SSH into an EC2 instance. This instance is the sandbox host and contains a binary for issuing commands to the sandbox cluster. The binary allows you to launch, run tests on, and destroy the sandbox cluster. After the sandbox cluster is launched, you can SSH into the cluster to inspect how Alluxio and other services are set up, edit configurations, and run commands. The cluster can be destroyed and relaunched any number of times.

The cost of hosting these instances on EC2 is covered by Alluxio. The sandbox will be reclaimed after 24 hours; please note the exact time in the email. There is also a limit on the number of sandbox requests each user can make.


Sandbox cluster

The cluster consists of 2 master nodes and 4 worker nodes. All 6 EC2 instances are of type r4.2xlarge, each with 8 vCPUs and 61 GB of memory. The operating system is CentOS 7.

The cluster installs Alluxio with an S3 bucket as its root under file storage. Alluxio is configured for high availability with 2 masters; Hadoop and ZooKeeper are installed to support high availability mode.
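
As a rough illustration of what such a setup involves, the sketch below renders the few alluxio-site.properties entries commonly used for an S3 root under file storage with ZooKeeper-based high availability. Everything passed in is a placeholder: the bucket name, the ZooKeeper hosts, and the journal location are hypothetical, and placing the shared journal on HDFS is an assumption based on Hadoop being installed. The root storage property name follows Alluxio 2.x (older releases use alluxio.underfs.address instead), so the sandbox's actual configuration may differ; inspect it on the cluster to see the real values.

```python
def render_site_properties(bucket, zk_hosts, journal_uri):
    """Render illustrative alluxio-site.properties lines.

    All arguments are placeholders; they are not the sandbox's real values.
    """
    props = {
        # Root under file storage (Alluxio 2.x property name).
        "alluxio.master.mount.table.root.ufs": f"s3://{bucket}/",
        # ZooKeeper-based leader election between the masters.
        "alluxio.zookeeper.enabled": "true",
        # 2181 is ZooKeeper's default client port.
        "alluxio.zookeeper.address": ",".join(f"{h}:2181" for h in zk_hosts),
        # Shared journal location (assumed here to live on HDFS).
        "alluxio.master.journal.folder": journal_uri,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    print(render_site_properties(
        bucket="example-bucket",
        zk_hosts=["zk-host-1", "zk-host-2"],
        journal_uri="hdfs://namenode:9000/alluxio/journal",
    ))
```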

The TPC-DS benchmark suite is installed for running performance tests. Spark is installed as the compute framework to which TPC-DS submits its jobs. TPC-DS runs at a scale factor of 100, which corresponds to a dataset size of 26 GB. The individual benchmarks, each identified by its index, are grouped by usage scenario, and results are reported as an aggregate for each scenario.

TPC-DS runs Spark jobs on two different data architectures to demonstrate the performance benefits of Alluxio:
  – Spark directly processes data stored on S3
  – Spark accesses data through Alluxio, whose under file storage is S3
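
From a Spark job's point of view, the two architectures differ mainly in the URI used to address the data. The sketch below shows that difference under stated assumptions: the bucket name, master hostname, and /tpcds directory layout are hypothetical, and dataset_uri is an illustrative helper, not part of the sandbox. Port 19998 is Alluxio's default master RPC port. With a high availability deployment, clients are usually pointed at ZooKeeper rather than a single master, so the exact URI form on the sandbox may differ.

```python
def dataset_uri(table, via_alluxio, bucket="example-bucket",
                master="alluxio-master", port=19998):
    """Return the URI for a TPC-DS table on one of the two data paths.

    All defaults are placeholders chosen for illustration.
    """
    if via_alluxio:
        # Spark talks to Alluxio, which fetches from S3 as needed.
        return f"alluxio://{master}:{port}/tpcds/{table}"
    # Spark reads from S3 directly through the s3a connector.
    return f"s3a://{bucket}/tpcds/{table}"


if __name__ == "__main__":
    for via_alluxio in (False, True):
        print(dataset_uri("store_sales", via_alluxio))
```

In a Spark session, either URI could be passed to a reader call such as spark.read.parquet(...), assuming the data is stored as Parquet and the Alluxio client jar is on Spark's classpath.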