Tutorial: Getting started with GCP Dataproc and Alluxio
5 MIN TUTORIAL
Overview
Google Cloud Dataproc is a managed, on-demand service for running Spark and Hadoop compute workloads. It manages the deployment of various Hadoop services and provides hooks into these services for customization. Aside from the performance benefits of caching, Alluxio also enables users to run compute workloads against on-premises storage or even a different cloud provider’s storage, e.g. AWS or Azure Blob Store.
Prerequisites
- Account with Cloud Dataproc API enabled
- A GCS Bucket
- gcloud CLI: Make sure that the CLI is set up with the necessary GCS interoperable storage access keys. Note: GCS interoperability should be enabled in the Interoperability tab of the GCS settings.
A GCS bucket is required as Alluxio’s Root Under File System (UFS) and to serve as the location for the bootstrap script. If required, the root UFS can be reconfigured to be HDFS or any other supported under store.
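As a rough sketch, the first two prerequisites could be checked from the command line as follows. The function name `prepare_prerequisites` and the bucket name are hypothetical placeholders; the sketch assumes `gcloud` and `gsutil` are installed and authenticated against the intended project, and it does not cover the interoperability access keys.

```shell
# Hypothetical placeholder; replace with your own bucket.
BUCKET="gs://my-alluxio-bucket"

prepare_prerequisites() {
  bucket="$1"
  # Enable the Cloud Dataproc API for the current project.
  gcloud services enable dataproc.googleapis.com || return 1
  # Create the bucket only if it cannot already be listed.
  if ! gsutil ls -b "$bucket" >/dev/null 2>&1; then
    gsutil mb "$bucket" || return 1
  fi
}

# Usage (not run automatically):
#   prepare_prerequisites "$BUCKET"
```

The function is only defined here, so nothing is changed in the project until it is called explicitly.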
Setup
Pick an Edition
You can use either the Alluxio Community Edition or the Enterprise Edition with Google Cloud Dataproc.
Next Steps
The status of the cluster deployment can be monitored using the CLI.
$ gcloud dataproc clusters list
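If you would rather wait for the cluster than re-run the list command by hand, a small polling loop is one option. This is only a sketch: `wait_for_cluster` and `POLL_SECONDS` are hypothetical names, and the cluster name and region must be supplied by you.

```shell
# Poll the cluster state until it is RUNNING (success) or ERROR (failure).
wait_for_cluster() {
  cluster="$1"
  region="$2"
  while :; do
    state=$(gcloud dataproc clusters describe "$cluster" \
      --region "$region" --format='value(status.state)')
    echo "cluster state: $state"
    case "$state" in
      RUNNING) return 0 ;;
      ERROR|"") return 1 ;;   # empty output means the describe call failed
    esac
    sleep "${POLL_SECONDS:-15}"
  done
}

# Usage (not run automatically):
#   wait_for_cluster <cluster_name> <region>
```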
Identify the instance name and SSH into this instance to test the deployment.
$ gcloud compute ssh <cluster_name>-m
Test that Alluxio is running as expected.
$ alluxio runTests
Alluxio is installed in /opt/alluxio/ by default. Spark, Hive and Presto are already configured to connect to Alluxio.
Note: The default Alluxio Worker memory is set to 1/3 of the physical memory on the instance. If a specific value is desired, set alluxio.worker.memory.size in the provided alluxio-site.properties or in the additional options argument.
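To get a feel for what the 1/3 default means on a given instance, the arithmetic can be sketched in the shell. The helper `one_third_mb` is a hypothetical name, the exact rounding Alluxio applies is not stated here, and the 4GB override is an arbitrary example value.

```shell
# Convert a memory total given in kB into one third of it, in whole MB.
one_third_mb() {
  echo $(( $1 / 1024 / 3 ))
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "one third of physical memory: $(one_third_mb "$total_kb")MB"
fi

# An explicit override line for alluxio-site.properties (example value):
echo "alluxio.worker.memory.size=4GB"
```

Run on a worker instance, the first line printed approximates the default; the second is the form of the line you would add to alluxio-site.properties to pin a specific size.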
Spark on Alluxio in Dataproc
The Alluxio bootstrap also takes care of setting up Spark for you. To run a Spark application accessing data from Alluxio, simply refer to the path as alluxio:///<path_to_file>. Follow the steps in our Alluxio on Spark documentation to get started.
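As a minimal sketch of what this looks like in practice, the helper below pipes a two-line Scala snippet into `spark-shell` to count the lines of a file stored in Alluxio. The function name `count_lines_in_alluxio` is hypothetical, and the sketch assumes it is run on the cluster's master node with a file already present at the given Alluxio path.

```shell
# Count the lines of a file in Alluxio using spark-shell.
# $1 is an absolute Alluxio path such as /Input, so the URI
# becomes alluxio:///Input.
count_lines_in_alluxio() {
  path="$1"
  spark-shell <<EOF
val lines = sc.textFile("alluxio://${path}")
println(lines.count())
EOF
}

# Usage (not run automatically):
#   count_lines_in_alluxio /Input
```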
Presto on Alluxio in Dataproc
The Alluxio initialization script configures Presto for Alluxio. If installing the optional Presto component, Presto must be installed before Alluxio. Initialization actions are executed sequentially, so the Presto action must precede the Alluxio action.
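The ordering requirement can be illustrated with the `--initialization-actions` flag of `gcloud dataproc clusters create`, which takes a comma-separated list of script URIs. In the sketch below, the cluster name, region, and both script locations are hypothetical placeholders, and any further arguments the Alluxio action needs are omitted. The command is only printed unless `DRY_RUN=0` is set.

```shell
# Hypothetical placeholders; substitute your own values.
CLUSTER_NAME="my-cluster"
REGION="us-central1"
PRESTO_ACTION="gs://my-bucket/presto.sh"             # assumed script location
ALLUXIO_ACTION="gs://my-bucket/alluxio-dataproc.sh"  # assumed script location

# Print the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_cluster() {
  # Actions run in the order listed, so Presto comes before Alluxio.
  run gcloud dataproc clusters create "$CLUSTER_NAME" \
    --region "$REGION" \
    --initialization-actions "${PRESTO_ACTION},${ALLUXIO_ACTION}"
}

create_cluster
```

Printing first makes it easy to confirm that the Presto action appears ahead of the Alluxio action before any cluster is created.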