Tutorial: Getting started with GCP Dataproc and Alluxio


5 MIN TUTORIAL

Overview

Google Cloud Dataproc is a managed, on-demand service for running Spark and Hadoop compute workloads. It manages the deployment of various Hadoop services and allows for hooks into these services for customization. Aside from the performance benefits of caching, Alluxio also enables users to run compute workloads against on-premise storage or even a different cloud provider’s storage, e.g. AWS S3 or Azure Blob Store.

Prerequisites

  • Account with Cloud Dataproc API enabled
  • A GCS Bucket
  • gcloud CLI: Make sure that the CLI is set up with the necessary GCS interoperable storage access keys (a minimal setup sketch follows below). Note: GCS interoperability should be enabled in the Interoperability tab of the GCS settings.

A GCS bucket is required as Alluxio’s root Under File System (UFS) and to serve as the location for the bootstrap script. If required, the root UFS can be reconfigured to be HDFS or any other supported under store.
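For reference, a minimal sketch of the prerequisite setup with the gcloud and gsutil CLIs; the bucket name and service account are placeholders, and HMAC keys can equally be created from the Interoperability tab mentioned above:

# Enable the Cloud Dataproc API for the active project
$ gcloud services enable dataproc.googleapis.com
# Create the GCS bucket used as the root UFS and bootstrap location
$ gsutil mb gs://<my-bucket>
# Create GCS interoperable (HMAC) access keys for a service account
$ gsutil hmac create <service-account-email>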

Setup

Pick an Edition

You can use either the Alluxio Community Edition or the Enterprise Edition with Google Cloud Dataproc.

Enterprise Edition

When creating a Dataproc cluster, the Alluxio Enterprise Edition can be installed using an initialization action.

To use the Alluxio Enterprise Edition, the first step is to request a Trial Edition License and download the Alluxio EE tarball.

The Alluxio initialization action is hosted in a publicly readable GCS location: gs://alluxio-public/enterprise-dataproc/2.0.1-1.0/alluxio-dataproc.sh.

  • Once you have the trial or full EE license, pass the base64-encoded license using the metadata key alluxio_license_base64.
  • Base64 encode the license using: $(cat license.json | base64 | tr -d "\n")
  • Host the Alluxio Enterprise Edition tarball in a private location and pass in that location using the metadata key alluxio_download_path, e.g. alluxio_download_path=gs://<my-bucket>/alluxio-enterprise-2.0.1-1.0-all.tar.gz (see the staging sketch after the command below).
  • The root UFS URI is a required argument, specified using the metadata key alluxio_root_ufs_uri.
  • Additional properties can be specified using the metadata key alluxio_site_properties, delimited using ;
$ gcloud dataproc clusters create <cluster_name> \
 --initialization-actions=gs://alluxio-public/enterprise-dataproc/2.0.1-1.0/alluxio-dataproc.sh \
 --metadata alluxio_root_ufs_uri=<uri>,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<gcs_access_key_id>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<gcs_secret_access_key>",alluxio_download_path=gs://<my-bucket>/alluxio-enterprise-2.0.1-1.0-all.tar.gz,alluxio_license_base64=$(cat license.json | base64 | tr -d "\n")
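For reference, a minimal sketch of staging the Enterprise Edition artifacts before running the create command above; the bucket and file names are placeholders:

# Host the downloaded EE tarball in a private bucket
$ gsutil cp alluxio-enterprise-2.0.1-1.0-all.tar.gz gs://<my-bucket>/
# Capture the base64-encoded license in a shell variable for the --metadata flag
$ ALLUXIO_LICENSE_BASE64=$(cat license.json | base64 | tr -d "\n")

The variable can then be passed as alluxio_license_base64=$ALLUXIO_LICENSE_BASE64 in the command above.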

Additional files can be downloaded into /opt/alluxio/conf using the metadata key alluxio_download_files_list by specifying http(s) or gs URIs delimited using ;

$ gcloud dataproc clusters create <cluster_name> \
--metadata alluxio_root_ufs_uri=<under_storage_address>,alluxio_download_files_list="gs://$my_bucket/$my_file;https://$server/$file"

Community Edition

When creating a Dataproc cluster, the Alluxio Community Edition can be installed using an initialization action.

The Alluxio initialization action is hosted in a publicly readable GCS location gs://alluxio-public/dataproc/2.0.1/alluxio-dataproc.sh.

  • The root UFS URI is a required argument, specified using the metadata key alluxio_root_ufs_uri.
  • Additional properties can be specified using the metadata key alluxio_site_properties, delimited using ;
$ gcloud dataproc clusters create <cluster_name> \
 --initialization-actions=gs://alluxio-public/dataproc/2.0.1/alluxio-dataproc.sh \
 --metadata alluxio_root_ufs_uri=<gs://my_bucket>,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<gcs_access_key_id>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<gcs_secret_access_key>"

Additional files can be downloaded into /opt/alluxio/conf using the metadata key alluxio_download_files_list by specifying http(s) or gs URIs delimited using ;

$ gcloud dataproc clusters create <cluster_name> \
--metadata alluxio_root_ufs_uri=<under_storage_address>,alluxio_download_files_list="gs://$my_bucket/$my_file;https://$server/$file"

Next Steps

The status of the cluster deployment can be monitored using the CLI.

$ gcloud dataproc clusters list
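
If you know the cluster name, the state of an individual cluster can also be inspected.

$ gcloud dataproc clusters describe <cluster_name>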

Identify the instance name and SSH into this instance to test the deployment.

$ gcloud compute ssh <cluster_name>-m 

Test that Alluxio is running as expected.

$ alluxio runTests

Alluxio is installed in /opt/alluxio/ by default. Spark, Hive and Presto are already configured to connect to Alluxio.
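
For a closer look, a couple of optional checks on the master node; these assume the default install path and that runTests completed (its test files typically land under /default_tests_files):

# Summarize master, worker, and capacity information
$ alluxio fsadmin report
# List the files written by the tests
$ alluxio fs ls /default_tests_files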

Note: The default Alluxio worker memory is set to 1/3 of the physical memory on the instance. If a specific value is desired, set alluxio.worker.memory.size in the provided alluxio-site.properties or via the alluxio_site_properties metadata key described above.
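
For example, a minimal sketch of overriding the worker memory at cluster creation time through the same metadata key used above; the 16GB value is illustrative:

$ gcloud dataproc clusters create <cluster_name> \
 --initialization-actions=gs://alluxio-public/dataproc/2.0.1/alluxio-dataproc.sh \
 --metadata alluxio_root_ufs_uri=<gs://my_bucket>,alluxio_site_properties="alluxio.worker.memory.size=16GB"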

Spark on Alluxio in Dataproc

The Alluxio bootstrap also takes care of setting up Spark for you. To run a Spark application accessing data from Alluxio, simply refer to the path as alluxio:///<path_to_file>. Follow the steps in our Alluxio on Spark documentation to get started.
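
As a quick sanity check, the sketch below copies a local file into Alluxio and reads it back from spark-shell on the master node; the file and destination path are illustrative, not required by the bootstrap:

# Copy a local file into the Alluxio namespace
$ alluxio fs copyFromLocal /opt/alluxio/LICENSE /spark_test.txt
# Read it back through the alluxio:// scheme
$ spark-shell
scala> sc.textFile("alluxio:///spark_test.txt").count()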

Presto on Alluxio in Dataproc

The Alluxio initialization script configures Presto for Alluxio. If installing the optional Presto component, Presto must be installed before Alluxio: initialization actions are executed sequentially, and the Presto action must precede the Alluxio action.
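
For example, a minimal sketch of the ordering; the Presto initialization action path is a placeholder for whatever script installs Presto, and it is listed before the Alluxio action:

$ gcloud dataproc clusters create <cluster_name> \
 --initialization-actions=gs://<path_to_presto_init>/presto.sh,gs://alluxio-public/dataproc/2.0.1/alluxio-dataproc.sh \
 --metadata alluxio_root_ufs_uri=<gs://my_bucket>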