Serving Structured Data in Alluxio: Example

March 11, 2020

In the previous article, I described the concept and design of the Structured Data Service in the Alluxio 2.1.0 release. This article will go through an example to demonstrate how it helps SQL and structured data workloads.

Alluxio 2.2.0 has been released since the previous article. I recommend using Alluxio 2.2.0 if you are trying out this service for the first time. This tutorial requires Presto and Hive to be configured together and running.

Step 1: Download and Set Up Alluxio

Download and Deploy Alluxio 2.2.0

Download the Alluxio 2.2.0 release and deploy Alluxio on your local computer. Detailed instructions can be found here. The following is a summary of the commands mentioned:

$ tar xf alluxio-2.2.0-bin.tar.gz
$ cd alluxio-2.2.0 # this directory corresponds to ${ALLUXIO_HOME}
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
$ echo "alluxio.master.hostname=localhost" >> conf/alluxio-site.properties
$ echo "alluxio.master.mount.table.root.ufs=/tmp" >> conf/alluxio-site.properties
$ ./bin/alluxio-mount.sh SudoMount
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local -f

Note that no additional configuration is needed to start the new Structured Data Service.
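The two echo commands above append to alluxio-site.properties every time they run, so repeating the setup leaves duplicate lines. The following is a minimal sketch of a way around that, assuming a bash-compatible shell with awk available. The helper name set_prop is hypothetical and not part of Alluxio.

```shell
# Hypothetical helper (not part of Alluxio): set key=value in a properties
# file, replacing any existing line for that key instead of appending again.
set_prop() {
  local file="$1" key="$2" value="$3"
  touch "$file"
  # Keep every line that does not start with "key=" (commented lines survive).
  awk -v k="${key}=" 'index($0, k) != 1' "$file" > "${file}.tmp"
  # Append the new setting, then swap the rewritten file into place.
  printf '%s=%s\n' "$key" "$value" >> "${file}.tmp"
  mv "${file}.tmp" "$file"
}

# Example usage, run from ${ALLUXIO_HOME}:
#   set_prop conf/alluxio-site.properties alluxio.master.hostname localhost
#   set_prop conf/alluxio-site.properties alluxio.master.mount.table.root.ufs /tmp
```

Running the helper twice with the same key leaves a single line for that key, which keeps the file readable when you repeat the tutorial.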

Install and Configure the Alluxio Presto Connector

The Alluxio Presto Connector is the client for Presto to access Alluxio’s Structured Data Service. In this developer preview version, we need to copy the connector manually to Presto.

This connector is bundled as part of the Alluxio 2.2.0 release in the directory ${ALLUXIO_HOME}/client/presto/plugins/. Copy the directory corresponding to the Presto version into Presto’s plugin directory.

$ cp -R ${ALLUXIO_HOME}/client/presto/plugins/presto-hive-alluxio-319/ \
${PRESTO_HOME}/plugin/hive-alluxio/

Once the connector is installed, it can be used to configure a Presto catalog. Add a new catalog configuration to Presto by creating the following file:

$ echo "connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=localhost:19998" > ${PRESTO_HOME}/etc/catalog/catalog_alluxio.properties

Restart the Presto server for the connector and configuration to take effect.
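If you set this up on more than one machine, the catalog file can also be written with a here-document, which avoids the multi-line quoted echo. This is only a sketch: the function name write_alluxio_catalog is hypothetical, and the default master address is the localhost:19998 value used above.

```shell
# Hypothetical helper: write the Alluxio catalog file under a Presto home.
# $1 = Presto home directory, $2 = Alluxio master address (optional).
write_alluxio_catalog() {
  local presto_home="$1" master="${2:-localhost:19998}"
  local catalog_dir="${presto_home}/etc/catalog"
  # Create the catalog directory if this Presto install does not have one yet.
  mkdir -p "$catalog_dir"
  cat > "${catalog_dir}/catalog_alluxio.properties" <<EOF
connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=${master}
EOF
}

# Example usage:
#   write_alluxio_catalog "${PRESTO_HOME}"
```

The file name catalog_alluxio.properties determines the catalog name, catalog_alluxio, that the Presto CLI uses later in this tutorial.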

Step 2: Attach a Hive Metastore to Alluxio Catalog Service

The Alluxio Catalog Service manages the metadata of structured data components such as databases, tables, and schemas. It also tracks the location of the stored data. This developer preview version supports attaching a Hive Metastore to the Alluxio Catalog Service as an UnderDatabase, which is an abstraction of external catalogs and databases.

To attach the Hive Metastore to the Alluxio Catalog Service, use the “attachdb” command:

$ ./bin/alluxio table attachdb hive thrift://localhost:9083 hive_db_name
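After attaching, it can help to confirm that the database is visible in the catalog. The sketch below wraps the attachdb command above and then lists the tables of the attached database. It assumes the table ls subcommand is available in your Alluxio release. The function name attach_and_list and the ALLUXIO_BIN override are hypothetical conveniences, not Alluxio features.

```shell
# Hypothetical wrapper: attach a Hive database, then list its tables.
# $1 = Hive Metastore URI, $2 = database name in Hive.
# ALLUXIO_BIN can point at a different alluxio launcher (assumed override).
attach_and_list() {
  local metastore_uri="$1" db_name="$2"
  local alluxio_bin="${ALLUXIO_BIN:-./bin/alluxio}"
  # Stop early if the attach step fails, for example if the metastore is down.
  "$alluxio_bin" table attachdb hive "$metastore_uri" "$db_name" || return 1
  # Assumption: "table ls <db>" prints the tables of an attached database.
  "$alluxio_bin" table ls "$db_name"
}

# Example usage, run from ${ALLUXIO_HOME}:
#   attach_and_list thrift://localhost:9083 hive_db_name
```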

Step 3: Use Alluxio Structured Data Management with Presto

Once a database is attached, the catalog service can be used from Presto. Start the Presto CLI with the Alluxio catalog:

$ presto --catalog catalog_alluxio

Any queries run within this CLI will access the Alluxio Catalog Service via the provided connector. The Alluxio Catalog Service will automatically serve the table information from the Hive Metastore, while transparently using the Alluxio mounted locations.
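For a quick check without opening an interactive session, a query can be passed to the Presto CLI with its --execute option. The sketch below assumes the presto launcher is on your PATH. The function name presto_alluxio_query and the PRESTO_CLI override are hypothetical, and the schema name in the example is the hive_db_name placeholder used earlier.

```shell
# Hypothetical wrapper: run one SQL statement against the Alluxio catalog.
# $1 = SQL text. PRESTO_CLI can point at a different launcher (assumed override).
presto_alluxio_query() {
  local sql="$1"
  local presto_cli="${PRESTO_CLI:-presto}"
  # --catalog selects the catalog defined by catalog_alluxio.properties;
  # --execute runs the statement and exits instead of opening a prompt.
  "$presto_cli" --catalog catalog_alluxio --execute "$sql"
}

# Example usage:
#   presto_alluxio_query "SHOW SCHEMAS"
#   presto_alluxio_query "SHOW TABLES FROM hive_db_name"
```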

Transform a Table

Data transformation is a key benefit of working with structured data in Alluxio, particularly when the underlying files that make up a table are not stored in a compute-optimized fashion. If the files are in CSV format or the table is split among many small files, the Alluxio Transformation Service can convert the format to Parquet or combine multiple small files into larger files.

To transform the test table in Hive:

$ ./bin/alluxio table transform hive_db_name test_table

For more on data transformations, see the documentation here.

Try it out!

Alluxio Structured Data Management is an exciting new effort that provides further benefits for SQL frameworks. Get started with the Alluxio Structured Data Service with Presto, and share feature requests or report issues in the Alluxio GitHub repository. On behalf of the entire Alluxio open source community, I invite you to ask questions in our community Slack channel whenever you run into problems.
