In the previous article, I described the concept and design of the Structured Data Service in the Alluxio 2.1.0 release. This article will go through an example to demonstrate how it helps SQL and structured data workloads.
Alluxio 2.2.0 has been released since the previous article, and I recommend using 2.2.0 if you are trying out this service for the first time. This tutorial requires Presto and Hive to be configured together and running.
Step 1: Download and Set Up Alluxio
Download and Deploy Alluxio 2.2.0
Download the Alluxio 2.2.0 release and deploy Alluxio on your local computer. Detailed instructions can be found here. The following is a summary of the commands mentioned:
$ tar xf alluxio-2.2.0-bin.tar.gz
$ cd alluxio-2.2.0 # this directory corresponds to ${ALLUXIO_HOME}
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
$ echo "alluxio.master.hostname=localhost" >> conf/alluxio-site.properties
$ echo "alluxio.master.mount.table.root.ufs=/tmp" >> conf/alluxio-site.properties
$ ./bin/alluxio-mount.sh SudoMount
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local -f
Note that no additional configuration is needed to start the new Structured Data Service.
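The echo commands above append to conf/alluxio-site.properties, so re-running them leaves duplicate entries. A minimal sketch of an idempotent alternative follows; set_property is a hypothetical helper written for this tutorial, not part of Alluxio.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Alluxio): set key=value in a
# properties file, replacing any existing entry for the same key.
set_property() {
  local file="$1" key="$2" value="$3"
  local tmp="${file}.tmp"
  touch "$file"
  # Keep every line that does not start with "key=".
  awk -v k="$key" 'index($0, (k "=")) != 1' "$file" > "$tmp"
  # Append the new value and swap the file into place.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}
```

For example, from ${ALLUXIO_HOME}: set_property conf/alluxio-site.properties alluxio.master.hostname localhost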
Install and Configure the Alluxio Presto Connector
The Alluxio Presto Connector is the client for Presto to access Alluxio’s Structured Data Service. In this developer preview version, the connector must be copied manually into Presto. It is bundled with the Alluxio 2.2.0 release in the directory ${ALLUXIO_HOME}/client/presto/plugins/. Copy the directory corresponding to your Presto version into Presto’s plugin directory:
$ cp -R ${ALLUXIO_HOME}/client/presto/plugins/presto-hive-alluxio-319/ \
${PRESTO_HOME}/plugin/hive-alluxio/
Once the connector is installed, it can be used to configure a Presto catalog. Add a new catalog configuration to Presto by creating the following file:
$ echo "connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=localhost:19998" > ${PRESTO_HOME}/etc/catalog/catalog_alluxio.properties
Restart the Presto server for the connector and configuration to take effect.
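If you prefer not to embed newlines in an echo string, the same catalog file can be written with a here-document. This is a sketch under assumptions: write_alluxio_catalog is a hypothetical helper name, and the default master address is simply the localhost:19998 value used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the Alluxio catalog file for Presto.
#   $1 = Presto installation directory (PRESTO_HOME)
#   $2 = Alluxio master address (defaults to the tutorial's localhost:19998)
write_alluxio_catalog() {
  local presto_home="$1"
  local master_address="${2:-localhost:19998}"
  local catalog_dir="${presto_home}/etc/catalog"
  mkdir -p "$catalog_dir"
  # The three properties are the same ones used in the tutorial.
  cat > "${catalog_dir}/catalog_alluxio.properties" <<EOF
connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=${master_address}
EOF
}
```

Calling write_alluxio_catalog "${PRESTO_HOME}" produces the same file as the echo command; Presto still needs a restart afterwards.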
Step 2: Attach a Hive Metastore to the Alluxio Catalog Service
The Alluxio Catalog Service manages the metadata of structured data components such as databases, tables, and schemas. It also tracks the location of the stored data. This developer preview version supports attaching a Hive Metastore as an UnderDatabase, which is an abstraction of other external catalogs and databases, to the Alluxio Catalog Service.
To attach the Hive Metastore to the Alluxio Catalog Service, use the “attachdb” command, passing the UnderDatabase type (hive), the Hive Metastore Thrift URI, and the name of the Hive database:
$ ./bin/alluxio table attachdb hive thrift://localhost:9083 hive_db_name
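To confirm that the attach succeeded, one option is to list the attached databases with the table ls subcommand. The sketch below wraps both steps; attach_and_verify and the ALLUXIO_BIN variable are hypothetical names, and the check assumes that table ls prints one database name per line.

```shell
#!/usr/bin/env bash
# ALLUXIO_BIN is an assumed variable; it defaults to the path used when
# running from ${ALLUXIO_HOME}.
ALLUXIO_BIN="${ALLUXIO_BIN:-./bin/alluxio}"

# Hypothetical helper: attach a Hive database, then check that it is listed.
#   $1 = Hive Metastore URI, e.g. thrift://localhost:9083
#   $2 = Hive database name
attach_and_verify() {
  local metastore_uri="$1" db_name="$2"
  "$ALLUXIO_BIN" table attachdb hive "$metastore_uri" "$db_name" || return 1
  # Assumption: "table ls" without arguments prints one attached database
  # name per line, so an exact whole-line match is enough.
  if "$ALLUXIO_BIN" table ls | grep -qxF -- "$db_name"; then
    echo "attached: $db_name"
  else
    echo "not listed: $db_name" >&2
    return 1
  fi
}
```

For example: attach_and_verify thrift://localhost:9083 hive_db_name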
Step 3: Use Alluxio Structured Data Management with Presto
Once a database is attached, the catalog service can be used from Presto. Start the Presto CLI with the Alluxio catalog:
$ presto --catalog catalog_alluxio
Any queries run within this CLI will access the Alluxio Catalog Service via the provided connector. The Alluxio Catalog Service automatically serves the table information from the Hive Metastore, while transparently using the Alluxio-mounted locations.
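Queries can also be run non-interactively with the Presto CLI's --execute option. The following sketch assumes the attached database is exposed as a Presto schema of the same name (hive_db_name here); run_alluxio_query and the PRESTO_CLI variable are hypothetical names.

```shell
#!/usr/bin/env bash
# PRESTO_CLI is an assumed variable pointing at the Presto CLI executable.
PRESTO_CLI="${PRESTO_CLI:-presto}"

# Hypothetical helper: run one SQL statement against the Alluxio catalog.
#   $1 = schema name (assumed to match the attached Hive database name)
#   $2 = SQL statement
run_alluxio_query() {
  local schema="$1" sql="$2"
  "$PRESTO_CLI" --catalog catalog_alluxio --schema "$schema" --execute "$sql"
}
```

For example, run_alluxio_query hive_db_name "SHOW TABLES" lists the tables served by the catalog, and run_alluxio_query hive_db_name "SELECT count(*) FROM test_table" reads a table through it.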
Transform a Table
Data transformation is a key benefit of working with structured data in Alluxio, particularly when the underlying files that make up a table are not stored in a compute-optimized fashion. If the files are in CSV format or the table is split among many small files, the Alluxio Transformation Service can convert the files to Parquet or coalesce multiple small files into larger ones.
To transform the test table in Hive:
$ ./bin/alluxio table transform hive_db_name test_table
For more on Data Transformations, see documentation here.
Try it out!
Alluxio Structured Data Management is an exciting new effort that provides further benefits for SQL frameworks. Get started with the Alluxio Structured Data Service with Presto, and let us know in the Alluxio GitHub repository if you have feedback on features or issues! On behalf of the entire Alluxio open source community, I invite you to ask questions in our community Slack channel whenever you run into problems.