This article introduces Structured Data Management (Developer Preview) available in the latest Alluxio 2.1.0 release, a new effort to provide further benefits to SQL and structured data workloads using Alluxio.
Today, many users deploy Alluxio in analytics or AI platforms to provide unified data access while transparently caching relevant data to accelerate data I/O. Whatever computation framework is used, Alluxio provides an abstraction over files, directories, and objects in a logical “Alluxio File System”.
However, widely used analytics engines such as Presto, Apache Spark SQL, and Apache Hive typically consume structured data as “tables” consisting of “rows” and “columns”, rather than as an “offset” and “length” within files or objects. This gap creates several challenges and inefficiencies. First, extra services and coordination are required to map “tables” and “partitions” to files or objects. Second, data is often stored in a storage-optimized layout rather than a compute-optimized one, which can hurt data access efficiency.
For example, a data analyst may run a SQL query on a Hive table that consists of thousands of small CSV text files, a layout that is inefficient for computation. However, the data may be ingested by an outdated service managed by a separate team, so changing the format of the stored data may not be feasible.
Our goal is to deliver physical data independence, where the logical access of data by the SQL engines is independent from the physical format of the stored data. Since Alluxio is the ecosystem layer between compute and storage, Alluxio is in a great position to bridge the gap between SQL engines and file or object-based storage systems, and enable physical data independence.
What Is Alluxio Structured Data Management?
Alluxio Structured Data Management is a new set of services that enables structured data applications to interact with data more efficiently. With Structured Data Management, Alluxio can expose the data to be effectively accessed by the SQL engines, independent of how and where the data is stored.
There are two major points of focus that drive the direction of Alluxio Structured Data Management:
- Provide structured data APIs, which focus on how SQL engines interact with data. This will introduce new APIs built around structured data concepts such as tables, schemas, rows, and columns.
- Cache logical data access, which focuses on caching what SQL engines want. In other words, Alluxio will cache compute-optimized data.
To achieve these goals, Alluxio Structured Data Management requires several major components:
- Structured Data Client: the client is the gateway for SQL engines to interact with the various components of Alluxio Structured Data Management.
- Structured Data Caching and Metadata: this component stores and caches compute-optimized data for SQL engines, and manages the metadata for the cached data. This makes Alluxio aware of the structure of the data, which enables schema-aware optimizations.
- Transformation Service: the transformation service is responsible for transforming existing data into a compute-optimized representation. This enables the physical data independence of compute-optimized data from storage-optimized data.
The Alluxio 2.1.0 release brings the initial implementations of these components in this developer preview. The primary use case for the developer preview is Presto using the Hive Metastore via the Hive connector. The developer preview of Alluxio Structured Data Management introduces several new components in the ecosystem:
- Structured Data Client for Presto (a Presto Connector for Alluxio)
- Catalog Service
- Basic Transformation Service
Presto Connector for Alluxio
A new Presto connector for Alluxio is available with the developer preview of Alluxio Structured Data Management. All the interactions with other Alluxio components go through the Alluxio connector. This allows easy integration and configuration of Alluxio with Presto.
The new Alluxio Catalog Service manages the metadata of structured data in the system. It is responsible for all the database, table, and schema information, as well as the location of all the stored data.
The major new concept in the catalog service is the UnderDatabase. The UnderDatabase is an abstraction of external catalogs and databases. This abstraction lets the Alluxio Catalog Service connect to different catalogs and gather information about structured data. The UnderDatabase abstraction is analogous to the UnderFilesystem abstraction for the Alluxio Filesystem. The developer preview includes a Hive Metastore implementation of the UnderDatabase.
The main way the user interacts with the catalog service is to “attach” a database to the catalog. Attaching a database associates an Alluxio catalog database with an existing catalog database. For example, attaching a Hive database named “hive_db” to the Alluxio catalog database named “alluxio_db” creates a connection between the two databases. From then on, whenever the Alluxio Catalog Service is accessed for the database “alluxio_db”, it represents the Hive database “hive_db”. Attaching an existing database to the catalog service is analogous to mounting an existing filesystem to the Alluxio filesystem.
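The attach relationship described above can be sketched as a small Python model. This is a conceptual illustration only, not the Alluxio API: the class names `UnderDatabase`, `InMemoryUdb`, and `Catalog`, and the methods `attach` and `list_tables`, are hypothetical names chosen for this sketch, and the external catalog is simulated with a dictionary rather than a real Hive Metastore.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class UnderDatabase(ABC):
    """Hypothetical stand-in for an external catalog such as a Hive Metastore."""

    @abstractmethod
    def list_tables(self, db_name: str) -> List[str]:
        """Return the table names of one database in the external catalog."""


class InMemoryUdb(UnderDatabase):
    """Toy external catalog backed by a dict; it does not talk to Hive."""

    def __init__(self, databases: Dict[str, List[str]]) -> None:
        self._databases = databases

    def list_tables(self, db_name: str) -> List[str]:
        return sorted(self._databases[db_name])


class Catalog:
    """Toy catalog that maps an Alluxio database name to an external database."""

    def __init__(self) -> None:
        self._attached: Dict[str, Tuple[UnderDatabase, str]] = {}

    def attach(self, alluxio_db: str, udb: UnderDatabase, udb_db: str) -> None:
        # Attaching only records the association; no data is copied.
        if alluxio_db in self._attached:
            raise ValueError(f"database already attached: {alluxio_db}")
        self._attached[alluxio_db] = (udb, udb_db)

    def list_tables(self, alluxio_db: str) -> List[str]:
        # Requests for the Alluxio database are answered from the attached one.
        udb, udb_db = self._attached[alluxio_db]
        return udb.list_tables(udb_db)


hive = InMemoryUdb({"hive_db": ["orders", "customers"]})
catalog = Catalog()
catalog.attach("alluxio_db", hive, "hive_db")
tables = catalog.list_tables("alluxio_db")
```

In this model, a lookup against “alluxio_db” is served from “hive_db”, in the same way that a path under an Alluxio mount point is served from the mounted filesystem.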
The Alluxio Catalog Service provides several benefits for the Presto with Alluxio environment. First of all, deploying Alluxio with Presto is simple: users attach an existing Hive database to the Alluxio Catalog Service, and then configure the Alluxio Presto connector to point to the Alluxio Catalog Service. There is no longer a need to change any table locations in the Hive Metastore, or to restart or reconfigure any Hive services.
The Alluxio Catalog Service also enables additional schema-aware optimizations for structured data. For example, once the Hive metastore is attached to the Alluxio Catalog Service, the catalog service will automatically mount the appropriate table locations, and automatically serve the table metadata with the Alluxio locations.
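As a rough illustration of the automatic mounting described above, the sketch below maps each table's storage location to an Alluxio path and returns the metadata a client would then see. The function name `mount_tables`, the `/catalog/<db>/tables/<table>` path layout, and the HDFS address are assumptions made for this example and may not match how the catalog service actually lays out its namespace.

```python
from typing import Dict, Tuple


def mount_tables(
    alluxio_db: str, table_locations: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (mount table, served locations) for the tables of one database.

    table_locations maps a table name to its location in the underlying
    storage, as it would be recorded in the external catalog.
    """
    mounts: Dict[str, str] = {}
    served: Dict[str, str] = {}
    for table, ufs_location in sorted(table_locations.items()):
        # Hypothetical path layout inside the Alluxio namespace.
        alluxio_path = f"/catalog/{alluxio_db}/tables/{table}"
        mounts[alluxio_path] = ufs_location          # Alluxio path -> storage
        served[table] = f"alluxio://{alluxio_path}"  # what the SQL engine sees
    return mounts, served


mounts, served = mount_tables(
    "alluxio_db",
    {"orders": "hdfs://namenode:8020/warehouse/hive_db.db/orders"},
)
```

The external catalog entry is left untouched here; only the location served to the SQL engine changes, which is why no table locations need to be edited in the Hive Metastore.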
The developer preview also includes the Alluxio Transformation Service. Its primary goal is to transform data into a compute-optimized representation that is independent of the storage-optimized format. This enables physical data independence.
The developer preview includes two types of transformations for tables: coalesce and format conversion.
- Coalesce: SQL engines typically process a table inefficiently when it is split across too many files. The coalesce transformation combines the data into fewer files, which gives it a more compute-optimized layout regardless of how it was initially stored.
- Format Conversion: Certain file formats are more efficient to read and process. For example, columnar and binary formats (like Parquet and ORC) are usually more efficient to process than raw text files. In this developer preview, the available format conversion is CSV to Parquet.
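The two transformations can be illustrated with a minimal Python sketch that works on in-memory CSV strings. It is not the Transformation Service implementation: `coalesce` and `to_columnar` are hypothetical helpers, and `to_columnar` only regroups values by column to show the idea behind a columnar layout. It does not write Parquet.

```python
import csv
import io
from typing import Dict, List


def coalesce(csv_files: List[str], num_outputs: int) -> List[str]:
    """Merge CSV files that share a header into at most num_outputs files."""
    if num_outputs < 1:
        raise ValueError("num_outputs must be at least 1")
    header = None
    rows: List[List[str]] = []
    for text in csv_files:
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("all files must share the same header")
        rows.extend(reader)
    if header is None:
        return []
    # Ceiling division: rows per output file (at least one).
    per_file = max(1, -(-len(rows) // num_outputs))
    outputs: List[str] = []
    for start in range(0, len(rows), per_file):
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:start + per_file])
        outputs.append(buf.getvalue())
    return outputs


def to_columnar(csv_text: str) -> Dict[str, List[str]]:
    """Regroup row-oriented CSV data so each column's values sit together."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns: Dict[str, List[str]] = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


small_files = ["id,city\n1,Oslo\n", "id,city\n2,Lima\n", "id,city\n3,Pune\n"]
merged = coalesce(small_files, num_outputs=2)  # three files become two
columns = to_columnar(merged[0])
```

Here three single-row files become two files, and the first merged file is regrouped by column. A query that reads only the `city` column then touches one contiguous list instead of scanning every row.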
We are excited to introduce the Developer Preview of Alluxio Structured Data Management in the Alluxio 2.1.0 release! The initial implementations of the major components are available with this developer preview. In the next article, we will go through a simple example step-by-step to illustrate how to use Structured Data Management in Alluxio.
If you are not sure about your use case, feel free to ask questions in our Alluxio community Slack channel.