This article walks through the journey of a startup HashData in Beijing to build a cloud-native high-performance MPP shared-everything architecture leveraging object storage as the data persistence layer and Alluxio as a data orchestration layer in the cloud.
HashData was founded in 2016 by a group of open source data veterans from Pivotal, Teradata, IBM, Yahoo! and etc. Its flagship product, HashData Warehouse (HDW), is a data warehousing service built for the cloud with fully compatible analytics interfaces of the Greenplum. HDW’s unique architecture, multi-cluster shared data repository, delivers proven breakthroughs in performance, concurrency, elasticity, and simplicity. Different from the shared-nothing architecture adopted by many traditional MPP systems where compute is tightly coupled with storage, HDW employs the shared-everything architecture with decoupled and independent object storage.
The major challenge with this architecture is in performance. Compared to conventional block storage, object storage typically provides lower performance. In this article, we will illustrate how HDW leverages Alluxio as the data orchestration layer to eliminate the performance penalty introduced by object storage while benefiting from its scalability and cost-effectiveness.
Why Object Storage Service
Today Object Storage Service(OSS) is ever important in the cloud-native architecture. As reported in this blog post, it provides higher availability, elasticity, durability with much lower costs. More and more products & services support OSS as their persistent file system.
Based on our observation, for many customers with more than 100s of TB data, their bill is mostly from storage costs rather than computing costs. One major reason is that computing is fully on demand, while the storage capacity cannot be reclaimed before data is deleted. So, storage cost is an important factor when a customer is considering various data analytics options. The following table compares the price between block storage and object storage at the PEK3 region of QingCloud, a full-stack ICT service and solution (including Public Cloud Service) provider in China. The cost of OSS is about 1/4 of the one-replica block storage, and 1/5 of the multi-replica block storage.
|Block Storage(one-replica)||100G, 0.0792 RMB/hour||100G, 0.0792 RMB/hour|
|Block Storage(multi-replica)||100G, 0.099 RMB/hour||100G, 0.099 RMB/hour|
|QingStor(OSS)||500G, 0.1035 RMB/hour||100G, 0.0207 RMB/hour|
Elasticity also affects storage cost. Although the block storage supports online expansion, the Elastic Computing machine (to which the to-be-expanded block storage volume attaches) would lead to a short outage of the HDW cluster during the expansion. Thus, customers usually preserve extra block storage volume to avoid another expansion in a short time, resulting in a higher cost.
In summary, the storage cost of a OSS-based data warehouse solution can be about 1/10 the cost of employing the traditional block storage.
Let us look at a typical scenario in data analytics, especially in IoT and Telecom industries: As time goes by, more and more data is generated and loaded into the data warehouse system. Usually, these data need to be kept for 18 or 36 months, due to either application requirements or regulation policies. Besides, all the data must be queryable anytime, i.e., online analysis; on the other hand, most of the single queries (if not all of them) touch only a small portion of the whole data store (e.g., the data generated in the latest week or the latest month, or between Sep. 2016 and Oct. 2016).
In other words, the requirement of compute resource is stable, while that of storage resource is constantly increasing. Under the traditional shared-nothing architecture, to address this issue, we need to perform the cluster expansion operation (i.e., adding more worker nodes), because the volume of block storage mounted to an Elastic Computing machine has an upper limit. In such an expansion scenario, the newly added computing resource is totally a waste, let alone the fact that the cluster expansion is an error-prone and time-consuming procedure.
Contrary to the rigidity of traditional MPP solutions, the data storage capacity of a HDW cluster is approximately infinite. Storage layer expansion is totally transparent to the computation layers. Actually, from the perspective of a HDW user, there is no concept of storage expansion. Users can just keep loading data into a HDW cluster, and process the targeted data to support their business decisions.
As aforementioned, HDW inherits rich data analytics features from Greenplum Database. One of those amazing functionalities is to access native data on object storage through external tables. Compared to external tables, HDW’s OSS-based built-in tables have the following advantages:
- Built-in tables support update/delete and index operations. And most importantly, built-in tables fully support the ACID transaction. None of these features is supported by external tables.
- Built-in tables can achieve better data compression. Currently, external tables support three common data formats: TXT, CSV and ORC. Per our observations, customers usually prefer TXT/CSV with the ZIP compression algorithm to ORC for simplicity. With built-in tables, we can employ more sophisticated encoding strategies (for example, dictionary encoding and variable-length encoding) and compression algorithms (for example, zstd).
- Built-in tables have better query performance. The data layout of built-in tables, either in-memory or on-disk, is highly optimized for HDW’s execution engine.
Speed Up OSS Access Using ALLUXIO
Despite object storage provides a lot of benefits for big data systems, it has lower I/O performance compared to block storage. On one hand, as the technologies of object storage keep evolving, currently, most object storage services can provide over 100MB/s GET throughput and 40MB/s PUT throughput per HTTP connection. When enabling multi-threaded GET, Multi-Part Upload and data compression, in general, the overall I/O throughput can match the processing throughput of the database executor.
On the other hand, when OSS itself is not the bottleneck, in a public cloud environment, the aggregate bandwidth of the software-defined network (SDN) where an HDW cluster resides usually has an upper limit, which would restrict the I/O performance when the HDW cluster accesses to data on object storage. Secondly, customers should pay for every OSS HTTP request. Thirdly but not lastly, HDW provides full support of various index algorithms, including B-Tree, Bitmap, and GiST. Customers can leverage those index algorithms to accelerate point queries through index scan operations, and achieve a sub-second query time over trillions of tuples. Another scenario is querying over small dimension tables. For those cases, OSS throughput is not the issue but the latency.
To address the above issues, we chose Alluxio as the intelligent caching layer to accelerate I/O performance. Alluxio, formerly known as Tachyon, is the world’s first data orchestration technology for analytics and AI in the cloud. It bridges the gap between computation frameworks and storage systems, bringing data from the storage tier closer to the compute frameworks and makes it easily accessible enabling applications to connect to numerous storage systems through a common interface. Alluxio’s memory-first tiered architecture enables data access at speeds orders of magnitude faster than existing solutions.
With HDW and Alluxio, cached data is horizontally partitioned to each worker. Every HDW segment caches data to its local Alluxio worker. The Alluxio master node is deployed together with the HDW master node. All user data operations are performed through Alluxio interfaces. The high level storage architecture of HDW is demonstrated in the following figure.
Alluxio plays an important role in the Big Data ecosystem. Many open source big data systems share similar technology stacks, including Java-based or JVM-based and light-weighted transaction model, which make Alluxio a perfect match for them. However, there are still a large number of distributed systems built with totally different philosophy. In the following sections, we share our experiences learned when leveraging Alluxio to build a cloud native MPP database.
C/C++ Native Client
Alluxio provides several programming language bindings, including Java, Go and Python, and also supports REST APIs for all other languages through an HTTP Proxy. There is no official C/C++ native client currently, although the open source project named liballuxio provides C/C++ Alluxio APIs through calling the native Java client using JNI. To improve system performance, stability, flexibility, and maintenance, we decided to implement a native C/C++ client for Alluxio, rather than using liballuxio directly.
In HashData, we built a C/C++ client named liballuxio2 which interacts with the Alluxio server through RPC calls. The interactive procedure among different Alluxio components is based on thrift/protobuf/netty. As a native client, liballuxio2 employs the unified data interchange format of Alluxio by thrift/protobuf. Currently liballuxio2 supports Alluxio version 1.5.0.
Unified Object Storage SDK
Alluxio supports various object storage services, including S3, Azure Blob Store, GCS, AliOSS and etc., as the Under File System(UFS), and provides a framework through which users can easily extend it to support a new object storage. However, we took a different approach for the following reasons:
- Different object storage services support different access optimization strategies for higher READ/WRITE performance, including Multi-Part Upload and Pipelining Download. Building a C/C++ native OSS library is a better way to leverage those optimization opportunities.
- As a startup, HashData wants to support various object storage services provided by local public cloud providers, besides the aforementioned multi-national giants. A unified C/C++ OSS client library could greatly reduce R&D efforts, freeing developers from worrying about the complexity of heterogeneous object storage services.
- An Alluxio client can use the delegate/non-delegate mode to control where a UFS operation is processed. In the non-delegate mode, the Alluxio C/C++ native client is responsible for the OSS read/write operations. The Alluxio server only needs to maintain metadata and block based caching I/O streams.
We built a unified C/C++ native library to support object storage services of top ranking cloud platforms in the local and global market, and integrated it into the liballuxio2 library. Users can still choose the delegate mode to keep OSS operations performed by the Alluxio server.
Alluxio supports tiered storage in order to manage other storage types in addition to memory. Currently, three storage types/tiers, including MEM(Memory), SSD (Solid State Drives) and HDD (Hard Disk Drives), are supported.
In the beginning, we kept the default tiered setting, i.e., data was cached to the MEM tier first, and evicted to SSD/HDD once the upper tier was full. However, when we started to run stress tests with a large volume of data, the cached data could not be evicted if all files in the MEM tier are still open. For typical HDW usage workloads, the capacity of the MEM tier (to save cost, we usually don’t allocate too much memory for an EC machine), is too little to cache the target data.
Thus, we changed the default writing tier to SSD. For the more critical case that data exceeds the capacity of the local disk, the HDW database kernel would tell the C/C++ native Alluxio client liballuxio2 not to cache data touched by those running queries. Many sophisticated Alluxio read/write/cache strategies are determined by the HDW database kernel dynamically.
As a Relational Database Management System (RDBMS), HDW fully supports the ACID distributed transaction. Different from many HDFS-based big data analytics systems whose atomic write is implemented by leveraging HDFS’s atomic rename feature, HDW’s support of transaction is built in the database kernel. A transaction commits successfully only when all open-for-write files are persistent on object storage successfully; otherwise, the transaction is aborted, and all open-for-write files (including data) are cleaned up.
To guarantee data durability, all Alluxio writes are directly written to object storage without caching.
We use the TPC Benchmark™H (TPC-H) to evaluate our solution. TPC-H is a decision support benchmark, consisting of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance.
The purpose of this benchmark is not to show how fast HDW can run against TPC-H queries, but to show the effectiveness of Alluxio improving I/O performance. Thus, in the following benchmark tests, we set HDW’s execution engine to exactly the same as the one of open source Greenplum, rather than the proprietary one. Also, this evaluation was conducted about one year ago, using the first stable version of Alluxio-based HDW, which had the same version of the database kernel as the one of the version 5.0 Greenplum.
- DataSet Size: 100GB
- Cluster: 1 master node, 8 segment (computation) nodes
- Master Node: QingCloud Super-High Performance Instance with 2 CPU, 4GB Memory
- Segment Node: QingCloud Super-High Performance Instance with 4 CPU, 8GB Memory
- Alluxio: Version 1.5.0 with 2 GB Memory on each worker node
- Greenplum 5.0: We take this version of open source Greenplum as the baseline.
- HashData: HDW reads the target data cached in Alluxio tiers loaded from remote object storage.
Based on our observation, the I/O throughput of OSS is at least 3x lower than that of the local disk. However, from this test result, there is only about 30% performance degradation. This is because TPC-H is an OLAP query benchmark, which contains more CPU intensive tasks than I/O intensive ones. In general, the cold read only happens when data is touched for the first time. With the Alluxio caching system, the number of cold read operations was greatly reduced. As shown in Figure 3, the performance of HDW with Alluxio cached read is equally the same as that of the traditional shared-nothing MPP, meanwhile enabling all the benefits using object storage.
In this article, we show that, through a carefully designed storage and caching layer, a cloud-native data warehouse can leverage Alluxio to eliminate the performance penalty of object storage, while enjoying its scalability and cost-effectiveness.