How do you access data from both HDFS and cloud storage?


Problem

Big data analytics sometimes needs to process input data from two different storage systems at the same time. For instance, a data scientist may need to join two tables, one from an HDFS cluster and one from S3.

Existing Solutions

Some computation frameworks can connect to several storage systems, including HDFS and popular cloud storage services such as S3 or Azure Blob Storage. For example, in Hive one can point each external table to a different storage system in the Hive metastore. While straightforward to set up, this approach may have performance issues. For example, running the computation in an on-premise data warehouse but reading data from cloud storage can be slow, even if the compute is colocated with HDFS. Similarly, a workload running in a cloud environment may read cloud storage efficiently, but fetching data from the on-premise HDFS can be slow. The problem can get worse when the same input data is read repeatedly, as in machine-learning or scientific computing algorithms.

Another common solution is to prepare the input data by moving it from one storage system (e.g. the cloud storage) to HDFS. This typically requires data engineers to build ETL pipelines beforehand and adds complexity in data management and synchronization.

How Alluxio Helps

Alluxio can serve as a data abstraction layer on top of both HDFS and S3 or other cloud storage systems, providing a unified file system namespace. In this case, data applications can assume one “logical file system” in which different directories link to external storage systems. Reading data from remote storage is transparent to the user application. Alluxio also provides benefits like data caching to speed up repeated reads on the same set of files.
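
To make the idea of one logical file system more concrete, here is a minimal Python sketch of a mount table with a read-through cache. It is a toy model for illustration only, not Alluxio's implementation or API: the class UnifiedNamespace, its methods, the host name namenode:8020, and the bucket example-bucket are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class UnifiedNamespace:
    """Toy model of one logical file system backed by several storage systems.

    Hypothetical illustration; this is not how Alluxio is implemented.
    """

    # Maps a logical directory (e.g. "/warehouse") to a storage URI prefix.
    mounts: Dict[str, str] = field(default_factory=dict)
    # Read-through cache keyed by logical path.
    cache: Dict[str, bytes] = field(default_factory=dict)
    # Counts how many reads actually went to the underlying storage.
    remote_reads: int = 0

    def mount(self, logical_dir: str, storage_uri: str) -> None:
        """Link a logical directory to an external storage location."""
        self.mounts[logical_dir.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        """Translate a logical path to a storage URI (longest mount wins)."""
        best: Optional[str] = None
        for logical_dir in self.mounts:
            covers = (logical_path == logical_dir
                      or logical_path.startswith(logical_dir + "/"))
            if covers and (best is None or len(logical_dir) > len(best)):
                best = logical_dir
        if best is None:
            raise FileNotFoundError(f"no mount point covers {logical_path}")
        return self.mounts[best] + logical_path[len(best):]

    def read(self, logical_path: str, fetch: Callable[[str], bytes]) -> bytes:
        """Return file contents, fetching from storage only on a cache miss."""
        if logical_path not in self.cache:
            self.remote_reads += 1
            self.cache[logical_path] = fetch(self.resolve(logical_path))
        return self.cache[logical_path]


# Example setup: one directory per storage system (URIs are made up).
ns = UnifiedNamespace()
ns.mount("/warehouse", "hdfs://namenode:8020/user/hive/warehouse")
ns.mount("/cloud", "s3://example-bucket/datasets")


def fake_fetch(uri: str) -> bytes:
    """Stand-in for a real storage client; just echoes the URI it was asked for."""
    return f"contents of {uri}".encode()


first = ns.read("/cloud/events/part-0.parquet", fake_fetch)
second = ns.read("/cloud/events/part-0.parquet", fake_fetch)  # served from cache
```

The application only ever sees paths such as /warehouse/... and /cloud/..., while the mount table decides which storage system is contacted. The second read of the same path does not call fake_fetch again, which mirrors, in a very simplified way, why caching helps workloads that read the same files repeatedly.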