How does the WANdisco Hybrid Data Lake Solution in AWS compare to zero-copy bursting to the cloud?

What is the WANdisco Hybrid Data Lake?

WANdisco Fusion replicates data from on-premises Hadoop Distributed File System (HDFS) clusters to an AWS data lake built on S3. It provides continuous data transfer and synchronization to help keep the two copies consistent, so you can run Hadoop applications, using AWS EMR or other methods, against the copied data in S3. WANdisco also provides a single virtual namespace to integrate multiple stores. WANdisco primarily implements this with a managed distCP-like approach.
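To make the "distCP-like" idea concrete, here is a minimal sketch of the general copy-and-sync pattern: compare a listing of the source with a listing of the target, then decide what to copy and what to delete. It is an illustration only, not WANdisco's implementation. The function name `plan_sync`, the example paths, and the use of `(size, mtime)` as the change signal are all assumptions made for this example.

```python
def plan_sync(source, target):
    """Plan the work needed to make `target` mirror `source`.

    Both arguments are dicts mapping a file path to a (size_bytes, mtime)
    tuple. This listing format is hypothetical, chosen for illustration.
    """
    # Copy anything that is missing from the target or whose metadata differs.
    to_copy = sorted(p for p, meta in source.items() if target.get(p) != meta)
    # Delete anything that exists in the target but no longer in the source.
    to_delete = sorted(p for p in target if p not in source)
    return to_copy, to_delete


# Hypothetical listings of an on-prem HDFS directory and its S3 copy.
on_prem = {"/data/a.parquet": (100, 1), "/data/b.parquet": (200, 2)}
s3_copy = {
    "/data/a.parquet": (100, 1),
    "/data/b.parquet": (150, 1),
    "/data/old.parquet": (50, 0),
}

to_copy, to_delete = plan_sync(on_prem, s3_copy)
print("copy:", to_copy)
print("delete:", to_delete)
```

The point of the sketch is that a copy-based approach has to keep track of, move, and store a second full set of files, and that work repeats whenever the source changes.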

It requires WANdisco software to be installed both on-premises and in the cloud on EC2 instances:

[Figure: WANdisco Fusion deployed on-premises and on EC2 instances in AWS]

How is the Alluxio solution different?

Alluxio integrates with the analytics and AI workload itself, so using Alluxio for zero-copy bursting to AWS has one fundamental difference: there is no need to replicate, copy, sync, store, or monitor a second set of data in the cloud. The data is not stored on a cloud storage system such as S3 or GCS; it resides in memory in the Alluxio cache. This simplifies the architecture: your data can stay on-premises while the compute runs elastically in the cloud. With Alluxio, read-heavy analytic workloads can perform as well as they would locally once the hot data is cached. How? Alluxio tiers intelligently across worker node RAM, SSDs, and HDDs and caches only the data the job needs, that is, the hot data.
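The caching behavior described above can be pictured with a toy read-through cache. This is a conceptual sketch, not Alluxio's code or API: the class name `TieredReadThroughCache`, the tier names, the block-count capacities, and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredReadThroughCache:
    """Toy read-through cache with ordered tiers, fastest first."""

    def __init__(self, remote_store, tier_capacities):
        # remote_store: dict-like stand-in for the on-premises data store.
        # tier_capacities: list of (tier_name, max_blocks), fastest first.
        self.remote_store = remote_store
        self.tiers = [(name, cap, OrderedDict()) for name, cap in tier_capacities]
        self.remote_reads = 0

    def read(self, key):
        """Return (data, where_it_was_served_from)."""
        for name, _cap, blocks in self.tiers:
            if key in blocks:
                data = blocks.pop(key)
                self._put(0, key, data)  # promote the hot block to the top tier
                return data, name
        # Cache miss: fetch from the remote store once, then keep it cached.
        self.remote_reads += 1
        data = self.remote_store[key]
        self._put(0, key, data)
        return data, "REMOTE"

    def _put(self, level, key, data):
        if level >= len(self.tiers):
            return  # fell out of the last tier; the remote store still has it
        _name, cap, blocks = self.tiers[level]
        blocks[key] = data
        if len(blocks) > cap:
            # Demote the least recently used block to the next tier down.
            old_key, old_data = blocks.popitem(last=False)
            self._put(level + 1, old_key, old_data)


# Hypothetical on-prem dataset; the job only ever touches blocks "a" and "b".
on_prem = {"a": b"A", "b": b"B", "c": b"C", "d": b"D"}
cache = TieredReadThroughCache(on_prem, [("MEM", 1), ("SSD", 1)])

for key in ["a", "a", "b", "a"]:
    _data, served_from = cache.read(key)
    print(key, "served from", served_from)
print("remote reads:", cache.remote_reads)
```

Blocks the job never reads ("c" and "d" here) are never transferred, and repeated reads of the same block are served from a cache tier rather than from the remote store.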

There are other differences as well. Here’s a quick table that covers the main considerations:

	WANdisco + AWS Data Lake	Alluxio for Zero-Copy Bursting
Namespace	Unified namespace for AWS S3	Unified namespace across any storage (e.g., HDFS and S3)
Data Storage	On-prem and S3	On-prem only, no copies stored in cloud
Pay-As-You-Go (PAYG) options	Yes, via AWS Marketplace	Yes, via AWS Marketplace
Analytics	Bring your own	Integrated with EMR or Bring your own
Data Synchronization	Yes	Yes
Deployment model	Install on-prem and in AWS	Install in AWS only
Costs	Fusion SW, HW, and instance costs plus AWS EMR costs	Alluxio SW only plus AWS EMR costs
Open-source based	No	Yes, Alluxio is based on the Alluxio open source project

As the table shows, there are major differences between using Alluxio and using a data copy solution from WANdisco. The same holds for any solution that copies data to the cloud and runs analytics on the copy.

Here is an example of the Alluxio architecture from the DBS Bank talk at the O’Reilly Strata Data Conference 2019 in New York City. Note that DBS also uses Alluxio in its datacenter for consistent high performance of on-prem analytic workloads; that is not required for zero-copy bursting to the cloud:

[Figure: slide from the DBS Bank talk showing the Alluxio architecture]

The left side of the above slide is the on-prem private cloud in Singapore. The right side is the AWS Singapore region.

For more information, see Zero-copy bursting.

