What is the WANdisco Hybrid Data Lake?
The WANdisco Fusion solution replicates data between on-premises Hadoop Distributed File System (HDFS) clusters and an AWS data lake built on S3. It provides continuous data transfer and synchronization to help maintain data consistency, which lets you run Hadoop applications, using AWS EMR or other methods, against the copied data in S3. WANdisco also provides a single, virtual namespace to integrate multiple stores. WANdisco primarily implements this with a managed DistCp-like approach.
It requires WANdisco software to be installed both on-premises and in the cloud on EC2 instances.
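To make the copy-and-sync idea concrete, here is a minimal sketch of the planning step a copy-based tool has to perform: compare the on-premises listing with the cloud listing and decide what to transfer or remove. It is an illustration only, not WANdisco's implementation. The `plan_sync` function, the manifest format (path mapped to size and modification time), and the sample paths are all assumptions made for this example.

```python
def plan_sync(source, target):
    """Return (to_copy, to_delete) so that target ends up mirroring source.

    source and target are hypothetical manifests: dicts mapping a file
    path to a (size, mtime) tuple. Real tools use richer metadata.
    """
    # Copy anything missing from the target or whose metadata differs.
    to_copy = sorted(p for p, meta in source.items() if target.get(p) != meta)
    # Delete anything in the target that no longer exists at the source.
    to_delete = sorted(p for p in target if p not in source)
    return to_copy, to_delete


# Hypothetical listings of an on-prem HDFS directory and its cloud copy.
on_prem = {"/logs/a": (10, 100), "/logs/b": (20, 105), "/logs/c": (5, 110)}
in_cloud = {"/logs/a": (10, 100), "/logs/b": (20, 101), "/logs/old": (7, 90)}

to_copy, to_delete = plan_sync(on_prem, in_cloud)
print("copy:", to_copy)
print("delete:", to_delete)
```

Even in this toy form, the second copy has to be listed, compared, and reconciled on an ongoing basis, which is the overhead a copy-based design carries.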
How is the Alluxio solution different?
First of all, Alluxio integrates with the analytics and AI workload. Using Alluxio for zero-copy bursting to AWS has one fundamental difference: there is no need to replicate, copy, sync, store, or monitor a second set of data in the cloud. The data is not stored on a cloud storage system such as S3 or GCS; instead, it resides in memory in the Alluxio cache. This enables a simplified architecture: your data can reside on-premises while the compute runs elastically in the cloud. With Alluxio, performance of read-heavy analytic workloads will be as good as local. How? Alluxio uses intelligent multi-tiering across worker node RAM, SSDs, and HDDs to cache only the data the job needs, that is, the hot data.
There are other differences as well. Here is a quick table covering the main considerations:
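The tiered caching idea can be sketched as a small read-through cache. This is a toy model under stated assumptions, not Alluxio's API or its actual eviction policy: the `TieredCache` class, the block-count capacities, the least-recently-used demotion rule, and the `fetch_remote` callback standing in for on-premises storage are all hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Toy read-through cache with ordered tiers (fastest first).

    Hypothetical illustration only; not Alluxio's API or eviction policy.
    """

    def __init__(self, capacities, fetch_remote):
        # capacities: dict of tier name -> max number of blocks, fastest first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]
        self.fetch_remote = fetch_remote  # stands in for a read from on-prem storage
        self.remote_reads = 0

    def read(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote the hot block to the top tier
                return data
        # Miss in every tier: fetch from the remote store once, then cache it.
        self.remote_reads += 1
        data = self.fetch_remote(block_id)
        self._insert(0, block_id, data)
        return data

    def _insert(self, tier_index, block_id, data):
        if tier_index >= len(self.tiers):
            return  # dropped from the last tier; the data still lives on-prem
        _, capacity, blocks = self.tiers[tier_index]
        blocks[block_id] = data
        if len(blocks) > capacity:
            # Demote the least recently used block to the next, slower tier.
            old_id, old_data = blocks.popitem(last=False)
            self._insert(tier_index + 1, old_id, old_data)

    def location(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


cache = TieredCache({"RAM": 2, "SSD": 2}, fetch_remote=lambda b: f"data-{b}")
for block in ["a", "b", "a", "c", "a"]:
    cache.read(block)
print(cache.remote_reads, cache.location("a"), cache.location("b"))
```

In this run, block `a` is read three times but fetched from the remote store only once. Repeated reads of hot data are served from the cache tiers, while colder blocks drift down to slower tiers and eventually out of the cache.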
| Consideration | WANdisco + AWS Data Lake | Alluxio for Zero-Copy Bursting |
|---|---|---|
| Namespace | Unified namespace for AWS S3 | Unified namespace across any storage (e.g., HDFS and S3) |
| Data Storage | On-prem and S3 | On-prem only, no copies stored in cloud |
| Pay-As-You-Go (PAYG) options | Yes, via AWS Marketplace | Yes, via AWS Marketplace |
| Analytics | Bring your own | Integrated with EMR or bring your own |
| Deployment model | Install on-prem and in AWS | Install in AWS only |
| Costs | Fusion SW, HW, and instance costs plus AWS EMR costs | Alluxio SW only plus AWS EMR costs |
| Open-source based | No | Yes, Alluxio is based on the Alluxio open source project |
As the table shows, there are major differences between using Alluxio and a data copy solution from WANdisco. The same holds for any solution that copies data to the cloud and runs analytics on the copy.
Here is an example of the Alluxio architecture from the DBS Bank talk at the O’Reilly Strata Data Conference 2019 in New York City. Note that DBS also uses Alluxio in its datacenter for consistent high performance of on-prem analytic workloads. That is not required for zero-copy bursting to the cloud:
The left side of the above slide is the on-prem private cloud in Singapore. The right side is the AWS Singapore region.
For more information see: Zero-copy bursting.