How does Cloudera’s hybrid cloud approach work and how does it compare with Alluxio’s “zero-copy” bursting approach?

Cloudera recently introduced the Enterprise Data Cloud platform, which helps enterprises address on-premises capacity constraints by bursting workloads to the cloud, absorbing unpredictable demand and complementing data center capacity.

The Lift and Shift approach

To burst a workload into the cloud, Cloudera’s Management Console lets you run your existing Cloudera platform in the cloud through a lift-and-shift approach. You first create a new Hadoop cluster in the cloud where you want to provision it. You can pick the services and frameworks to provision individually, or choose from existing blueprints.

Next, you pick the instance types you would like to use on AWS or another public cloud, for example m4.2xlarge.

However, the data for this cluster needs to be replicated back and forth between the on-prem cluster and the cloud, so you need to set up a replication policy.

You have to specify a target location for the data you copy from your on-prem Hadoop cluster, for example S3. Once the replication is scheduled, you have to wait for your data to be copied over before the workload can run.

What is “Zero-copy” bursting? 

The “Zero-copy” bursting approach doesn’t actually copy data. Instead, it syncs data, and the user has the option of running only in memory, so the data is never persisted to the cloud object store. Frameworks like Spark, Presto and Hive run seamlessly on top of Alluxio with the data in memory.
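
The idea of reading on demand and holding data only in memory can be illustrated with a toy read-through cache. This is a minimal sketch, not Alluxio’s API or implementation: `OnPremStore`, `InMemoryCache` and the file paths are hypothetical names invented for the example.

```python
# Toy model only: hypothetical names, not Alluxio's API or implementation.


class OnPremStore:
    """Stands in for the on-premises Hadoop cluster (hypothetical)."""

    def __init__(self, files):
        self.files = dict(files)
        self.bytes_served = 0  # bytes sent over the network to the cloud

    def read(self, path):
        data = self.files[path]
        self.bytes_served += len(data)
        return data


class InMemoryCache:
    """Read-through cache: fetch on first access, keep only in memory."""

    def __init__(self, source):
        self.source = source
        self.memory = {}  # nothing is written to a cloud object store

    def read(self, path):
        if path not in self.memory:
            self.memory[path] = self.source.read(path)
        return self.memory[path]


on_prem = OnPremStore({
    "/sales/2019.csv": b"a" * 100,
    "/sales/2020.csv": b"b" * 200,
    "/logs/raw.log": b"c" * 1000,
})
cache = InMemoryCache(on_prem)

# A cloud job that touches a single file, twice.
first = cache.read("/sales/2020.csv")
second = cache.read("/sales/2020.csv")

print(on_prem.bytes_served)  # 200: fetched once, then served from memory
print(sorted(cache.memory))  # ['/sales/2020.csv']
```

In this sketch only the file the job actually reads crosses the network, and the second read is served from memory without another transfer.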

How does the “Zero-copy” bursting approach compare with Cloudera’s lift and shift approach? 

While Cloudera’s Management Console provides a great UI to help with lift and shift, the approach itself has several drawbacks.

  1. For many workloads it is not easy to identify which datasets are needed, so deciding what to replicate can be challenging. 
  2. Significantly more data may be transferred than the workload actually requires. This increases network bandwidth utilization as well as storage costs in the cloud. 
  3. Since a new copy of the data is created, any changes made on the Hadoop cluster while the job is running in the cloud can leave the copies out of sync, and you may end up with stale or incorrect data. 
  4. If compliance or security regulations prevent data from being persisted in the cloud, the Cloudera lift and shift approach does not work, as it copies data to object stores like S3, GCS and Azure Blob Storage.
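
Points 2 and 3 can be made concrete with a small back-of-the-envelope model. This is only an illustration under stated assumptions: the dataset paths, sizes and version numbers are made up, and it is not a measurement of either product.

```python
# Illustrative model with made-up sizes; not a measurement of either product.

# path -> size in GB on the on-prem cluster (hypothetical values)
dataset = {
    "/warehouse/orders": 500,
    "/warehouse/customers": 50,
    "/warehouse/clickstream": 4000,
}
# Paths the bursted workload actually reads (hypothetical)
workload_reads = ["/warehouse/orders", "/warehouse/customers"]

# Lift and shift: the whole dataset is replicated before the job starts,
# because the required subset was not identified up front.
replicated_gb = sum(dataset.values())

# On-demand access: only the paths the workload reads are transferred.
on_demand_gb = sum(dataset[path] for path in workload_reads)

print(replicated_gb, on_demand_gb)  # 4550 550

# Staleness: a replica is a snapshot taken at copy time.
source_versions = {"/warehouse/orders": 1}
replica_versions = dict(source_versions)  # copy made by replication
source_versions["/warehouse/orders"] = 2  # on-prem update during the job

stale = [path for path in replica_versions
         if replica_versions[path] != source_versions[path]]
print(stale)  # ['/warehouse/orders']
```

With these assumed numbers, replication moves 4550 GB for a job that needs 550 GB, and the replica no longer matches the source once the on-prem data changes.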

To learn more, see Zero-Copy Bursting