What is Apache Hadoop
If you’re new to building big data applications, Apache Hadoop is a distributed framework for managing data processing and storage for big data applications running in clustered systems. It consists of 5 modules – a distributed file system (aka HDFS or Hadoop Distributed File System), MapReduce for parallel processing of datasets, YARN which manages resources of the systems storing the data and running the analysis, HBase which is designed for applications that need random, real-time, read/write access to data, and Hive which is a SQL-on-Hadoop engine that offers a SQL interface on data stored in HDFS.
Using Hadoop for Big Data Workloads
It is becoming imperative for enterprises to decrease cost of software, hardware, storage and administration time associated with growing amounts of data. With the capability to store more data at a fraction of the cost, solutions like Hadoop have become quite popular for big data workloads.
Why Offload CPU and I/O from Hadoop
Hadoop has its own set of challenges, and here are a few reasons why you might consider offloading CPU and I/O from Hadoop –
- Increased I/O load on Hadoop – When data is written into HDFS, each data block is by default replicated 3 times. This is so the cluster can sustain the loss of a disk, or even a server, and the data is not lost forever. The cost of this redundancy is that it requires 3 times more storage capacity, and that distributing data across the servers can create I/O bottlenecks. Additionally, MapReduce jobs generate ‘shuffle traffic’ when data is moved between Hadoop servers, creating more I/O problems. Add to it a heavy I/O app and you have a system that is slowly stalling.
- Low memory for high-performing jobs – In order to optimize the performance of the jobs, data must be stored in main memory. You must also allocate enough heap memory for your job executors or else the JVM Garbage Collector can become resource-hungry. Additionally, running CPU intensive tasks such as workloads reading from parquet file format can peak CPU utilization. To solve for this, jobs may need to be serialized, or broken down into sub tasks that can operate on smaller chunks of data that can fit in memory. The net result is lower workload throughput.
How Alluxio can help
Alluxio is not a replacement for HDFS. Instead, it is a new abstraction layer between compute and distributed/cloud storage systems including HDFS providing a unified file system namespace. In this case, you can run a separate compute cluster either in the public cloud or in another datacenter to offload the additional CPU and the read-intensive I/O operations. This offloading will free up more resources in the Hadoop cluster. With Alluxio, you also get workload acceleration — hot data is cached in memory, which greatly accelerates the performance of MapReduce jobs that are accessing this data. In addition, Alluxio delivers predictable performance, allowing the system to guarantee a certain quality of service.
If you’re a MapR user, Alluxio also makes it easy to offload your MapR/HDFS compute to any object store, cloud or on-prem, and run all of your existing job as-is on Alluxio + the object store of your choosing.