A Better Big Data Ecosystem with Hadoop and Hitachi Content Platform, Part 1
August 28, 2018
In this blog, I’ll explore the challenges our customers face when storing data long term in Hadoop, and discuss what the Hitachi Content Platform team is doing to help solve these challenges with Alluxio.
The Big Data Problem with Hadoop
Data is at the center of our digital world, and for years Hadoop has been the go-to data processing platform because it is fast and scalable. But while Hadoop has solved the data storage and processing problem for roughly the last decade, it does so by scaling storage and compute capacity in lockstep. As a result, Hadoop environments keep expanding compute capacity well beyond their needs as more and more of their storage is consumed by older, inactive data.
Although HDFS is effective at storing small-to-mid-size repositories of data, it becomes far more costly and inefficient as storage needs expand, since adding storage means adding compute as well. HDFS also relies on data replication (storing multiple copies of each block) for protection. As data sets grow into the petabytes, the mounting cost of old data and idle compute in your Hadoop ecosystem becomes unsustainable.
Offloading Solution
Every storage administrator is thinking about how to reduce the cost of data storage while still getting the best performance out of their hardware. With this in mind, Apache Hadoop has been steadily improving its support for tiered storage, and Hadoop 2.6 added a number of enhancements. These features allow you to attach a storage policy to a directory, categorize it as Hot, Warm, Cold, or Frozen, and define how many block replicas to keep for that policy. Although storage administrators can reduce the number of copies of data they have to store, they are still left with compute sitting idle. This is where offloading data outside of HDFS can offer huge benefits.
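As a rough illustration, here is a minimal sketch of what tagging a cold directory looks like through Hadoop’s Java API. The NameNode address, paths, policy name, and replication count are illustrative assumptions for the example, not values from a specific deployment.

```java
// Minimal sketch: applying an HDFS storage policy to a directory and
// lowering the replication factor of a file inside it.
// Hostnames and paths are hypothetical.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TierColdData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem)
                FileSystem.get(new URI("hdfs://namenode:8020/"), conf);

        Path archive = new Path("/warehouse/events/archive");
        // New blocks under this directory land on ARCHIVE-tagged storage
        dfs.setStoragePolicy(archive, "COLD");
        // Replication is set per file; keep fewer copies of cold data
        dfs.setReplication(new Path(archive, "part-00000"), (short) 2);
    }
}
```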
How Can Object Storage Help Reduce My HDFS Footprint?
Object storage offers significant cost savings to customers by increasing density and providing greater control over data. Offloading data from Hadoop to an object store like Hitachi Content Platform (HCP) gives customers a new, cheaper storage tier. The Hitachi Content Solutions engineering team is working with Alluxio to bring in-memory caching and object store efficiencies to existing big data challenges.
Alluxio is a memory-speed virtual distributed storage layer that enables any application to interact with any data from any storage at memory speed. With Alluxio and HCP, HDFS applications can virtualize object storage and move data from HDFS to object storage through a single protocol and interface.
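To show what “a single protocol and interface” means in practice, the sketch below lets a Hadoop application resolve alluxio:// paths with the same FileSystem API it already uses for hdfs:// paths. It assumes the Alluxio client jar is on the application classpath; the master hostname and port are illustrative.

```java
// Minimal sketch: accessing the Alluxio namespace through the standard
// Hadoop FileSystem API. The master address is a placeholder.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AlluxioAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register Alluxio's Hadoop-compatible client for the alluxio:// scheme
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

        FileSystem fs = FileSystem.get(new URI("alluxio://alluxio-master:19998/"), conf);
        // Same calls the application already makes against hdfs:// paths
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
    }
}
```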
Why Hitachi Content Platform and Alluxio?
When Hitachi Content Platform is configured as an under store, or a mounted directory, in the Alluxio file system, applications can simplify and expand their data ecosystem. In this environment, Hadoop applications can read and write data to and from both HCP and HDFS, and moving data from HDFS to object storage becomes as simple as moving data from one directory to another.
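To make that concrete, here is a hedged sketch of such a move using the Hadoop FileSystem API, assuming HCP has been mounted into the Alluxio namespace under a hypothetical /hcp directory. Host names and paths are illustrative, not part of any particular deployment.

```java
// Minimal sketch: offloading a cold dataset from HDFS into an HCP-backed
// Alluxio directory. The /hcp mount point, hosts, and paths are assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class OffloadToHcp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf);
        FileSystem alluxio = FileSystem.get(new URI("alluxio://alluxio-master:19998/"), conf);

        Path coldData = new Path("/warehouse/events/year=2016");
        Path hcpTarget = new Path("/hcp/archive/events/year=2016");

        // Copy the directory tree, then remove the HDFS copy to reclaim capacity
        FileUtil.copy(hdfs, coldData, alluxio, hcpTarget, /* deleteSource */ true, conf);
    }
}
```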
With Alluxio caching, data can be recalled from HCP into the Alluxio in-memory file system on the Hadoop node, enabling memory-speed analytics with object store savings. With HCP and Alluxio, applications can unify data access protocols and offload cold data to cost-effective storage.

Looking Ahead
In Part 2, I’ll discuss how new functionality in Hadoop 3.1 brings object storage closer to the Hadoop ecosystem and how future functionality will continue to simplify big data management. Read the next blog post.
Check out our Demo Videos and download Alluxio to get started.