A Better Big Data Ecosystem with Hadoop and Hitachi Content Platform, Part 1

August 28, 2018

In this post, I’ll explore the challenges our customers face when storing data long term in Hadoop, and discuss what the Hitachi Content Platform team is doing, with the help of Alluxio, to solve them.

The Big Data Problem with Hadoop

Data is at the center of our digital world, and for years Hadoop has been the go-to data processing platform because it is fast and scalable. Hadoop has solved the data storage and processing problem for roughly the last 10 years, but it does so by scaling storage and compute capacity together: adding disk means adding nodes, and every node brings CPU and memory with it. As a result, Hadoop environments have continued to expand compute capacity well beyond their needs as more and more of the storage is consumed by older, inactive data.

Although HDFS is effective at storing small to mid-size repositories of data, it becomes far more costly and inefficient as storage needs expand, since growth requires adding both storage and compute. HDFS also relies on data replication (storing multiple copies of each block) for protection. As these data sets grow into the petabytes, the cost of old data and idle compute in your Hadoop ecosystem will become unsustainable.
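
To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the HDFS default replication factor of 3; the node size (48 TB of raw disk per node) and the data volumes are hypothetical figures chosen for illustration, not measurements from any customer environment.

```python
import math


def nodes_required(logical_tb, replication=3, raw_tb_per_node=48):
    """Data nodes needed to hold logical_tb of data at the given replication.

    raw_tb_per_node is a hypothetical node size used only for illustration.
    """
    raw_tb = logical_tb * replication
    return math.ceil(raw_tb / raw_tb_per_node)


active_tb = 200     # hypothetical: data that jobs actually read
inactive_tb = 1800  # hypothetical: older data that is rarely touched

all_in_hdfs = nodes_required(active_tb + inactive_tb)
active_only = nodes_required(active_tb)

print(all_in_hdfs, active_only)  # prints: 125 13
```

Under these assumptions, keeping everything in HDFS takes 125 nodes, while the active data alone would fit on 13. The CPU and memory in the remaining nodes exist mainly to host disks full of inactive data.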

Offloading Solution

Every storage administrator is thinking about how to reduce the cost of data storage while still getting the best performance out of their hardware. With this in mind, Apache Hadoop has been steadily improving its support for tiered storage, and Hadoop 2.6 added many improvements in this area. These features allow you to attach a storage policy to a directory, categorize it as Hot, Warm, Cold, or Frozen, and define how many block replicas of the data to keep for that policy. Although storage administrators can reduce the number of copies of data they have to store, they are still left with compute sitting idle. This is where offloading data outside of HDFS can offer huge benefits.

How Can Object Storage Help Reduce My HDFS Footprint?

Object storage offers significant cost savings to customers by increasing density and providing greater control over data. Offloading data from Hadoop to an object store like Hitachi Content Platform (HCP) enables customers to unlock a new, cheaper storage tier. The Hitachi Content Solutions engineering team is working with Alluxio to bring in-memory caching and object store efficiencies to existing big data challenges.

Alluxio is a memory-speed, virtually-distributed storage layer that enables any application to interact with any data from any storage at memory speed. With Alluxio and HCP, HDFS applications can virtualize object storage and move data from HDFS to object storage through a single protocol and interface.

Why Hitachi Content Platform and Alluxio?

When Hitachi Content Platform is configured as an under store or a mounted directory in the Alluxio filesystem, applications can simplify and expand their data ecosystem. In this environment, Hadoop applications can read and write data to and from both the HCP and Hadoop filesystems. Applications can move data from HDFS to object storage as simply as moving data from one directory to another.
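
The idea of several stores mounted under one namespace can be sketched with a toy model in Python. This is not the Alluxio API: the `UnifiedNamespace` class, the mount points `/hdfs` and `/hcp`, and the file paths are all hypothetical, and plain dictionaries stand in for HDFS and an HCP bucket. It only illustrates how a mount table lets a move between stores look like a move between directories.

```python
class UnifiedNamespace:
    """Toy mount table (hypothetical; NOT the Alluxio API)."""

    def __init__(self):
        # mount point -> dict standing in for that store's contents
        self.mounts = {}

    def mount(self, mount_point, store):
        self.mounts[mount_point.rstrip("/")] = store

    def _resolve(self, path):
        # Find every mount point that is a prefix of the path;
        # the longest match wins.
        matches = [m for m in self.mounts
                   if path == m or path.startswith(m + "/")]
        if not matches:
            raise FileNotFoundError(path)
        mount_point = max(matches, key=len)
        return self.mounts[mount_point], path[len(mount_point):]

    def write(self, path, data):
        store, key = self._resolve(path)
        store[key] = data

    def read(self, path):
        store, key = self._resolve(path)
        return store[key]

    def move(self, src, dst):
        # The caller sees a rename; underneath, the bytes change stores.
        src_store, src_key = self._resolve(src)
        dst_store, dst_key = self._resolve(dst)
        dst_store[dst_key] = src_store.pop(src_key)


hdfs_store, hcp_store = {}, {}  # stand-ins for HDFS and an HCP bucket
ns = UnifiedNamespace()
ns.mount("/hdfs", hdfs_store)
ns.mount("/hcp", hcp_store)

ns.write("/hdfs/logs/2017.csv", b"old,inactive,data")
ns.move("/hdfs/logs/2017.csv", "/hcp/archive/logs/2017.csv")
```

After the move, the file is gone from the HDFS stand-in and readable at its new path under `/hcp`. In a real deployment, the corresponding operations would be performed with the Alluxio shell (`alluxio fs mount` and `alluxio fs mv`) rather than with anything resembling this class.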

With Alluxio caching, data can be recalled from HCP to the Alluxio in-memory file system on the Hadoop node, enabling memory-speed analytics with object store savings. With HCP and Alluxio, applications can unify data access protocols and offload cold data to cost-effective storage.

Looking Ahead

In Part 2, I’ll discuss how the new functionality in Hadoop 3.1 brings object storage closer to the Hadoop ecosystem and how future functionality will continue to simplify big data management. Read the next Blog Post.

Check out our Demo Videos and download Alluxio to get started.
