Data Orchestration: Simplifying Data Access for Analytics

October 5, 2022

Hope Wang

Originally published on Eckerson.com: https://www.eckerson.com/articles/data-orchestration-simplifying-data-access-for-analytics

Data modernization initiatives often leave enterprises with distributed datasets that impede analytics projects. As enterprises migrate to the cloud and adopt new types of applications, data stores, and infrastructure, they still leave residual data in its original location. This results in far-flung silos that can be slow, complex, and expensive to analyze. As business demands for analytics rise, along with cloud costs, enterprises need to rationalize how they access and process distributed data. They cannot afford to replicate entire datasets or rewrite software every time they study data in more than one location.

Enterprises can overcome these challenges in two ways. First, many data teams take a surgical approach. They select and integrate only the data subsets, compute engines, and tools they need to support a given analytics project within their performance and cost SLAs. Wherever possible, they place each project’s elements in a single cloud or data center, for example by replicating datasets. They make periodic updates to optimize workloads, for example by moving bits of data, adjusting compute engines, or tuning applications. They monitor results to keep a lid on the cost of data transfers and cloud computing. The surgical approach helps performance, but it takes time, effort, and money, and it can lead to errors.

The opportunity for data orchestration

Second, enterprises can overcome these challenges by implementing a data orchestration platform. This is a virtualization layer that sits between compute and storage infrastructure in distributed, heterogeneous environments. It enables various compute engines (for example, the Presto query engine or the PyTorch machine learning framework running on a compute cluster) to access data in storage systems such as Amazon S3, Google Cloud Storage, or HPE storage. Data orchestration includes a unified namespace, application programming interfaces (APIs), caching, centralized metadata, and a security framework.

Data platform engineers use data orchestration to gain simple, flexible, and high-speed access to distributed data for modern analytics and AI projects. Data orchestration helps unify silos and optimize workloads, independently of where compute engines or physical data reside in hybrid and multi-cloud environments.

How it works

So what might this look like? As an example, here is a quick summary of the elements at work in the data orchestration platform offered by Alluxio.

A unified namespace gives applications one interface for accessing data across various locations.

APIs support dynamic communication between applications and storage. An application can switch data stores without re-coding.

Caching high-priority data near the compute engine for a given workload helps speed performance and avoid the need for bulk replication.

Centralized metadata for objects such as tables and files, including their storage locations and security permissions, simplifies administration and oversight while maintaining consistency between the data orchestration layer and the underlying storage systems.

A security framework integrates with Apache Ranger to authenticate the identities of data consumers and authorize their access to data objects.
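
To make the first three elements more concrete, here is a minimal Python sketch of a unified namespace with a read-through cache. It is a toy illustration and not Alluxio's API: the names UnifiedNamespace, InMemoryStore, mount, and read are all hypothetical, and the in-memory stores merely stand in for real backends such as an object store or an on-premises file system.

```python
from typing import Dict, Protocol


class Store(Protocol):
    """Hypothetical backend interface: return the bytes stored under a key."""

    def read(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a real storage system (assumption for illustration only)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = objects
        self.reads = 0  # how many times the backend itself was hit

    def read(self, key: str) -> bytes:
        self.reads += 1
        return self.objects[key]


class UnifiedNamespace:
    """Maps logical path prefixes to stores and caches what it reads."""

    def __init__(self) -> None:
        self._mounts: Dict[str, Store] = {}
        self._cache: Dict[str, bytes] = {}

    def mount(self, prefix: str, store: Store) -> None:
        """Attach (or re-attach) a store under a logical prefix."""
        prefix = prefix.rstrip("/") + "/"
        self._mounts[prefix] = store
        # Drop cached entries under this prefix so stale data is not served.
        for path in [p for p in self._cache if p.startswith(prefix)]:
            del self._cache[path]

    def read(self, path: str) -> bytes:
        """Serve from cache if possible, otherwise read through to the store."""
        if path in self._cache:
            return self._cache[path]
        # Longest-prefix match picks the most specific mount point.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix):
                data = self._mounts[prefix].read(path[len(prefix):])
                self._cache[path] = data
                return data
        raise FileNotFoundError(path)


if __name__ == "__main__":
    on_prem = InMemoryStore({"2022/q3.csv": b"region,revenue\n"})
    cloud = InMemoryStore({"2022/q3.csv": b"region,revenue\n"})

    ns = UnifiedNamespace()
    ns.mount("/sales", on_prem)
    ns.read("/sales/2022/q3.csv")  # read through to the on-premises store
    ns.read("/sales/2022/q3.csv")  # served from the cache

    ns.mount("/sales", cloud)      # switch stores; the application path is unchanged
    ns.read("/sales/2022/q3.csv")
    print(on_prem.reads, cloud.reads)
```

The application only ever uses the logical path /sales/2022/q3.csv. Remounting the prefix onto a different store changes where the bytes come from without any change to the calling code, which is the property the unified namespace and APIs are meant to provide.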

Use case: migrating to the cloud

Let’s consider how data orchestration would support the common use case of a cloud migration project. When an enterprise wants to migrate its data infrastructure from an on-premises data center to the cloud, or from one cloud service provider to another, accessing data at high speed becomes complex. The data platform engineer might need to keep analytics projects working and make silos of data available to applications and compute engines. They have to juggle many moving parts, making updates on the fly without disrupting existing workloads or business activities.

Data orchestration can help reduce the effort, risk, and cost of supporting such use cases in a hybrid, multi-cloud world. Data orchestration offers a consistent virtualization layer between compute and storage, enabling platform migration without impacting the application layer. The unified namespace, APIs, centralized metadata, and security framework make these elements portable between the data center and the cloud, or even across multiple clouds.

For example, an existing ML application running on a TensorFlow compute engine, with an Amazon EC2 cluster underneath, can find and start processing new data in the cloud right away. The data platform engineer can grant secure access to the data scientists who need to manipulate this data for their ML application. The application does not know or care that the new data now sits in Amazon S3 rather than in Dell storage on premises. Over time, the data orchestration layer observes which data is frequently accessed and caches that data in a nearby EC2 cluster to help meet SLAs for low-latency ML outputs.
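
The "observe, then cache" behavior in this example can be sketched in a few lines of Python. This is only one possible policy, assumed here for illustration: the HotDataCache class, its hot_threshold parameter, and the fetch callback are hypothetical names, and a real platform would also need eviction and capacity limits.

```python
from collections import Counter
from typing import Callable, Dict


class HotDataCache:
    """Caches an object only after it has been read often enough to count as hot."""

    def __init__(self, fetch: Callable[[str], bytes], hot_threshold: int = 3) -> None:
        self._fetch = fetch                  # reads from remote storage (assumed callback)
        self._hot_threshold = hot_threshold  # accesses needed before caching
        self._accesses: Counter = Counter()
        self._cache: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        self._accesses[path] += 1
        if path in self._cache:
            return self._cache[path]         # fast path: data is near the compute
        data = self._fetch(path)             # slow path: go to remote storage
        if self._accesses[path] >= self._hot_threshold:
            self._cache[path] = data         # promote frequently read data
        return data

    def is_cached(self, path: str) -> bool:
        return path in self._cache


if __name__ == "__main__":
    remote_calls = []

    def fetch_from_remote(path: str) -> bytes:
        remote_calls.append(path)
        return b"training-batch"

    cache = HotDataCache(fetch_from_remote, hot_threshold=3)
    for _ in range(5):
        cache.read("/training/batch-001")
    cache.read("/training/batch-999")        # read once, so it stays remote

    print(len(remote_calls), cache.is_cached("/training/batch-001"))
```

With a threshold of three, the first three reads of a path go to remote storage and later reads are served locally, while data that is read only once never occupies cache space.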

This is one of several use cases for data orchestration. Other use cases include rolling out new AI/ML projects that require data access through different APIs and must support high volumes, varieties, and velocities of data; assisting mergers and acquisitions that require unified analytics of distributed data; and maintaining application uptime during platform expansions. In many such cases, the surgical approach described above can prove more difficult, whereas data orchestration can help reduce the effort, risk, and cost involved.

That is the vision of data orchestration. Can enterprises turn vision into reality, and if so, how? Our next blog on this topic will explore typical architectures that support data orchestration, and the following blog will recommend guiding principles for successful implementations.

About the author

Kevin Petrie is the VP of Research at Eckerson Group. Kevin's passion is to decipher what technology means to business leaders and practitioners. He has invested 25 years in technology, as an industry analyst, writer, instructor, product marketer, and services leader. A frequent public speaker and accomplished writer, Kevin has a decade of experience in data management and analytics. He launched, built and led a profitable data services team for EMC Pivotal in the Americas and EMEA, implementing data warehouse and data lake platforms for Fortune 2000 enterprises. More recently he ran field training at the data integration software provider Attunity, now part of Qlik.

Kevin has co-authored two books, Streaming Change Data Capture: A Foundation for Modern Data Architectures (O'Reilly, 2018) and Apache Kafka Transaction Data Streaming for Dummies (Wiley, 2019). He also serves as a data management instructor at eLearningCurve. Kevin has a B.A. from Bowdoin College and an MBA from the University of California at Berkeley. A bookworm and outdoor fitness nut, Kevin enjoys kayaking, mountain biking, and skiing with his wife and three boys.
