Originally published on Eckerson.com: https://www.eckerson.com/articles/architecting-data-orchestration-four-use-cases
ABSTRACT: This blog explores four use cases for data orchestration and examples of the supporting architectural elements.
Modern analytics projects rely on a hodgepodge of compute clusters, data stores, and pipelines, flung across countries and continents. Enterprises struggle to meet performance SLAs without replicating lots of data or moving and re-coding applications.
As described in our first blog, orchestration software offers an alternative. It helps enterprises both simplify data access and accelerate performance in these hybrid, multi-cloud environments. This blog explores four use cases for data orchestration and examples of the supporting architectural elements.
Let’s start with the definition. Data orchestration software presents a single, location-agnostic view of enterprise data to analytics applications and the compute engines on which they run. It connects compute engines to data stores wherever they reside, across data centers, regions, and clouds. And it caches selected subsets of data right next to the compute that needs them, helping speed up query response times.
Data orchestration connects compute engines to data stores, across any physical location, to simplify data access and accelerate performance
With data orchestration, an enterprise can optimize performance, unify silos, and reduce risk. Data orchestration includes a unified namespace, open application programming interfaces (APIs), storage tiering, centralized metadata, and a security framework. Enterprises can implement these elements on their own, or use the integrated tool offered by Alluxio.
- The unified namespace gives applications a single interface for accessing data across various locations.
- Application programming interfaces (APIs) enable open communication between applications and storage.
- Storage tiering caches frequently accessed data or places it on solid state drives (SSDs) to support ultra-fast analytics.
- Centralized metadata describes objects such as tables and files, as well as their formats, locations, security requirements, and other characteristics.
- The security framework authenticates who data consumers are, authorizes what they are allowed to do, and audits what they do.
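To make the unified namespace concrete, here is a minimal sketch of the idea in Python. It maps logical paths to physical store URIs so that applications never hard-code data locations. The class and method names are illustrative assumptions, not the API of Alluxio or any real product.

```python
# Hypothetical sketch of a unified namespace: logical paths map to
# physical stores, so applications never hard-code locations.
# Class, method, and URI names are illustrative, not a product API.

class UnifiedNamespace:
    def __init__(self):
        self.mounts = {}  # logical prefix -> physical URI prefix

    def mount(self, logical_prefix, physical_uri):
        """Attach a physical store (HDFS, S3, GCS, ...) under a logical path."""
        self.mounts[logical_prefix] = physical_uri

    def resolve(self, logical_path):
        """Translate a logical path to its physical location."""
        # Longest-prefix match so nested mounts shadow parent mounts.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if logical_path.startswith(prefix):
                return self.mounts[prefix] + logical_path[len(prefix):]
        raise KeyError(f"no mount covers {logical_path}")

ns = UnifiedNamespace()
ns.mount("/customers", "hdfs://onprem-nn:8020/warehouse/customers")
ns.mount("/merchants", "s3://acme-merchant-data")

print(ns.resolve("/customers/2024/records.parquet"))
# hdfs://onprem-nn:8020/warehouse/customers/2024/records.parquet
```

An application asks only for `/customers/...`; whether that resolves to an on-premises HDFS cluster or a cloud object store is a configuration detail the orchestration layer owns.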
Architectural View of Data Orchestration
The devil, of course, lies in the details. So let’s explore use cases to understand what this looks like in practice. The four use cases are analytics in hybrid environments, workload bursts, analytics across clouds, and platform expansions.
Analytics in hybrid environments
Many enterprises base their analytics projects on the cloud, but incorporate data that resides both in the cloud and on premises. Alternatively, enterprises might collect data on the cloud but process it using compute engines on premises. Data orchestration helps analytics teams query and manipulate data across hybrid environments like these as if the storage and compute elements were all in the same location.
For example, a financial-services organization might have customer records in the Hadoop Distributed File System (HDFS) on premises and merchant records in Amazon S3. To reduce fraud, the data team might need to correlate data from both these sources, as well as third-party credit agencies, in real time before processing large transaction requests.
The financial-services organization uses the storage tiering capabilities of its data orchestration tool to pre-fetch recent transaction histories to cache within Amazon EC2. When a customer requests a large transaction, such as a $1,000 ATM withdrawal, a TensorFlow-based machine learning (ML) model uses the EC2 compute cluster to correlate the various data points and assess the fraud risk of this anomalous transaction. Data orchestration provides the unified namespace that connects the ML model and EC2 cluster to both HDFS and S3 storage, using open APIs, centralized metadata, and a security framework.
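The storage-tiering idea above can be sketched as a small hot cache that sits next to compute and is warmed ahead of expected queries. This is a simplified illustration, assuming a `fetch_remote` function that pulls from the remote store; the names and capacity are hypothetical.

```python
# Hypothetical sketch of storage tiering: keep recent transaction
# histories in a small hot cache next to compute, and fall back to
# the remote store (S3/HDFS) on a miss. Names are illustrative.

from collections import OrderedDict

class HotCache:
    def __init__(self, fetch_remote, capacity=1000):
        self.fetch_remote = fetch_remote   # pulls from remote storage on a miss
        self.capacity = capacity
        self.tier = OrderedDict()          # LRU order: oldest first

    def prefetch(self, keys):
        """Warm the cache ahead of expected queries (e.g. recent customers)."""
        for key in keys:
            self.get(key)

    def get(self, key):
        if key in self.tier:
            self.tier.move_to_end(key)     # mark as recently used
            return self.tier[key]
        value = self.fetch_remote(key)     # slow path: remote storage
        self.tier[key] = value
        if len(self.tier) > self.capacity:
            self.tier.popitem(last=False)  # evict least recently used
        return value

calls = []
def fetch(key):
    calls.append(key)
    return f"history:{key}"

cache = HotCache(fetch, capacity=2)
cache.prefetch(["cust-1", "cust-2"])
cache.get("cust-1")                        # served from the hot tier
print(len(calls))                          # 2 remote fetches, not 3
```

Pre-fetching pays the remote-read cost once, before the fraud check runs, so the latency-sensitive scoring path reads only from the local tier.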
Workload bursts
Many analytics projects need to support extra users, lower latency, or higher throughput at certain times. Quarter-end financial reporting, real-time analysis of Cyber Monday pricing, and training of deep-learning models to recognize images are all examples of projects that temporarily increase processing requirements. Data orchestration helps accommodate workload bursts like these by spinning up remote compute capacity on demand. This capacity might come from spare servers on premises or, more likely, elastic cloud compute clusters.
For example, a chip manufacturer might need to train PyTorch-based ML models to recognize images of semiconductor wafers. To identify production errors, the data team might need to train their models on thousands of images that reside in a Cloudera Hadoop data lake on premises. This requires more compute capacity than is available on premises. The data team uses the unified namespace within their data orchestration tool to connect the PyTorch models and Azure Virtual Machine (VM) clusters to the data lake. The data orchestration layer also provides the supporting APIs, metadata, and security framework.
Analytics across clouds
As enterprises mature with their cloud strategies, they tend to diversify and adopt a second cloud provider. They might do this to reduce compute charges, avoid vendor lock-in, meet data sovereignty requirements tied to a certain location, or take advantage of specialized analytics and AI/ML tools offered by just one provider. In these cases, the data platform engineer can use data orchestration to have two or more cloud platforms process data, wherever it resides. They can avoid migrating data or rewriting applications.
For example, a B2B software company might have European customer records on Google Cloud Storage in Brussels and North American customer records on Amazon S3 in Phoenix. Distributing data in this fashion enables them to meet sovereignty requirements and take advantage of local price differences between Google and Amazon. The B2B software company uses data orchestration to query this data for global analytics projects such as revenue dashboards, financial reports, and comparisons of sales performance between regions. Data orchestration unifies distributed data and helps optimize both performance and cost.
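One way to picture the cross-cloud pattern: raw records stay in the region that satisfies their sovereignty rules, each cloud computes its own partial aggregate, and only the small results cross regions. The sketch below illustrates this under assumed dataset names, store URIs, and a stand-in query function; none of these reflect a real deployment.

```python
# Hypothetical sketch of cross-cloud analytics: raw data stays in its
# sovereignty region, each cloud computes a partial aggregate, and
# only small results move between regions. All names are illustrative.

RESIDENCY = {
    "customers_eu": "gcs://b2b-eu-customers",   # Google Cloud, Brussels
    "customers_na": "s3://b2b-na-customers",    # Amazon S3, Phoenix
}

def regional_revenue(store_uri):
    """Stand-in for a query pushed down to the regional compute cluster."""
    sample = {"gcs://b2b-eu-customers": 120_000,
              "s3://b2b-na-customers": 95_000}
    return sample[store_uri]

def global_revenue():
    """Merge per-region partial aggregates; raw data never leaves its region."""
    return sum(regional_revenue(uri) for uri in RESIDENCY.values())

print(global_revenue())  # 215000
```

The global revenue dashboard sees one number, while each region's records remain on the cloud and in the jurisdiction where they are required to stay.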
Data orchestration also can help this B2B software company configure a multi-tenant environment for advanced analytics projects. Each data science team has its own compute cluster, isolated from the others, with its own SLAs and billing arrangements.
Platform expansions
When a BI report, AI/ML application, or other analytics project gains traction with an initial team or business unit, demand can snowball across the enterprise. The data platform engineer might need to add users, applications, compute engines, datasets, and data stores, any or all of which might reside in different locations. They also might support a platform expansion by migrating datasets from legacy data centers to new cloud platforms. Data orchestration helps these new, changing, or moving elements find and work with each other.
For example, an e-commerce company might need to expand its operations into Latin America. To ensure the right performance levels for customers in major markets such as Brazil, this e-commerce company might need to add local Azure VM clusters in São Paulo. Their data team uses data orchestration to connect these compute engines, along with local versions of website content, to Azure Blob Storage in Dallas. Data orchestration helps add these elements to a unified namespace with minimal reconfiguration of existing elements. Data orchestration also can help the e-commerce company migrate data over time to new storage in São Paulo.
These four examples demonstrate the breadth of possibilities that data orchestration creates. The next and final blog in our series will offer guiding principles to increase the odds of success.
About the author
Kevin Petrie is the VP of Research at Eckerson Group. Kevin’s passion is to decipher what technology means to business leaders and practitioners. He has invested 25 years in technology, as an industry analyst, writer, instructor, product marketer, and services leader. A frequent public speaker and accomplished writer, Kevin has a decade of experience in data management and analytics. He launched, built, and led a profitable data services team for EMC Pivotal in the Americas and EMEA, implementing data warehouse and data lake platforms for Fortune 2000 enterprises. More recently he ran field training at the data integration software provider Attunity, now part of Qlik.
Kevin has co-authored two books: Streaming Change Data Capture: A Foundation for Modern Data Architectures (O’Reilly, 2018) and Apache Kafka Transaction Data Streaming for Dummies (Wiley, 2019). He also serves as data management instructor at eLearningCurve. Kevin has a B.A. from Bowdoin College and an MBA from the University of California at Berkeley. A bookworm and outdoor fitness nut, Kevin enjoys kayaking, mountain biking, and skiing with his wife and three boys.