Kubernetes, Alluxio and the disaggregated analytics stack

TL;DR: First, the news: Alluxio support for K8s Helm charts is now available, and K8s is a certified environment for Alluxio. Now the takeaway: Alluxio brings back data locality for the disaggregated analytics stack in K8s. How? Read on.

There’s no arguing with the rise of containers in real-world deployments over the past few years. Containers simplify running applications in any environment, and Kubernetes further transforms the way software is deployed and scaled, regardless of environment. In fact, Kubernetes is increasingly seen as a key technology that enables easy resource orchestration not only in the data center but also in hybrid and multi-cloud environments. While containers and Kubernetes work exceptionally well for stateless applications like web servers, and even for completely self-contained databases like MongoDB, Couchbase and others, the stack looks a bit different in the world of advanced analytics and AI.

The modern analytical stack is a highly disaggregated stack. Unlike traditional databases or data warehouses, the new stack is split apart.

Pick a data lake or two or three to store data (S3, GCS, HDFS, etc.)
Pick a computational framework to analyze data (Apache Spark, Presto, Hive, TensorFlow etc.)
Make sure all the other dependencies like the catalog service are available (Hive Metastore, AWS Glue, KMS, etc.)

Challenges running the disaggregated analytics stack in K8s

Kubernetes greatly simplifies deploying so many distributed systems together, and over time advanced analytics running on K8s clusters will become the norm. But a few critical gaps still need to be closed to make this modern analytical stack effective.

Challenge #1 - No shared data access / caching layer in the K8s cluster

K8s is a fantastic container orchestration technology, and with tools like Helm charts, operators and more, deployment can be greatly simplified. However, data-intensive workloads like advanced analytics typically need to share data between jobs to be effective, so that the output of one job can easily be accessed by the next. Without a data access / caching layer, that data has to be written back to the data lake and then read back into the K8s cluster again, which significantly slows down data pipelines.

Challenge #2 - Lost data locality

With data stored in S3 or other cloud object stores, or on-prem in Hadoop, users have two options for running analytics within the K8s cluster. Data either needs to be accessed remotely (meaning poor performance) or needs to be manually copied into the K8s cluster (meaning a lot of additional DevOps and management work on a per-workload basis). Copying also carries the burden of keeping those copies in sync with the source, which can be hard. The ideal solution is for data locality to be recreated in this disaggregated stack.

Challenge #3 - No data elasticity for elastic compute

The beauty of K8s is the flexibility it brings to even the most complex compute workloads: scale up, scale down, upgrade, restart and so on, based on need and demand. But data-intensive workloads still depend on data being available to compute. When compute scales in, out, up or down, the data within K8s needs to be able to do the same in order to take advantage of the flexibility K8s brings.

Data Orchestration can solve these challenges by syncing data into the K8s cluster and allowing for seamless in-memory data access and flexibility to share data across jobs and scale in or out as needed.
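To make the effect of a shared data access / caching layer concrete, here is a minimal toy sketch in Python. It is not Alluxio's API: the `DataLake` and `CacheLayer` classes, the `run_pipeline` function and the keys are all hypothetical names invented for illustration. It simply counts how many trips to the remote data lake a two-step pipeline makes with and without an in-cluster cache.

```python
class DataLake:
    """Toy stand-in for a remote store such as S3; counts remote round trips."""

    def __init__(self, objects=None):
        self.objects = dict(objects or {})
        self.remote_reads = 0
        self.remote_writes = 0

    def read(self, key):
        self.remote_reads += 1
        return self.objects[key]

    def write(self, key, value):
        self.remote_writes += 1
        self.objects[key] = value


class CacheLayer:
    """Hypothetical read-through cache living inside the cluster."""

    def __init__(self, lake):
        self.lake = lake
        self.local = {}

    def read(self, key):
        # Serve the in-cluster copy when present; otherwise fetch it once.
        if key not in self.local:
            self.local[key] = self.lake.read(key)
        return self.local[key]

    def write(self, key, value):
        # Keep intermediate results in the cluster for the next job.
        # Persisting them back to the lake is deliberately omitted here.
        self.local[key] = value


def run_pipeline(store):
    """Job 1 cleans raw events; job 2 reads job 1's output and the raw data."""
    raw = store.read("raw/events")
    store.write("tmp/cleaned", [e for e in raw if e >= 0])
    cleaned = store.read("tmp/cleaned")
    return sum(cleaned), len(store.read("raw/events"))


# Same pipeline, once directly against the lake and once through the cache.
direct = DataLake({"raw/events": [3, -1, 4]})
direct_result = run_pipeline(direct)

backing = DataLake({"raw/events": [3, -1, 4]})
cached_result = run_pipeline(CacheLayer(backing))

print("direct:", direct.remote_reads, "reads,", direct.remote_writes, "writes")
print("cached:", backing.remote_reads, "reads,", backing.remote_writes, "writes")
```

With this toy data, the direct path makes three remote reads and one remote write, while the cached path makes a single remote read and no remote writes, and both return the same result. A real data orchestration layer also has to handle persistence, eviction and consistency, none of which this sketch models.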

The news

Alluxio has had a Docker container for a while, but with Alluxio version 2.1, Kubernetes becomes a first-class environment for Alluxio, with advanced testing and certification on K8s. We are now seeing more production deployments of Alluxio alongside compute frameworks like Presto and Spark in K8s.

Also new with Alluxio version 2.1, Alluxio is available for deployment via Helm Charts.

What are Helm Charts?

Helm helps you manage Kubernetes applications — Helm Charts help you define, install, and upgrade even the most complex Kubernetes application. Charts are easy to create, version, share, and publish — so start using Helm and stop the copy-and-paste. Learn more here: https://helm.sh

Get started

To learn more about deploying Alluxio with Helm Charts, read the docs.

You can also get started by trying out Alluxio with our Docker sandbox tutorial!
