In a recent Alluxio-hosted virtual tech talk, Hyun Jun Baek, Staff Backend Engineer at Coupang, presented "How Coupang Leverages Distributed Cache to Accelerate ML Model Training." This blog post summarizes key insights from Hyun's presentation on Coupang's approach to distributed caching and how it has transformed their multi-region, hybrid cloud machine learning platform.
TL;DR: Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:
- Time-consuming data preparation and data copy/movement
- Difficulty utilizing GPU resources efficiently
- High and growing storage costs
- Excessive operational overhead maintaining storage for localized data silos
To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.
About Coupang
Coupang is a Fortune 200 technology company listed on the NYSE that provides retail, restaurant delivery, video streaming, and fintech services to customers around the world under brands that include Coupang, Coupang Eats, Coupang Play and Farfetch.
AI / ML Platform at Coupang
Machine learning impacts every aspect of commerce at Coupang, enhancing customer experiences across various areas, including product catalog, search and recommendation, pricing, robotics, inventory management, and fulfillment.

Coupang's AI/ML platform offers several core services, including notebooks and ML pipeline authoring, model training, model inference, monitoring and observability, and training and inference clusters.

Hybrid & Multi-Region Compute & Storage
Coupang’s platform team chose to use both AWS multi-region and on-prem GPU clusters to meet internal demand and requirements for compute resources, efficiency, high I/O throughput, developer experience, and cloud cost optimization.
This hybrid and multi-region approach helps Coupang navigate global GPU shortages to meet the high demand for GPU resources required for machine learning training.

This diagram illustrates the deployment of GPU AI/ML training clusters across a hybrid and multi-region infrastructure. The data lake in the AP region serves as the source of truth for training data. The GPU training clusters are multi-region, including both cloud and on-premises environments.
In each cluster, there is a different set of compute and storage components. For instance, in the cloud, they use a managed Kubernetes service, whereas on-premises, they deploy vanilla Kubernetes.
Challenges with the Multi-Cluster GPU Architecture
This architecture had several challenges:
1. Required Preparation Step (Copy & Validation) Before Training Jobs
With this distributed GPU architecture, Coupang had to introduce a preparation step before scheduling the training jobs. Users had to copy training data from the data lake (object storage) to the cluster where their training jobs were scheduled to run. This process was not only time-consuming but also unreliable, resulting in delays, particularly when transferring large amounts of data.
The need to copy training data to the job’s GPU cluster also made it difficult to fully utilize GPU resources across the distributed infrastructure. For example, if a training job was initially assigned to a cluster in the US region but later needed to be reassigned to a different cluster, the training data had to be copied to the new cluster before the job could start.
2. Underutilized GPU Resources due to I/O Bottlenecks
After the training data was copied to the same region as the GPU cluster, it was stored in lower-performance storage solutions. Because this storage could not deliver enough throughput to keep the GPUs saturated, GPU utilization remained low.
While the cloud service provider offers parallel file systems with better performance, they are more expensive and do not scale efficiently.
3. Growing Cost and Operational Complexity of Managing Data Silos
Copying data to multiple GPU clusters created data silos, which increased storage costs.
Maintaining this storage also carried significant operational overhead. ML engineers, the platform team’s internal users, often failed to delete unnecessary data, leading to frequent disk-space shortages. This not only added management complexity for the platform team but also caused training job failures.
New Architecture with Distributed Caching

Coupang’s new, distributed caching architecture resolved these challenges by:
- Providing instant data availability by automatically bringing data from the data lake to each cluster and eliminating the lengthy data preparation step.
- Improving GPU utilization across regions by providing flexibility to immediately schedule training jobs on any cluster without copying data.
- Delivering more I/O throughput and lower latency than file storage or parallel file systems.
In the cloud, the distributed caching layer is deployed on instance types with NVMe storage, while on-premises deployments use CPU nodes with NVMe disks.
Distributed caching reduces storage costs because it only needs to cache hot data rather than store the entire dataset. It also removes operational overhead: the cache automatically handles the data lifecycle, so ML engineers no longer need to manually delete unnecessary data.
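To make the lifecycle point concrete, here is a toy sketch of a capacity-bounded LRU policy in Python. The talk does not describe the cache’s actual eviction strategy, so the policy, class, and names below are purely illustrative.

```python
from collections import OrderedDict

class BoundedLruCache:
    """Toy capacity-bounded cache: keeps hot pages and evicts the least
    recently used ones, so nobody has to clean up disk space by hand."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # object key -> cached page bytes

    def get(self, key: str):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None  # cache miss; caller fetches from the data lake

    def put(self, key: str, page: bytes):
        if key in self.entries:
            self.used_bytes -= len(self.entries.pop(key))
        self.entries[key] = page
        self.used_bytes += len(page)
        # Evict least recently used pages until the local capacity fits.
        while self.used_bytes > self.capacity_bytes and self.entries:
            _, evicted = self.entries.popitem(last=False)
            self.used_bytes -= len(evicted)
```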
The Kubernetes operator simplifies deploying and managing the distributed caching solution across their entire GPU architecture, ensuring consistent configurations and accelerating new cluster deployments.
How Does Distributed Caching Work?

This diagram illustrates how distributed caching is deployed.
FUSE pods expose a POSIX-compliant filesystem interface to training jobs. Their mount point is made available inside training job containers via hostPath volumes (e.g., /mnt/cache-fuse), allowing jobs to access both cached data and the underlying data lake directly, without code changes or any awareness of the caching internals.
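As a rough illustration of this mounting pattern, the sketch below uses the Kubernetes Python client to define a training Pod that mounts the node-local FUSE mount point through a hostPath volume. The image name and container-side mount path are hypothetical; only /mnt/cache-fuse comes from the example above.

```python
from kubernetes import client

# Illustrative only: a training Pod that mounts the node-local FUSE mount
# point (/mnt/cache-fuse) via a hostPath volume, so training code sees
# cached data as an ordinary POSIX directory.
cache_volume = client.V1Volume(
    name="cache-fuse",
    host_path=client.V1HostPathVolumeSource(path="/mnt/cache-fuse", type="Directory"),
)

training_container = client.V1Container(
    name="trainer",
    image="registry.example.com/ml/trainer:latest",  # hypothetical image
    command=["python", "train.py", "--data-dir", "/data"],
    volume_mounts=[client.V1VolumeMount(name="cache-fuse", mount_path="/data")],
)

training_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-training-job"),
    spec=client.V1PodSpec(containers=[training_container], volumes=[cache_volume]),
)
```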
Each FUSE pod forwards I/O requests to a set of backend worker pods, which handle the actual data reads and writes. These worker pods typically run on instances equipped with NVMe disks, enabling high-throughput access to a pool of local storage for hot data.
When a requested page is not present in the cache (a cache miss), the worker pods retrieve it from the underlying data lake. Once fetched, the data is stored locally for future use, significantly accelerating subsequent accesses.
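Conceptually, this is a read-through cache. The sketch below illustrates the idea in Python; it is not the actual worker implementation, and the directory and function names are hypothetical.

```python
import os

NVME_CACHE_DIR = "/mnt/nvme/cache"  # hypothetical local cache directory

def read_page(object_key: str, fetch_from_data_lake) -> bytes:
    """Read-through cache: serve from local NVMe if present,
    otherwise fetch from the data lake and keep a local copy."""
    local_path = os.path.join(NVME_CACHE_DIR, object_key.replace("/", "_"))
    if os.path.exists(local_path):           # cache hit: fast local read
        with open(local_path, "rb") as f:
            return f.read()
    data = fetch_from_data_lake(object_key)  # cache miss: pull from object storage
    os.makedirs(NVME_CACHE_DIR, exist_ok=True)
    with open(local_path, "wb") as f:        # store locally for future reads
        f.write(data)
    return data
```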
To maintain consistency and support service discovery, etcd pods manage mount tables and worker membership across the cache cluster. This ensures that data paths remain stable across deployments. For example, files in “bucket A” are always accessible at /data/bucket_a, regardless of cluster or node. This enables seamless portability of training scripts. Model developers (users of the platform) can run their training scripts wherever compute resources are available without modifying any data paths in their code.
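The practical effect for training code is that data paths are written once and work in every cluster. A minimal, hypothetical example (the Parquet file layout under /data/bucket_a is assumed purely for illustration):

```python
from pathlib import Path

# The same path works in every cluster because the cache layer maps
# "bucket A" in the data lake to /data/bucket_a on each mount.
DATA_DIR = Path("/data/bucket_a")

def list_training_files():
    # Hypothetical layout: training samples stored as Parquet files.
    return sorted(DATA_DIR.glob("**/*.parquet"))

for path in list_training_files():
    ...  # load and train; no cluster-specific paths or copy steps needed
```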
Benefits of Adopting Distributed Caching
Benefits for Model Developers
1. Instant Data Availability
For model developers, the new architecture offers immediate job execution: training jobs start right away, without waiting for data to be copied or pre-staged.
2. Seamless Access to Data Across Multiple Regions Without Code Change
Data is accessible through the same paths in all clusters, providing a unified data abstraction and seamless access across regions. This makes training code portable: users can run their scripts wherever compute resources are available without changing any code.
3. Improved GPU Utilization
During peak GPU hours, engineers can submit their training jobs to overflow GPU clusters without having to manually copy training data, ensuring higher overall GPU utilization.
4. Faster Training Jobs
According to performance tests, the distributed cache solution provides approximately 40% increased I/O performance compared to parallel file systems offered by a cloud service provider.
Benefits for Platform Engineers
1. Reduced Storage Cost and Operational Overhead
For platform engineers, the new architecture reduces storage costs by avoiding full-capacity storage purchases and eliminating duplicate copies of data-lake datasets across clusters.
No coordination is required for cache space cleanup, as the cache manages itself. The platform team also developed an internal tool that lets users pre-warm the cache themselves, improving I/O throughput during training.
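The internal tool itself was not shown in the talk. A minimal sketch of the pre-warming idea, assuming it simply reads dataset files through the cache-backed mount, might look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def warm_file(path: Path, chunk_size: int = 8 * 1024 * 1024) -> int:
    """Read a file end to end through the cache mount so its pages get cached."""
    read = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            read += len(chunk)
    return read

def prewarm(dataset_dir: str = "/data/bucket_a", workers: int = 16) -> None:
    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(warm_file, files))
    print(f"Pre-warmed {len(files)} files ({total / 1e9:.1f} GB) into the cache")
```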
2. Easy Expansion and Operation
The architecture is manageable with Kubernetes, making deployment, scaling, and maintenance across environments seamless.
Summary
Coupang's new distributed caching architecture offers numerous benefits, including faster model training, improved efficiency, reduced storage costs, higher GPU utilization, and lower operational overhead. Additionally, this new architecture enhances the model developer experience by allowing GPU resources to be utilized wherever available, eliminating the time-consuming extra steps required in the original architecture to prepare data for loading onto GPU clusters.
If you're interested in learning more about Alluxio’s distributed caching, schedule a demo with an expert or read more about Alluxio Distributed Caching.