Dyna Robotics Builds a Unified Multi-Cloud AI Data Platform with Alluxio

Customer Overview: Real-World Embodied AI

Dyna Robotics develops embodied AI systems that learn directly from physical interaction. Unlike models trained on synthetic simulations, Dyna’s models are built from real-world demonstrations—robots folding towels, handling textiles, and performing dexterous manipulation tasks in production environments.

(Source: Dynamism v1 (DYNA-1) Model: A Breakthrough in Performance and Production-Ready Embodied AI)

Each robot session generates synchronized multi-camera video and high-frequency joint telemetry, packaged into HDF5 (H5) trajectory files. At scale, this results in tens of thousands of files and tens of terabytes of new training data every day.
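A trajectory file of this kind can be sketched as follows. The group and dataset names below are illustrative, not Dyna's actual schema; the example builds a tiny in-memory H5 file with h5py so the layout is concrete.

```python
# Sketch of one robot-session trajectory file: synchronized camera frames
# plus high-frequency joint telemetry. Field names are illustrative only.
import h5py
import numpy as np

# driver='core' with backing_store=False keeps this example off disk.
with h5py.File("session.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("cameras/wrist", data=np.zeros((10, 64, 64, 3), dtype=np.uint8))
    f.create_dataset("cameras/overhead", data=np.zeros((10, 64, 64, 3), dtype=np.uint8))
    f.create_dataset("telemetry/joint_positions", data=np.zeros((100, 7), dtype=np.float32))
    f.attrs["task"] = "towel_folding"

    # A training loader would read frames and align telemetry by timestamp.
    frames = f["cameras/wrist"][:]
    joints = f["telemetry/joint_positions"][:]

print(frames.shape, joints.shape)
```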

The Challenge: Architectural Continuity vs. Training Performance

As Dyna expanded its GPU footprint—running clusters of NVIDIA H100 machines—training jobs began to stress the object storage access path. Each run might inspect 10,000 to 100,000 files during initialization. Metadata checks and concurrent small-object reads created latency amplification. Even with prefetching logic in the training code, direct reads from object storage slowed training.
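A back-of-the-envelope calculation shows why per-file metadata checks dominate startup. The latency and concurrency figures below are illustrative assumptions, not measured values from Dyna's clusters.

```python
# Illustrative estimate of job-initialization time spent on metadata checks.
def init_metadata_seconds(n_files, rtt_ms, concurrency):
    """Wall-clock time to issue one metadata request per file."""
    return n_files * (rtt_ms / 1000.0) / concurrency

# 100,000 files at an assumed ~30 ms object-storage round trip vs. ~1 ms
# when metadata is served from a local cache, at 32-way concurrency:
slow = init_metadata_seconds(100_000, 30, concurrency=32)  # direct object storage
fast = init_metadata_seconds(100_000, 1, concurrency=32)   # cached metadata
print(f"direct: {slow:.0f}s, cached: {fast:.0f}s")
```

Even generous client-side concurrency leaves minutes of pure metadata latency at the start of every run, which is the "latency amplification" described above.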

To compensate, the team briefly introduced a self-managed NFS layer. Datasets were copied from GCS to NFS servers to improve locality.

But this approach introduced new operational constraints:

  • Each persistent disk was capped at 64TB.
  • Data had to be manually sharded across servers.
  • As training concurrency increased, bandwidth contention led to periodic slowdowns — at times approaching 30% under heavy concurrency.
  • Dataset lifecycle management became an infrastructure responsibility.

Figure 1: The Original Architecture: Complex, Brittle, and Bottlenecked

The deeper issue was architectural coupling. Scaling compute required reshaping storage. Dyna didn't need another storage system; they needed a unified data access layer that could bridge their existing GCS pipeline to their high-performance compute.

To address these challenges, Dyna set clear architectural requirements: leverage the otherwise idle local NVMe SSDs attached to GPU instances, and prefer a software-only caching approach over introducing additional storage infrastructure. At the same time, GPU servers were occasionally terminated or crashed—sometimes with advance notice on GCP, and sometimes without warning on other cloud providers. Any solution would therefore need to tolerate frequent node churn and avoid coupling data availability to individual GPU instances.

With these constraints in mind, Dyna initially experimented with implementing per-node SSD caching logic themselves. However, treating each machine as an isolated cache led to inefficient utilization and duplicated data across nodes. Dyna also evaluated cluster file systems such as Lustre, but ultimately preferred a distributed caching architecture because it avoided full data migration and minimized operational risk.

First, Introducing a Transparent Caching Layer with Alluxio

Alluxio was deployed directly on Dyna’s GPU nodes as a distributed caching layer. Its pooled cache model resolved the inefficiency of isolated per-node caches by turning individual SSDs into a coordinated storage layer. Because Alluxio operates as a caching layer in front of object storage, Dyna retains the ability to fall back to direct object-storage access if needed, significantly reducing operational risk compared to fully managed cluster file systems.

Figure 2: The Alluxio Architecture: Fast, Scalable, and Multi-Cloud Native

1. Transparent Integration

Importantly, GCS remained the system of record. The robot -> GCS -> processing pipeline remained untouched. Alluxio simply presented a POSIX interface to the training code. To the researchers, the training path stayed the same; architecturally, the cluster gained near-local performance.
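In practice, transparency means the training code opens ordinary local paths on the Alluxio FUSE mount instead of object-storage URIs. The mount point and bucket name below are hypothetical placeholders.

```python
# With Alluxio's POSIX (FUSE) interface, training code reads cached data
# through an ordinary local path. Mount point and bucket are illustrative.
ALLUXIO_MOUNT = "/mnt/alluxio"          # assumed FUSE mount point
GCS_PREFIX = "gs://dyna-training-data"  # hypothetical bucket name

def to_local_path(gcs_uri: str) -> str:
    """Map a GCS URI onto the Alluxio mount so plain open()/read() work."""
    assert gcs_uri.startswith(GCS_PREFIX)
    return ALLUXIO_MOUNT + gcs_uri[len(GCS_PREFIX):]

path = to_local_path("gs://dyna-training-data/sessions/2024-06-01/traj_0001.h5")
print(path)  # /mnt/alluxio/sessions/2024-06-01/traj_0001.h5
```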

2. Pooling Local SSDs into a Global Cache

In an example deployment, Dyna operates 16 NVIDIA H100 GPU instances, each with 6TB of local SSD. By dedicating 5.5TB per node to Alluxio, they created an 88TB distributed cache pool. Unlike the earlier per-node caching attempts, which were inefficient, Alluxio transformed isolated disks into a coordinated data layer with automatic hot-data retention and LRU-based eviction.
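The pool sizing follows directly from the numbers above; the per-node headroom is an assumption about why 5.5TB rather than the full 6TB is dedicated to the cache.

```python
# Cache pool sizing from the figures in the text.
nodes = 16
ssd_per_node_tb = 6.0
cache_per_node_tb = 5.5  # assumed: ~0.5 TB/node left for OS and scratch space

pool_tb = nodes * cache_per_node_tb
headroom_tb = nodes * (ssd_per_node_tb - cache_per_node_tb)
print(pool_tb, headroom_tb)  # 88.0 8.0
```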

Next, Enabling Multi-Cloud GPU Mobility

As GPU demand grew, Dyna expanded into Together AI. This introduced a major cross-cloud challenge: reading training data from GCS across cloud boundaries would incur egress fees of approximately $88 per terabyte.
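To see why this matters at Dyna's scale, consider a rough cost estimate. The daily volume and re-read factor below are assumptions for illustration; only the $88/TB rate comes from the text.

```python
# Illustrative cross-cloud egress cost if GPU clusters outside GCP read
# training data directly from GCS on every pass.
egress_per_tb = 88.0        # approximate rate from the text
daily_new_data_tb = 30.0    # "tens of terabytes" per day, assumed value
reread_factor = 3           # assumed: each dataset is read a few times

daily_cost = daily_new_data_tb * reread_factor * egress_per_tb
print(f"${daily_cost:,.0f}/day")  # $7,920/day under these assumptions
```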

To avoid recurring cross-cloud penalties, Dyna introduced a neutral object storage layer using Cloudflare R2. While R2 introduces higher latency compared to co-located GCS buckets, Alluxio’s distributed cache absorbs this difference once datasets are warmed, ensuring steady-state training throughput remains unaffected.

The architecture evolved as follows:

  • Processed training data continues to be written to GCS.
  • New datasets are replicated once to R2.
  • GPU clusters in both GCP and Together AI read through Alluxio.
  • R2’s zero-egress pricing model eliminates repeated cross-cloud data transfer fees.
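The one-time replication step could be scripted with a generic sync tool such as rclone; the remote names, bucket names, account ID, and credentials below are placeholders, not Dyna's actual configuration.

```shell
# ~/.config/rclone/rclone.conf remotes (all values are placeholders):
# [gcs]
# type = google cloud storage
# [r2]
# type = s3
# provider = Cloudflare
# endpoint = https://<account-id>.r2.cloudflarestorage.com

# Replicate new datasets once; subsequent reads in both clouds go
# through Alluxio's cache, so the copy is a one-time cost.
rclone copy gcs:dyna-training-data r2:dyna-training-data --checksum
```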

Critically, the training code remains unchanged. In one environment, Alluxio reads from GCS; in another, it reads from R2. To the application layer, the interface is identical. This abstraction decouples compute from storage location.

As a result, Dyna can now allocate training workloads to whichever cloud offers the best GPU availability or pricing—without rewriting data pipelines, re-ingesting datasets, or reformatting storage. The storage plane stays constant. The compute plane becomes portable.

Beyond primary training datasets, Dyna also maintains a secondary shared bucket for experimental and researcher-generated data. That bucket is mounted via Alluxio in both clouds.

When a researcher uploads data to this shared location, it becomes automatically visible across environments. Alluxio handles caching locally in each cluster. Object storage credentials are centrally managed at the Alluxio coordinator, eliminating the need to distribute keys to every GPU node.

This design not only simplifies operations—it reduces security exposure and credential sprawl.

Operation and Optimizations in Production

  • Slurm & Docker Automation: Dyna runs on Slurm rather than Kubernetes. Alluxio workers are deployed as Docker containers. Custom startup scripts detect machine reboots and repopulate the cache automatically, maintaining high uptime in a dynamic environment.
  • Metadata Optimization: Since training involves checking status for up to 100,000 files, Dyna optimized metadata access by shifting to directory-level checks and utilizing an application-level Memcached layer to reduce job "cold start" times.
  • Secure Data Sharing: Object storage credentials are centrally managed at the Alluxio coordinator. Researchers can share experimental data via a separate shared bucket, which Alluxio makes immediately visible across clouds without distributing sensitive keys to every individual GPU node.
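The directory-level metadata optimization can be sketched as follows. A plain dict stands in for Memcached here; production code would issue the same get/set pattern against a memcached client, and the listing function is a hypothetical stand-in for an object-store LIST request.

```python
# Directory-level metadata shortcut: instead of stat-ing every file, cache
# one listing per directory. A dict stands in for Memcached in this sketch.
listing_cache = {}

def list_dir_cached(directory, list_fn):
    """Return the cached listing for a directory, fetching once on a miss."""
    if directory not in listing_cache:
        listing_cache[directory] = list_fn(directory)
    return listing_cache[directory]

calls = []
def fake_list(d):  # stand-in for a real object-store LIST request
    calls.append(d)
    return [f"{d}/traj_{i:04d}.h5" for i in range(3)]

a = list_dir_cached("sessions/2024-06-01", fake_list)
b = list_dir_cached("sessions/2024-06-01", fake_list)  # served from cache
print(len(calls), len(a))  # 1 LIST request instead of one check per file
```

One listing request per directory replaces tens of thousands of per-file status checks, which is what shrinks the job "cold start" time.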

Summary

By deploying Alluxio as a unified data plane, Dyna Robotics achieved measurable performance, operational, and strategic gains:

  • ~30% Training Slowdowns Eliminated
    Removed the periodic I/O bottlenecks that previously degraded performance by up to roughly 30% under heavy concurrency, ensuring stable, compute-bound training throughput.
  • Operational Complexity Removed
    Eliminated manual dataset import and sharding around 64TB disk limits, reduced NFS management overhead, and centralized storage credential handling.
  • True Multi-Cloud GPU Architecture Enabled
    Decoupled storage from compute, allowing workloads to move seamlessly between GCP and Together AI without re-ingesting or restructuring datasets.
  • Proven Fault-Tolerant Data Layer
    Maintained the ability to fall back to object storage when needed, reducing operational risk compared to full data migration into cluster file systems.

As a result, Dyna transformed its infrastructure from a storage-constrained pipeline into a scalable, portable AI data platform—allowing the team to focus on advancing embodied AI rather than managing data movement.
