
We are delighted by the success of the inaugural Data Orchestration Summit on Nov. 7, 2019! Organized by Alluxio, this one-day event was sold out with nearly 400 attendees! Data engineers, cloud engineers, and data scientists joined talks from 24 industry leaders from around the globe, who shared their experiences building cloud-native data and AI platforms. All session recordings and slides are now available.
Key Announcements
Haoyuan Li, founder and CTO of Alluxio, opened the summit with his talk - Orchestrate a Data Symphony - where he discusses the key challenges and trends impacting data engineering in relation to building modern data and AI platforms, and explores the concept of Data Orchestration.
In the Alluxio tech talks, founding engineers Calvin Jia, Bin Fan, and Gene Pang dive into the key open source features of the Alluxio 2 series, community updates, and the latest innovations bringing open source Alluxio into the world of structured data.
Session highlights
The featured talks for the Summit highlighted how leading companies architect their data and AI platforms through the data orchestration approach, leveraging open source technologies such as Alluxio, Apache Spark, Presto, and more. Some session highlights include:
- Orchestrate a Data Symphony - Haoyuan Li, Alluxio
- Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs - Ashish Tadose, Walmart
- How to Run Fast Presto Analytics with Alluxio in Cloud - a Production Experience - Danny Linden, Ryte
- Alluxio tech talks: What’s New in Alluxio 2 - Calvin Jia & Bin Fan, and Alluxio Innovations for Structured Data - Gene Pang
- Open Source Panel: how to create an open source project - Ben Lorica, O’Reilly; Tobi Knaup, D2iQ; Maxime Beauchemin, Preset; Haoyuan Li, Alluxio
- Data Orchestration for Analytics and AI workloads at DBS Bank - Carlos Queiroz, Development Bank of Singapore (recording will be available soon)
What's next?
- Join the conversations on the community slack channel!
- Given the strong interest, we’re bringing back the hands-on lab, so stay tuned!

Cheers!
Amelia and Bin
Data Orchestration Summit Co-Chairs

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:
- Time-consuming data preparation and data copy/movement
- Difficulty utilizing GPU resources efficiently
- High and growing storage costs
- Excessive operational overhead maintaining storage for localized data silos
To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, enabled training workload portability, and delivered 40% better I/O performance compared to parallel file systems.
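The post does not detail how the cache is exposed to training jobs, but a common pattern is to mount the cache as a POSIX path inside each Kubernetes pod, so model code reads local-looking files while the cache layer transparently fetches from the central data lake on a miss. The sketch below illustrates that pattern with PyTorch; the mount path, file layout, and CachedTrainingDataset class are hypothetical examples, not Coupang's actual implementation.

```python
# Minimal sketch: a PyTorch Dataset that reads training samples through a
# distributed-cache mount instead of going to the data lake directly.
# The mount path "/mnt/cache/training-data" and the *.pt file layout are
# assumptions; any POSIX-style cache mount (e.g., a FUSE layer) would
# look the same to this code.
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader


class CachedTrainingDataset(Dataset):
    """Loads serialized tensors from a cache-backed mount point.

    On a cache hit the read is served from nearby storage; on a miss the
    cache layer is expected to pull the object from the central data lake,
    so training code never talks to the lake directly.
    """

    def __init__(self, root: str = "/mnt/cache/training-data"):
        self.files = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int):
        # Plain POSIX read via the cache mount.
        sample = torch.load(self.files[idx])
        return sample["features"], sample["label"]


if __name__ == "__main__":
    loader = DataLoader(CachedTrainingDataset(), batch_size=64, num_workers=8)
    for features, labels in loader:
        pass  # training step would go here
```

Because the access path is just a file path, the same training code runs unchanged on any cluster where the cache is mounted, which is what makes the workloads portable across Kubernetes environments.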

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium, titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) while scaling an analytics solution that collects and reports on terabytes of telemetry data per day as part of the Uptycs Cloud-Native Application Protection Platform (CNAPP).