Alluxio is a proud sponsor and exhibitor of Spark+AI Summit in San Francisco. If you missed the conference, don’t worry, we’ve got you covered!

What’s Spark+AI Summit? It’s the world’s largest conference focused on Apache Spark, the open source project that is Alluxio’s older cousin from the same lab (UC Berkeley’s AMPLab, now RISElab).
Overview of the Conference by the Numbers
- Spark+AI Summit, originally Spark Summit, started in 2013 with around 200 attendees. This was its 6th year, and by our observation there were over 3,000 attendees!
- Of those 3,000+ attendees, we had more than 1,500 interactions and over 500 in-depth conversations with folks already using Alluxio or interested in learning about it
- 100 lucky attendees won our drones!
What We Learned
- Adopting a cloud strategy is a top priority for most organizations at the event
- Many organizations are experiencing challenges with hybrid cloud because they cannot efficiently access data across the public cloud and their own data warehouse
- Machine learning is on the rise, but SQL queries over big data are still the bread and butter of most organizations
- Kubernetes is changing the landscape of big data analytics. In the next 3-6 months, we will see a wave of organizations move to deploying big data workloads with container orchestration systems
- Attendees love to win drones ;) Find us at the next event: Strata Data Conference in New York

Reasons to try the Apache Spark, Alluxio, and S3 Stack
- This stack is cloud-native
- Apache Spark and Alluxio are open source
- S3 is cost-effective and scalable, driving down DevOps costs without sacrificing performance
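To make the stack above concrete, here is a minimal sketch of how an application might point Spark at data cached by Alluxio instead of reading S3 directly. The helper below is hypothetical (not from the post), and it assumes the S3 bucket is mounted at the root of the Alluxio namespace and that the Alluxio master runs at `master:19998` (the default port):

```python
# Hypothetical helper: translate an s3:// URI into the corresponding
# alluxio:// URI, assuming the bucket is mounted at the Alluxio root.
def to_alluxio_uri(s3_uri: str,
                   alluxio_master: str = "alluxio://master:19998") -> str:
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    # Drop the bucket name: with the bucket mounted at "/", only the
    # object key remains in the Alluxio namespace.
    _bucket, _, key = s3_uri[len(prefix):].partition("/")
    return f"{alluxio_master}/{key}"

# With Spark configured with the Alluxio client jar, the same DataFrame
# code then reads through the Alluxio cache, e.g.:
#   df = spark.read.parquet(to_alluxio_uri("s3://my-bucket/events/2019/"))
```

Because only the URI scheme changes, existing Spark jobs can switch between reading S3 directly and reading through Alluxio without any other code changes.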
Learn more: 10X Acceleration of Spark with Alluxio Case Study, Get started with Spark and Alluxio in 5min, Download Alluxio
All of the sessions are recorded and will be viewable here.
Thanks to everyone for stopping by the Alluxio booth and the great conversations!

Additional resources:
- Community office hour (virtual): Running Apache Spark with Alluxio on Amazon EMR
- Got questions? Chat with Alluxio experts on Slack

Coupang, a Fortune 200 technology company, manages a multi-cluster GPU architecture for their AI/ML model training. This architecture introduced significant challenges, including:
- Time-consuming data preparation and data copy/movement
- Difficulty utilizing GPU resources efficiently
- High and growing storage costs
- Excessive operational overhead maintaining storage for localized data silos
To resolve these challenges, Coupang’s AI platform team implemented a distributed caching system that automatically retrieves training data from their central data lake, improves data loading performance, unifies access paths for model developers, automates data lifecycle management, and extends easily across Kubernetes environments. The new distributed caching architecture has improved model training speed, reduced storage costs, increased GPU utilization across clusters, lowered operational overhead, and enabled training workload portability, delivering 40% better I/O performance compared to parallel file systems.

Suresh Kumar Veerapathiran and Anudeep Kumar, engineering leaders at Uptycs, recently shared their experience of evolving their data platform and analytics architecture to power analytics through a generative AI interface. In their post on Medium, titled Cache Me If You Can: Building a Lightning-Fast Analytics Cache at Terabyte Scale, Veerapathiran and Kumar provide detailed insights into the challenges they faced (and how they solved them) while scaling an analytics solution that collects and reports on terabytes of telemetry data per day as part of Uptycs’ Cloud-Native Application Protection Platform (CNAPP) solutions.