Alluxio AI Infra Day 2024









Building reliable, high-performance AI/ML infrastructure can be challenging, especially on a constrained budget in a multi-GPU world: infrastructure teams have to leverage GPUs wherever they are available. This requires moving data across regions and clouds, which in turn makes remote data access slow, complex, and expensive. This white paper introduces the common causes of slow AI workloads and low GPU utilization, explains how to diagnose the root cause, and offers solutions to the most common cause of underutilized GPUs.
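One simple way to diagnose whether remote data access is the cause of low GPU utilization is to time the data-loading and compute phases of a training loop separately. The sketch below simulates this with sleeps; the function names and timings are illustrative, not from the white paper:

```python
import time

def profile_epoch(batches, load_batch, train_step):
    """Time data loading vs. compute for one epoch; returns the
    fraction of wall time spent waiting on data (GPU idle time)."""
    load_time = compute_time = 0.0
    for i in range(batches):
        t0 = time.perf_counter()
        batch = load_batch(i)      # remote read: often the bottleneck
        t1 = time.perf_counter()
        train_step(batch)          # forward/backward pass
        t2 = time.perf_counter()
        load_time += t1 - t0
        compute_time += t2 - t1
    return load_time / (load_time + compute_time)

# Simulated workload: remote reads take ~30 ms per batch while compute
# takes ~10 ms, so roughly three quarters of each step is data wait.
idle_frac = profile_epoch(
    batches=5,
    load_batch=lambda i: time.sleep(0.03),
    train_step=lambda b: time.sleep(0.01),
)
print(f"data-wait fraction: {idle_frac:.0%}")
```

A data-wait fraction well above zero indicates the GPUs are starved by I/O rather than compute-bound, which is the scenario caching is meant to address.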



This article introduces how to leverage Alluxio as a high-performance caching and acceleration layer atop hyperscale data lakes for queries on Parquet files. Without using specialized hardware, changing data formats or object addressing schemes, or migrating data from data lakes, Alluxio delivers sub-millisecond Time-to-First-Byte (TTFB) performance comparable to AWS S3 Express One Zone. Furthermore, Alluxio’s throughput scales linearly with cluster size; a modest 50-node deployment can achieve one million queries per second, surpassing the single-account throughput of S3 Express by 50× without latency degradation.
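The linear-scaling claim implies roughly 20,000 QPS per node (1M QPS across 50 nodes). A back-of-the-envelope sizing helper based on that figure, purely illustrative arithmetic rather than an Alluxio API:

```python
import math

# Implied by the paragraph above: 50 nodes sustain 1M queries/sec.
PER_NODE_QPS = 1_000_000 / 50

def nodes_for(target_qps: float) -> int:
    """Cluster size needed for a target QPS, assuming linear scaling."""
    return math.ceil(target_qps / PER_NODE_QPS)

print(nodes_for(1_000_000))  # 50
print(nodes_for(250_000))    # 13 (12.5 rounded up)
```

Linear scaling means capacity planning stays a simple division; there is no per-account throughput ceiling to engineer around, unlike a fixed S3 request-rate quota.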




In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for leveraging distributed caching to power search and recommendation model training infrastructure.



By leveraging Alluxio Distributed Cache, RedNote eliminated the storage bottlenecks that caused model training times to exceed SLAs, accelerated cross-cloud model distribution, and lowered model distribution costs.



A publicly traded, top-10 global e-commerce company leverages Alluxio Enterprise AI to accelerate training of its search and recommendation AI models and cut AWS S3 API and egress charges by over 50%.