GPUs Are Fast, I/O Is Your Bottleneck

This article was initially posted on ITOpsTimes.

Unless you’ve been living off the grid, the hype around Generative AI has been impossible to ignore. A critical component fueling this AI revolution is the underlying computing power, GPUs. The lightning-fast GPUs enable speedy model training. But a hidden bottleneck can severely limit their potential – I/O. If data can’t make its way to the GPU fast enough to keep up with its computations, those precious GPU cycles end up wasted waiting around for something to do. This is why we need to bring more awareness to the challenges of I/O bottlenecks.

GPUs are data-hungry

GPUs are blazingly fast, capable of trillions of floating-point operations per second (teraFLOPS, or TFLOPS), which is 10 to 1,000 times faster than CPUs. However, data must be in GPU memory before a GPU can operate on it. The faster you load data into GPU memory, the sooner the GPU can perform its computation.

Now, a new bottleneck is emerging – GPUs are increasingly starved by slow I/O. This is usually called being I/O bound, or a data stall. I/O stands for input/output: reading data from a source and writing it to a destination. Studies from Google and Microsoft have shown that up to 70% of model training time is taken up by I/O. Put another way, your GPUs can spend up to 70% of their time sitting idle, wasting your time and money.
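To put that figure in cost terms, here is a minimal sketch of the arithmetic. The function name and the hourly rate are assumptions made up for illustration, not real pricing, and the model simply treats every stalled hour as paid for but unused.

```python
def cost_per_useful_gpu_hour(hourly_rate: float, io_stall_fraction: float) -> float:
    """Price paid per hour of actual GPU compute when part of each hour is lost to I/O."""
    if not 0.0 <= io_stall_fraction < 1.0:
        raise ValueError("io_stall_fraction must be in [0, 1)")
    busy_fraction = 1.0 - io_stall_fraction
    return hourly_rate / busy_fraction


# Hypothetical instance price in dollars per hour (an assumption, not a quote).
ASSUMED_HOURLY_RATE = 30.0

for stall in (0.0, 0.3, 0.7):
    cost = cost_per_useful_gpu_hour(ASSUMED_HOURLY_RATE, stall)
    print(f"I/O stall {stall:.0%}: ${cost:.2f} per useful GPU hour")
```

Under this simple model, a 70% stall means each hour of useful GPU work costs roughly 3.3 times the listed hourly price.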

Let’s look at the typical machine learning pipeline. Training data usually lives in object storage. At the beginning of each training epoch, it is moved to the local storage of the GPU instance and finally delivered to GPU memory. Retrieving data over the network, copying data between storage tiers, and performing metadata operations all add to the duration of each training epoch.
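A back-of-the-envelope estimate shows how each hop in this pipeline contributes to epoch time. This is only a sketch: the dataset size and the bandwidth figures below are invented for illustration, the hops are assumed to run one after another, and metadata operations are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hop:
    name: str
    bandwidth_gb_per_s: float  # assumed sustained throughput of this hop


def transfer_seconds(dataset_gb: float, hops: List[Hop]) -> Dict[str, float]:
    """Seconds each hop needs to move the whole dataset once."""
    return {hop.name: dataset_gb / hop.bandwidth_gb_per_s for hop in hops}


# All numbers are hypothetical placeholders, not measurements.
hops = [
    Hop("object storage -> local storage", 1.0),
    Hop("local storage -> host memory", 3.0),
    Hop("host memory -> GPU memory", 20.0),
]
times = transfer_seconds(500.0, hops)
total = sum(times.values())
slowest = max(times, key=times.get)

for name, seconds in times.items():
    print(f"{name}: {seconds:.0f} s ({seconds / total:.0%} of data movement)")
print(f"slowest hop: {slowest}")
```

With these assumed numbers the network hop from object storage dominates, which is why the slowest tier is usually the first place to look.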

In the past, it was best to feed data to GPUs from local NVMe storage. Now, data is distributed across various locations, datasets have outgrown the local storage capacity of GPU instances, and GPU speed has increased while I/O has not kept pace. Sadly, many organizations may not even know that they are not getting the full potential out of their GPUs.

Why is it becoming even more important now?

Before we get into solutions, let’s talk about why this is all becoming a bigger deal today.

First, generative AI is having a moment right now. People around the world are getting excited, and businesses are taking notice. Organizations are in a hurry to get AI products to market, so they’re asking a lot from their AI infrastructure and want to see results fast. But it’s often only when they move AI platforms from pilot to production that they realize the infrastructure needs to be optimized.

Second, as the AI hype goes on, GPUs are hard to get. Most teams now ‘rent’ them in the cloud, for example through AWS’s EC2 P4 instances, which pack up to 8 NVIDIA A100 GPUs. Cloud training has become the new norm. Once upon a time, GPUs only processed local datasets, but now you have to move datasets around or copy data between regions and clouds to bring them closer to wherever the GPUs are. Developers often find available instances only in a region far from where they store their data. This is problematic because geo-separated compute and storage mean slow I/O.

Last but not least, there’s a higher demand for better results at a lower cost. Foundation models and deep learning models require many experiments to determine optimal parameters, and machine learning engineers are running more of them because more iterations tend to produce better final models. Meanwhile, organizations prioritize ROI and cloud cost optimization through practices like FinOps. This makes it urgent to resolve the I/O bottleneck and make better use of expensive GPUs.

Architectural considerations

We need to optimize I/O in such a way that the GPU never has to wait for data to perform its computations. Here are key considerations for machine learning and AI infrastructure engineers to get more out of GPUs:

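Load data in parallel – Use parallel data loaders, such as PyTorch’s DataLoader with multiple worker processes, to load, transform, and normalize datasets in parallel before feeding them to GPUs. In distributed training with DistributedDataParallel (DDP), each process can load its own portion of the dataset.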

Strategically cache data – Accelerate I/O by caching frequently used training data in a high-performance caching layer, like Alluxio, or directly in GPU memory.

Optimize storage format – Partition hot and cold data so that you avoid loading data you don’t need. Use columnar formats like Parquet to store and compress analytic data efficiently, which saves I/O bandwidth when the data is loaded into CPU memory.

Ingest data in real time with high-throughput frameworks – Collect data in parallel from many sources using Kafka, Kinesis, or similar data pipelines.

Increase mini-batch size – Larger mini-batches allow more efficient parallelization and better utilization of GPU compute power and GPU memory during training.

Shard data across GPUs – Distribute data across multiple GPU devices in a scale-out fashion to train models faster.

Continuous monitoring – Monitor the model training performance to identify and alleviate bottlenecks quickly. For example, you can use TensorBoard to see how much time is spent on data loading. Also pay attention to the performance of storage, including throughput, IOPS and metadata performance.
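Two of the considerations above, loading data in parallel and monitoring time spent on data loading, can be sketched together using only the Python standard library. This is an illustration under stated assumptions, not a production loader: slow_batches, prefetch, and train are hypothetical names, time.sleep stands in for both storage reads and the GPU step, and error handling in the background thread is omitted.

```python
import queue
import threading
import time


def slow_batches(n, load_seconds):
    """Stand-in for reading n batches from slow storage (hypothetical)."""
    for i in range(n):
        time.sleep(load_seconds)
        yield i


def prefetch(source, depth=4):
    """Read `source` in a background thread, keeping up to `depth` batches ready."""
    ready = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for item in source:
            ready.put(item)
        ready.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = ready.get()
        if item is done:
            return
        yield item


def train(batches, compute_seconds):
    """Consume batches; return (batches seen, share of wall time spent waiting for data)."""
    seen, waited = [], 0.0
    start = time.perf_counter()
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        waited += time.perf_counter() - t0  # time blocked on data loading
        time.sleep(compute_seconds)  # stand-in for the GPU training step
        seen.append(batch)
    return seen, waited / (time.perf_counter() - start)


seen_a, stall_a = train(slow_batches(20, 0.005), 0.01)
seen_b, stall_b = train(prefetch(slow_batches(20, 0.005)), 0.01)
print(f"without prefetch: {stall_a:.0%} of time waiting on data")
print(f"with prefetch:    {stall_b:.0%} of time waiting on data")
```

Because loading overlaps with compute in the second run, the measured wait share should drop noticeably, although the exact percentages depend on the machine. In a real training job the framework’s own loader and profiler would take the place of these stand-ins.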

Key takeaways

As deep learning models and datasets grow exponentially, it is critical to build scalable data pipelines that efficiently deliver data to on-premises or cloud GPUs. Be aware of I/O speed, and make sure that I/O does not become the bottleneck for valuable AI business outcomes. To get the most value from the GPU investment in your AI projects, your infrastructure teams should proactively consider these I/O optimizations.