Speeding Up the Atlas Supercomputing Platform with Fluid + Alluxio

Introduction to Unisound and Atlas AI Platform

Unisound is an artificial intelligence company focusing on Internet of Things services. Unisound’s AI technology stacks include the perception and expression capabilities of signals, voices, images, and texts, and the cognitive technologies such as knowledge, understanding, analysis, and decision-making, towards a multi-modal AI system. Atlas is the supercomputing platform supporting all kinds of AI applications including model training and reasoning inferencing.

Unisound has built the industry-leading GPU/CPU heterogeneous computing and distributed file system, called Atlas. This platform provides AI applications with high-performance computing and data access capabilities at a massive scale. Based on the Kubernetes open source architecture, the Unisound team has developed the core features and successfully built an AI supercomputing platform with a floating-point processing capacity of more than 10 PFLOPS (100 million times per second). The platform supports the main machine learning frameworks, and developers can efficiently research and develop core applications such as voice, NLP, big data, multimodal, etc. The platform also serves external customers such as SMBs and research institutions with customized computing and storage capabilities.

Problems and Challenges

On the Atlas platform, computation is decoupled from storage. At present, the interconnections among the storage servers, the computing servers, and between the computing and storage servers are 100GB InfiniBand.

We have built our own high-performance distributed file system, named Lustre, as the storage layer of the Atlas platform. Lustre consists of several petabytes of model training datasets. The Lustre distributed file system is compatible with the POSIX interface so that various deep learning frameworks can directly read data from Lustre. The separation between computing and storage enables them to scale independently, making overall architecture more flexible. However, such architecture encountered problems such as slow data access and network bandwidth bottlenecks. The challenges are as follows:

The I/O Bottleneck

As the number of users grows, there is a large increase in the bandwidth, metadata workload, and server workload with unchanged storage resources. In the storage cluster, there are many workloads running on the same single GPU node, competing for I/O resources. Thus, the entire training cycle gets longer, which greatly reduces the efficiency of the research and development.

Massive Small Files

The second challenge is about the training data set itself. In the noise reduction training, certain users would process terabytes of small files, which puts great pressure on the metadata service of the distributed file system, making it inefficient in reading data. The slow read lowers the overall utilization of the GPU, which increases the overall model training time.

Many Data Types

Many applications run on the Atlas platform with different data formats and file sizes, making it nearly impossible to use one single configuration to fit all services. Based on user behaviors, we found that the most used data is for model training, with the rest for model inference and CPU-intensive data generation applications.

Data Redundancy

The last challenge is the dataset replication on the platform. When the same dataset is used by different users in the same group or in different groups, multiple copies are stored, resulting in a waste of storage space.

The Previous Solution

We want to find a solution that is not only cost-effective but also requires the least architectural changes to deal with the I/O bottleneck and offload the metadata server. Below are several ways we explored.

Put Limitation on Bandwidth

Massive concurrent reads will push the bandwidth to reach the limit, causing storage system freezes or failures. We limit the bandwidth by putting limitations on the bandwidth of each client on the computing node and the UID/GID of each user. However, this method is not flexible and cannot fully utilize GPU resources. When there are two large I/O training jobs running on the same node, due to the bandwidth limitations, both training jobs are limited on reads. These limitations make GPU less utilized because it does not benefit from reading in parallel. We found that the utilization rate of GPU in this scenario is only 40%, which means hardware resources are wasted.

Aggregate Small Files to Large Files

We also took steps to deal with pressure put on the metadata by the massive number of small files. We estimated the number of small files by collecting the number inode of and the total storage size of each user to limit the quantity of the small files. Then, we implemented a series of data aggregation tools, allowing users to aggregate small files into large file formats such as lmdb and tfrecord.

Customize the Task Scheduler

To avoid too many jobs running on the same node at the same time, we customized the plug-ins of the task scheduler by adding scheduling policies that identify the resource usage of the computing node so that it can schedule jobs to idle nodes to avoid the I/O competition when running multiple jobs on the same node. However, this solution doesn’t work when all the computing nodes are overloaded because competition is inevitable.

Tiered Cache

In order to fully utilize the idle hardware resources and reduce the pressure on the storage system, we developed the V1.0 cache solution as a temporary solution. This solution can relieve the storage pressure to a certain extent, but the data management is not automated yet, so it can only be a temporary solution until we find an ultimate solution and build the new architecture.

The New Architecture

In 2020, the Unisound team started to deploy Alluxio and conducted a series of tests, including functionality tests and performance benchmarking. We found that Alluxio can meet our needs and solve our pain points effectively at a lower cost.

Alluxio Fuse provides a POSIX file system interface, which allows users to seamlessly use the distributed cache without changing the applications.
Alluxio supports different types of storage systems, including distributed file systems, object storage, etc. When we introduce new storage to our platform, Alluxio supports it very well, keeping our entire cache architecture stable.
Alluxio provides intelligent cache management. Alluxio’s tiered cache fully makes use of memory, SDD, or HDD, reducing the cost of data-driven applications with elastic scalability.
Alluxio supports Kubernetes or containers deployment, which is consistent with our existing technology stack;
Alluxio provides HA support to ensure the high availability of the distributed cache system

Our previous architecture completely decoupled computing and storage. By introducing Alluxio as a cache layer between computing and storage, users can enjoy fast data access because Alluxio brings the underlying storage to memory or local hard drives on each computing node. The entire platform utilizes the resources of both the distributed file system and the local hard disk.

When deploying Alluxio to production, we encountered problems such as access control and data mounting. Fluid provides a more cloud-native way to use Alluxio and manage data. Just like managing cloud resources, Kubernetes schedules and allocates cached data, which is better than the previous V1.0 cache.

Our ultimate architecture is: Alluxio is responsible for data migration and cache management, moving data from the underlying distributed file system to the local cache of the computing node, providing acceleration to the platform applications; Fluid is responsible for the orchestration of caching and applications. Based on Fluid, the platform can perceive caching and intelligently perform cache management without manual operations.

After implementing the new architecture, we integrated Fluid with our self-developed model training task submission tool, atlasctl, to hide the complexity on the user side as much as possible. Users can create a cache dataset by using atlasctl cache create and specifying parameters such as cache size and cache media, and so on. This tool allows users to pay more attention to the data and the application itself without having to understand the underlying cache infrastructure.

Configurations settings

We introduced Fluid + Alluxio as a new architecture to our platform. During the deployment, we encountered some problems in different scenarios. We immediately reached out to the community, and the community solved our needs in a timely manner. Here are the main features:

hostpath and nonroot Support
Multiple Mount Points Support
Cache Warmup
Performance Tuning

For more detailed settings and performance tuning, please refer to the full-length whitepaper.

Test Scenarios

We tested three scenarios according to the size of the data set. The first type is small files, with the size of a single file under 1M. The second is medium files in several hundred gigabytes, and the third is terabytes of large files.

Noise Reduction

This test uses the DLSE model generated by the Pytorch framework. The number of files in the dataset is about 500,000 and the total size is 183 GB. Memory is used as the cache of Alluxio.

We use a single machine with 10 GPU cards for this test. Based on Pytorch’s native DDP framework for multi-card communication, we compare the two scenarios: read directly from the distributed file system (Lustre), read from the Alluxio cache (Lustre+Alluxio), and read from Alluxio with cache warmup (Alluxio, warm).

We can see that, during the first read, using Alluxio with cache warmup (Alluxio, warm) is nearly 10x faster compared to reading directly from the distributed file system (Lustre). Because when Alluxio reads the first time, it needs to sync the metadata and cache data at the same time, the real advantage of caching is still not reflected yet. In the second read, since the data have all been brought into the cache, the performance depends on the cache hit rate of Alluxio. From the above test results, we can see that there is a significant speedup.

With a faster data read, we also improved overall GPU utilization. By monitoring the utilization, we found that the GPU usage kept at about 90% using Alluxio cache warmup. At the same time, we can store cached data in the memory, effectively offloading the underlying storage.

Optical Character Recognition (OCR)

The second test is a CRNN-based character recognition model based on the Pytorch framework. For more information, please refer to the full-length whitepaper.

Conclusion

By introducing the new architecture of Fluid + Alluxio, we have gained the following benefits:

Speedup model training: through the test results, we can see a significant speedup on training jobs. By achieving data locality, the platform can avoid network bottleneck or resource competition, thereby effectively accelerating the data access in the model training process.
Offload the underlying storage: by enabling local cache, the new architecture brings a better IOPS (Input/Output Operations Per Second), greatly reduces the workload pressure of the underlying storage system, and effectively improves its availability.
Increase GPU cluster utilization: by efficient I/O read, the platform eliminates the data read bottleneck, and also avoids GPU idling while waiting for data, thus improving the GPU utilization rate of the entire cluster.
Avoid I/O competition on the same node: the new architecture fully solves our previous pain points of I/O resource competition on the same node, the storage system I/O bottleneck, and the slow model training.
More efficient cache management: the new architecture is a more cloud-native way to manage the cache. Engineers have changed from simply loading data in memory to managing and monitoring the cache. Kubernetes scheduling can perceive the cache and perform corresponding policy allocations, making jobs more efficient and easier to manage.

Next Steps

Fluid + Alluxio has brought us a lot of benefits. We will keep working closely with the community and contribute to the following work in the future:

We will continuously provide feedback to the community with more test results in more scenarios, and iterate and optimize the performance of Alluxio.
We will test more data types, summarize findings, and provide best practices on performance tuning to the community.
We will add an intelligent cache scheduling feature to Fluid.

About the Authors

Dongdong Lv, Platform Architect at Unisound

Many years of experience in cloud-native open source communities
Responsible for ML platform architecture design, deep learning and AI modeling optimizations and accelerations
Research areas: high-performance computing, distributed file system, distributed caching, etc

Qingsong Liu, Algorithm Researcher at Unisound

Responsible for the R&D of ML platforms and algorithms
Research areas: cloud-native architecture, high-performance computing, voice and vision applications, machine learning algorithms, etc.