Building High-performance Data Access Layer for Model Training and Model Serving for LLM

Bringing a large language model from its initial training to deployment requires numerous systems and components. At Zhihu, we grappled with a multi-cloud, cross-region AI platform, requiring an efficient solution to facilitate the rapid training and delivery of models for production use cases. This led us to adopt Alluxio, the high-performance data access layer for LLM. This blog provides an in-depth look at Zhihu’s challenges, journey, and solution for LLM training and deployment. Through adopting Alluxio, we’ve significantly enhanced model training performance by 2 to 3 times and can deploy updated models every minute instead of hours or days. Also, our GPU utilization has doubled, infrastructure and operation costs have been halved, and we have established a resilient, efficient infrastructure capable of meeting our escalating AI demands.

1. About Zhihu

Zhihu (NYSE: ZH) is a leading online content community in China where people come to find solutions, make decisions, seek inspiration, and have fun. It provides a question-and-answer website and app that matches user questions with high-quality, trustworthy answers from other users. We have 400 million users, 100 million MAU, and 54 billion monthly views currently. 

As engineers at Zhihu, we continually build and maintain the backend infrastructure to meet business needs. Specifically, my team operates an AI platform that trains and serves models for core services in production, including Q&A, recommendation, and search functionality. Integrating AI-powered capabilities to assist users seeking information or specific answers is not just advantageous – it is essential.

Most recently, we have had a high-priority task to enable Large Language Models (LLMs) to address user queries on particular topics. These models extract and compile user-generated answers and elements from our vast question list, presenting users with a comprehensive summary of viewpoints. This feature aims to greatly enhance user experience and engagement, highlighting the importance of our AI platform to Zhihu’s overall business.

2. Zhihu’s AI Platform Overview

2.1 Machine Learning Pipeline Overview and Multi-Cloud Architecture

The above figure shows a machine learning pipeline we implemented to train, reuse, evaluate, manage, and deploy machine learning models. This pipeline starts with ingesting new training data and ends with receiving feedback on how the newly trained model performs.

This blog will focus on the model training, deployment, and inference phase in this pipeline, as illustrated in the diagram below.

To achieve high availability, capacity expansion, and cost efficiency, we’ve adopted an architecture to leverage multiple clouds located in different geographical locations.

  • Training clouds: Multiple GPU clusters are deployed for large-scale distributed training in different cloud regions.
  • Offline cloud (offline storage and training): The offline HDFS storage is the source of truth of persisted data, including data for analytics platform and AI platform. This cloud is also used for offline model training.
  • Online cloud (inference and online applications): This cloud hosts the model and serves user-facing downstream applications on Zhihu’s website/app, such as comments, answers, searches, and recommendations.

These clouds are interconnected via dedicated networks that serve a number of critical services for cross-cloud communication.

2.2 Model Training

Model training takes place in both clouds, and the input of massive datasets is stored on HDFS and object storage. The computation engine for large-scale distributed training is PyTorch and Spark. Once training is complete, models are persisted on HDFS as considerably large files. For example, the size of a single model may exceed tens or even hundreds of GB. 

To effectively train models, engineers focus on optimizing the model training time and GPU utilization rate.

2.3 Model Deployment

After models are trained, they must be transferred from the offline storage HDFS via the dedicated network and deployed in the online cloud to serve production queries. In addition, these models will be refreshed based on the latest user feedback in the pipeline. Swift deployment of these updated models to production systems is vital, as the speed significantly enhances the quality and relevancy of the inference results.

In this phase, the key goal is to keep stable traffic of the dedicated network bandwidth, as well as the speed of model updates.

2.4 Model Inference

During our model inference phase, downstream applications, such as comments, answers, and recommendations, are user-facing. 

This phase is latency-sensitive. The inference latency greatly impacts the quality of Zhihu’s downstream applications like Q&A and comment.

3. Requirements and Challenges of the AI Platform

3.1 Challenges in Model Training: Training Performance and Underutilized GPU

We’re particularly focused on the end-to-end time of model training. The main bottleneck we encounter is the low data throughput. Our highly concurrent training jobs demand high I/O to keep the GPU optimally utilized. The question arises: how do we achieve high data throughput to keep the GPU constantly busy?

Moreover, since we need to read large-scale training data from object storage, we would like to find a replacement for the s3fs-fuse that can offer POSIX-based access.

3.2 Challenges in Model Deployment: Model Update Speed, Overloaded HDFS, Network Congestion 

Serving the latest models from storage to the inference cluster is essential to our pipeline. Fast model updates are a must for us because it significantly influences the accuracy of model inference within production systems, thus impacting both the Zhihu user experience and our revenue generation. So, how do we ensure the latest, most updated models are accessible quickly when reading them cross-cloud?

Our model updates require high-concurrent reads. Hundreds or even thousands of containers would simultaneously access a single model file. The traffic volume can reach up to 1Tb/sec at peak times. This concurrent access can overload the HDFS storage.

Furthermore, network congestion is a concern. The size of LLM can reach tens or even hundreds of GB. During model serving, the model files must be accessed from the offline HDFS across the network to the online cloud. With tens to hundreds of containers concurrently reading files on HDFS, the network can easily be congested, disrupting other cross-cloud services and leading to a significant number of timeouts or failures.

4. Journey at Zhihu: From HDFS Replication to Union Store, to Alluxio

4.1 Five Years Ago: HDFS with Replication

Initially, our across-cloud model serving involved deploying a new HDFS cluster and copying data from the offline to the online cloud. The process comprised model training on the offline HDFS cluster, model replication to the online HDFS cluster, and then model reading by online containers for deployment. This solution, while easing dedicated line traffic, brought about issues such as high maintenance and development overhead, manual data clean-up, architectural pressure due to a triple replication system, and user overhead. These challenges led us to create our inhouse tool – UnionStore.

4.2 Two Years Ago: UnionStore, An In-house Storage Unification Tool

Our in-house tool, UnionStore, serves as a unified storage solution, providing a standard S3 protocol for HDFS data access and using object storage for cross-cloud caching. UnionStore is employed for both model training and deployment.

In model training, UnionStore functions as an HDFS proxy in the offline cloud, enabling HDFS mounting to local directories via s3fs-fuse for model training. For model deployment, it acts as a cross-cloud cache in the online cloud. When a file read request is made, UnionStore checks if the file is in object storage, reads directly if available, or uploads from offline HDFS if not.

Compared to the previous HDFS replication solution, UnionStore offers advantages like S3 protocol support, automatic file caching, unified file view with real-time HDFS metadata, and server cost savings by replacing an HDFS cluster with object storage. It also solves POSIX’s HDFS read issue.

After running UnionStore for two years, some problems emerged with the expansion of the AI platform, such as strong dependence on HDFS, high CPU consumption, performance bottleneck of object storage, and long time for first-time file reading.

These pain points led us to face a choice: continue iterating on UnionStore, or find an alternative solution that can perfectly replace UnionStore’s use cases. Given our limited developer resources, we opted for the latter.

4.3 Today: Alluxio

In search of a replacement for UnionStore, we need to meet the functionality of both model training and model deployment. Moreover, the performance for both phases needs to be sufficiently robust.

After a thorough research for solutions, we found Alluxio well-suited to our use cases for several reasons:

  • No application change: Compared to other solutions, Alluxio can function as a transparent data access layer. Users don’t need to write model files into other storage systems. They can maintain the status quo and write into HDFS directly.
  • Metadata and data caching: Alluxio supports customizable caching of metadata and data. This means that when reading cached files, the process is entirely unaffected by HDFS. At present, our UnionStore QPS is around 20K-30K. Caching metadata can significantly reduce the pressure on HDFS NameNode in the offline cloud.
  • Extensible to support multiple storages: Alluxio supports a variety of Under File System (UFS) types beyond HDFS, such as object storage. This offers robust support for our model training stage.
  • Multiple popular access interfaces: Alluxio’s S3 API is fully compatible with the S3 protocol, meaning that the cost of transitioning our model deployment from UnionStore to Alluxio is practically negligible. In addition, Alluxio’s Fuse offers local metadata caching and data caching, leading to better performance compared to the s3fs-fuse previously used, which perfectly suits our model training needs.
  • Vibrant open-source community: The Alluxio community is very active. During our research, we get prompt responses from helpful community members while encountering issues.

5. Alluxio as the High-performance Data Access Layer

5.1 Overall Architecture

We deploy one Alluxio cluster in each cloud, utilizing high-performance NVME disks to cache data on HDFS and object storage. It provides an acceleration service for large-scale data access. The Alluxio clusters act as a unified acceleration solution for large-scale data access to both model training and model deployment.

5.2 Model Training with Alluxio

For the model training in the offline cloud, the Alluxio cluster is configured to mount the under storage. After mounting, Alluxio offers transparent access to data in HDFS and object storage. The training dataset reads are through Alluxio. Hot data is cached in Alluxio to accelerate data access. The trained model is directly written to HDFS for persistent storage.

We have set Alluxio to leverage the local NVME of the training cluster. We chose Alluxio Fuse because it provides POSIX access, and can leverage memory and disk for metadata caching and data caching, maximizing the use of idle physical resources on GPU machines. Alluxio Fuse is employed to cache data and metadata on HDFS and object storage locally. See the below diagram for detailed architecture. 

5.3 Model Deployment with Alluxio

The Alluxio cluster in the online cloud caches model files and uses the S3 proxy to provide read-only to accelerate model deployment.

During model updates, Alluxio supports very high concurrency for serving models, fetching the most up-to-date model from offline training clusters to online inference clusters. The model deployment and inference now get on-demand data access for the latest models without copies.

We have fine-tuned Alluxio to adapt to our data access pattern. Since our model files expire quickly, we planned the capacity of the storage size of the Alluxio cluster and managed the storage cost of NVMe. To address the problem of high concurrency, we set the number of file copies allowed by Alluxio to unlimited and used short-circuit reads. To avoid cache penetration, we developed our own cache warm-up policy.

6. Empirical Results

6.1 Model Training: 2-3x Faster Training, 2x GPU Utilization

Regarding performance, with caching based on the read-only data access pattern for model training, the end-to-end training pipelines are accelerated. Compared to the UnionStore + s3fs-fuse, we achieved a performance increase of 2-3 times. This improvement led to a 60% acceleration in model training and a 250% increase in training data reading speed.

From the I/O perspective, Alluxio provides a higher data throughput. The GPUs are now kept busy. Our GPU utilization has increased by 2x with Alluxio.

6.2 Model Deployment: Deploy Updated Models Every Minute Instead of Hours or Days

In terms of test results, we used a 100GB file for single-thread read tests, averaging the results. As shown in the table below, Alluxio’s speed far exceeds HDFS and UnionStore, and it also solves the issue where UnionStore does not support reading during caching.

StorageDisk / MemPreload speed (MB/sec)Cold read
Warm read (MB/sec)Support read during cachetime on 100GB file cold read (seconds)time on 100GB file warm read
AlluxioNVMe2802801600~ 4096M/secYes36060

In production, we see a maximum 10x acceleration in model deployment speed, as shown in the following figure. We can now deploy updated models every minute instead of hours or days, providing agility to online model serving and impacting the experience of millions of Zhihu users.

6.3 Stable Network without Congestion

Alluxio can serve hundreds of models concurrently to thousands of servers accessing the same model, so we have solved the network congestion problem.

In terms of stability, during HDFS fluctuations or master node switching, with Alluxio’s data caching and metadata caching, we can continue to provide data to applications without interruption.

6.4 50% Reduced Cost

Compared to UnionStore, Alluxio allows us to save hundreds of thousands each year. These savings include costs associated with model serving instances, GPUs, and engineering operations. Overall, we’ve achieved a 50% cost reduction compared to our previous solution with UnionStore.

7. Conclusion and Future Work

To summarize, this blog post details the evolution of our AI platform architecture over the past five years: from using the open-source HDFS storage system, to in-house data storage unification tool, and, finally, choosing Alluxio as the high-performance data access layer to tackle our technical challenges. As a result, we’ve achieved 2x GPU utilization, 50% reduced infrastructure and operations costs, and accelerated model deployment and update times from several hours to minutes.

We are very excited about Alluxio’s long-term product roadmap, especially its new-generation architecture, which can support our need for accessing a massive number of small files. We are more confident in supporting AI applications facing the upcoming wave of artificial intelligence.

We’d like to express our gratitude to the Alluxio community. Moving forward, we hope to deepen our collaboration in accelerating analytics queries.

About the Author

Mengyu Hu is a software engineer in the data platform team at Zhihu.

Chengkun Jia is the head of the data platform team at Zhihu.