This article describes how Alluxio can accelerate the training of deep learning models in a hybrid cloud environment when using Intel’s Analytics Zoo open source platform, powered by oneAPI. It covers the new architecture and workflow, as well as Alluxio’s performance benefits and benchmark results.
1. Deep Learning in Hybrid Environments
Architecture Evolution to Hybrid Mode
Traditionally, data processing and analytics systems were designed, built, and operated with compute and storage services as one monolithic platform, residing in an on-premises data warehouse. While simple to manage and performant, this architecture tightly couples storage and compute, which makes it hard to give applications elasticity or to scale one type of resource without scaling the other.
More users are moving towards a hybrid model, combining resources from both cloud and on-premises environments. This model leaves the data where it resides, typically in the on-premises data warehouse, and launches a separate compute layer as needed. The hybrid model allows compute and storage resources to be scaled independently, leading to numerous advantages:
- No resource contention: On-premise machines can be fully utilized by storage services because there is no competition for resources from compute services
- No compute downtime: There are no idle compute resources because clusters are launched on demand in the cloud
- No data duplication: Long-running batch jobs or ephemeral ad-hoc queries can share the same set of data without making separate copies
Challenges to Deliver Fast I/O for Deep Learning
Although hybrid architecture provides flexibility and cost advantages, there are additional challenges for deep learning analytics when training on big data. Deep learning training involves numerous trials of different neural network models and different hyper-parameters using the same set of data. In addition, the size of the training datasets has been continuously growing. There is a huge overhead cost in loading all this data for each trial when the training data is stored in a remote storage system.
A common practice today for managing data across hybrid environments is to copy data to a storage service residing in the compute cluster before running deep learning jobs. Typically, users run commands like “DistCp” to copy data back and forth between the on-premise and cloud environments. While this looks easy, it typically requires a manual process that is slow and error-prone.
To address the I/O challenges of training deep learning models in hybrid environments and leverage Intel’s oneAPI performance optimizations, we developed and tested a new architecture/workflow integrating Alluxio in the Analytics Zoo platform, powered by oneAPI.
2. A New Architecture & Workflow with Analytics Zoo and Alluxio
What is Analytics Zoo
Analytics Zoo, powered by oneAPI, is an open source unified analytics and AI platform developed by Intel to seamlessly unite several deep learning applications into an integrated pipeline. Users can transparently scale from running sample jobs on a laptop to processing production-scale big data on large clusters. Its capabilities include:
- Writing TensorFlow or PyTorch inline with Spark code for distributed training and inference
- Native deep learning (TensorFlow/Keras/PyTorch/BigDL) support in Spark ML Pipelines
- Directly running Ray programs on big data clusters through RayOnSpark.
- Plain Java/Python APIs for (TensorFlow/PyTorch/BigDL/OpenVINO) Model Inference
What is Alluxio
Alluxio is an open-source data orchestration layer for data analytics. It provides high performance to data analytics or machine learning systems like Analytics Zoo, serving as a distributed caching layer that avoids reading the same data repeatedly from remote data sources. Compared to other solutions, Alluxio’s “zero-copy burst” capabilities for bursting data processing to the cloud provide the following advantages in a hybrid cloud environment:
- Compute-driven data on-demand: When a storage system is mounted onto Alluxio, only its metadata is initially loaded. Alluxio caches data only as applications request it. This on-demand behavior allows data processing to burst to the cloud, eliminating the need to manually copy data from an on-premise cluster to the cloud.
- Data Locality: Alluxio intelligently caches data close to applications, replicates hot data, or evicts stale data based on data access patterns.
- Data Elasticity: Alluxio can be elastically scaled along with the analytics frameworks, including container orchestrated environments.
- Common APIs for data access: Alluxio provides data abstraction with different common APIs including the HDFS API, S3 API, POSIX API and others. Existing applications built for analytical and AI workloads can run directly on this data without any changes to the application itself
Setup and Workflow
The following figure is the architecture that integrates Alluxio with Analytics Zoo for fast and efficient deep learning workloads:
On-premise or remote data stores are mounted onto Alluxio. The Analytics Zoo application launches deep learning training jobs by running Spark jobs, loading data from Alluxio through the distributed filesystem interface. Initially, Alluxio has not cached any data, so it retrieves the data from the mounted data store and serves it to the Analytics Zoo application while keeping a cached copy amongst its workers. This first trial will run at approximately the same speed as if the application were reading directly from the on-premise data source. In subsequent trials, Alluxio will have a cached copy, so data will be served directly from the Alluxio workers, eliminating the remote request to the on-premise data store. Note that the caching process is transparent to the user; no manual intervention is needed to load the data into Alluxio. However, Alluxio does provide commands like “distributedLoad” to preload the working dataset and warm the cache if desired. There is also a “free” command to reclaim the cache storage space without purging data from the underlying data stores.
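The caching behavior described above can be illustrated with a toy read-through cache. This is only a sketch of the idea, not Alluxio's implementation or API: the class names `RemoteStore` and `ReadThroughCache`, the `preload` and `free` methods (loosely mirroring “distributedLoad” and “free”), and the file paths are all hypothetical.

```python
from typing import Dict, Iterable


class RemoteStore:
    """Hypothetical stand-in for a mounted data store; counts remote reads."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects
        self.remote_reads = 0

    def read(self, path: str) -> bytes:
        self.remote_reads += 1
        return self._objects[path]


class ReadThroughCache:
    """Toy model of a caching layer sitting in front of a remote store."""

    def __init__(self, store: RemoteStore) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        # First access goes to the remote store; later accesses hit the cache.
        if path not in self._cache:
            self._cache[path] = self._store.read(path)
        return self._cache[path]

    def preload(self, paths: Iterable[str]) -> None:
        # Warm the cache ahead of time (analogous in spirit to "distributedLoad").
        for path in paths:
            self.read(path)

    def free(self, path: str) -> None:
        # Drop the cached copy only; the remote store keeps its data.
        self._cache.pop(path, None)


def run_trials(cache: ReadThroughCache, paths: Iterable[str], trials: int) -> None:
    """Simulate several training trials that each read the full dataset."""
    paths = list(paths)
    for _ in range(trials):
        for path in paths:
            cache.read(path)


dataset = {"/imagenet/train-0": b"a", "/imagenet/train-1": b"b"}
store = RemoteStore(dataset)
cache = ReadThroughCache(store)
run_trials(cache, dataset.keys(), trials=5)
print(store.remote_reads)  # 2: one remote read per file, regardless of trial count
```

In this model, five trials over two files trigger only two remote reads, which is the effect the workflow relies on when the same dataset is reused across many training trials.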
3. Benchmark Results
This section summarizes Alluxio’s performance testing and benchmark results for the integrated workflow.
We ran the experiments on a 7-node Spark cluster (1 instance as the master node and the remaining 6 as worker nodes) deployed by AWS EMR. The benchmark workload is Inception v1 training, using the ImageNet dataset stored in AWS S3 in the same region.
As the baseline, the Spark cluster is directly accessing the dataset from the S3 bucket. This is compared to a setup where Alluxio is installed on the Spark cluster, with the S3 bucket mounted as its under filesystem.
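From the application's point of view, the main difference between the two setups is the path it reads from. The sketch below shows how an S3 object path might map to an Alluxio path once the bucket is mounted. The helper `to_alluxio_uri`, the bucket name, the mount point `/s3`, and the master address are hypothetical; the actual values depend on how the cluster and mount are configured.

```python
def to_alluxio_uri(
    s3_uri: str,
    bucket: str,
    mount_point: str = "/",
    master: str = "alluxio-master:19998",  # hypothetical master host and port
) -> str:
    """Map an S3 object URI to the Alluxio URI under which it would appear,
    assuming the bucket root is mounted at `mount_point`."""
    prefix = f"s3://{bucket}/"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"{s3_uri!r} is not inside bucket {bucket!r}")
    key = s3_uri[len(prefix):]
    base = mount_point.strip("/")
    # Join the mount point and object key, skipping an empty root mount.
    path = "/".join(part for part in (base, key) if part)
    return f"alluxio://{master}/{path}"


# Baseline: the Spark job reads the dataset directly from S3.
baseline_path = "s3://example-bucket/imagenet/train"

# With Alluxio: the same data is addressed through the mount point.
alluxio_path = to_alluxio_uri(baseline_path, "example-bucket", mount_point="/s3")
print(alluxio_path)  # alluxio://alluxio-master:19998/s3/imagenet/train
```

Because only the path changes, the training code itself can stay the same between the baseline and the Alluxio runs.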
The following table details the specific environment configurations:
| Configuration | Value |
|---|---|
| EC2 Instance Type | r5.8xlarge |
| Number of vCPU per instance | 32 |
| Size of memory per instance | 256GB |
| Operating System | Ubuntu 18.04 |
| Apache Spark version | 2.4.3 |
| Analytics Zoo version | 0.7.0 |
We measured data loading performance when running Inception training on ImageNet data with Analytics Zoo. The measured time includes both the training data and test data loading time.
The average load time with and without Alluxio is 369 and 579 seconds, respectively. This is approximately a 1.5x speedup when Analytics Zoo uses Alluxio for loading the ImageNet training and testing data. Note that the input data is located in S3 in the same region as the compute cluster.
The following figure shows that with Alluxio, variation in performance (15.9 seconds) is also much lower than the baseline variation (32.3 seconds). This indicates that Alluxio not only helps the average loading time but also makes the performance more consistent.
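The reported figures can be checked with a few lines of arithmetic. The sketch below takes 369 seconds as the Alluxio load time and 579 seconds as the baseline, which is the assignment consistent with the stated 1.5x speedup, and treats the two variation figures as a spread in seconds. The relative-spread calculation is an added illustration, not a metric reported in the benchmark.

```python
# Average load times in seconds, as reported in the benchmark.
baseline_s = 579.0  # reading directly from S3
alluxio_s = 369.0   # reading through Alluxio

speedup = baseline_s / alluxio_s           # about 1.57
saved_s = baseline_s - alluxio_s           # 210 seconds saved per load
reduction_pct = 100 * saved_s / baseline_s  # about 36.3% less loading time

# Reported variation in seconds, expressed relative to each mean (illustrative).
baseline_spread_s = 32.3
alluxio_spread_s = 15.9
baseline_rel_pct = 100 * baseline_spread_s / baseline_s  # about 5.6%
alluxio_rel_pct = 100 * alluxio_spread_s / alluxio_s     # about 4.3%

print(f"speedup: {speedup:.2f}x, time saved: {saved_s:.0f}s ({reduction_pct:.1f}%)")
print(f"relative spread: baseline {baseline_rel_pct:.1f}%, Alluxio {alluxio_rel_pct:.1f}%")
```

Under these assumptions, the spread shrinks both in absolute terms and relative to the mean load time, which supports the consistency claim.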
By leveraging Alluxio as a data layer for Analytics Zoo, the hybrid cloud solution accelerates data loading in Analytics Zoo applications and deep learning analytics on big data systems. Alluxio’s internal performance benchmark shows that this architecture delivers approximately a 1.5x speedup when Analytics Zoo uses Alluxio for loading the ImageNet training and testing data.
Continued advancements in artificial intelligence applications have brought deep learning to the forefront of a new generation of data analytics development. There is an increasing demand from organizations to apply deep learning technologies to their big data analysis pipelines. On behalf of the entire Alluxio open source community, we encourage our readers to give this solution a try and invite you to ask questions in our community Slack channel whenever you encounter any issues.
Special thanks to Intel’s Jennie Wang and Louie Tsai for their valuable technical consultation and support on Analytics Zoo.