deep learning Archives

Top Tips and Tricks for PyTorch Model Training Performance Tuning [2023]

July 22, 2023 By Hope Wang, Beinan Wang and Chunxu Tang

Get the latest and greatest tips to accelerate your PyTorch model training for machine learning and deep learning. PyTorch, an open-source machine learning framework, has become the de facto choice for many organizations to develop and deploy deep learning models. Model training is the most compute-intensive phase of the machine learning pipeline. It requires continuous … Continued

Accelerating Machine Learning / Deep Learning in the Cloud: Architecture and Benchmark

December 7, 2021

This whitepaper shows how to speed up end-to-end distributed training in the cloud by using Alluxio to accelerate data access. With Alluxio, loading data from cloud storage, caching it, and training on it can all happen transparently and in a distributed way as part of the training process. The whitepaper also demonstrates how to set up and benchmark the end-to-end performance of the training process, and compares it with other popular approaches.

Tags: benchmark, cache, cloud, data orchestration, deep learning, distributed training, machine learning, performance, storage

Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid

December 13, 2020

Unisound is an artificial intelligence company focused on AI services for the Internet of Things, with fully independent intellectual property rights and the world’s top intelligent voice technology. Atlas is the deep learning platform within Unisound AI Labs, and it provides deep learning pipeline support for hundreds of algorithm scientists. This talk shares three real business training scenarios that leverage Alluxio’s distributed caching capabilities and Fluid’s cloud-native capabilities to achieve significant training acceleration and resolve platform I/O bottlenecks. We hope that the practice of Alluxio and Fluid on the Atlas platform will benefit more companies and engineers.

Tags: atlas, data orchestration, data orchestration summit, deep learning, fluid

Deep Learning at Alibaba Cloud with Alluxio – Running PyTorch on HDFS

June 19, 2020 By Yang Che (Alibaba)

Google’s TensorFlow and Facebook’s PyTorch are two Deep Learning frameworks that have been popular with the open source community. Although PyTorch is still a relatively new framework, many developers have successfully adopted it due to its ease of use. By default, PyTorch does not support Deep Learning model training directly in HDFS, which brings challenges … Continued

Efficient Model Training in the Cloud with Kubernetes, TensorFlow, and Alluxio

May 22, 2020 By Rong Gu (Nanjing University) and Yang Che (Alibaba)

Alibaba, Alluxio, and Nanjing University collaborated to tackle the problems of deep learning model training in the cloud. Our goal was to reduce the cost and complexity of data access for deep learning training in a hybrid environment, and the work resulted in a reduction of over 40% in training time and cost.

Tag: deep learning