It is becoming increasingly popular for data scientists to train models with frameworks like TensorFlow on a local server or cluster while keeping massive amounts of input data in remote shared storage such as S3 or Google Cloud Storage. This stack offers high flexibility and cost efficiency, especially because it requires no DevOps effort to manage and maintain the data. However, moving data from remote storage to feed the local training processes can be inefficient when the access pattern involves many iterations over the same input.
A common workaround in practice is to copy and distribute the data to storage that is local or close to the servers running the model training processes. This data preparation step, often done manually or maintained in scripts, can be slow, error-prone, and difficult to manage at scale. It is also hard to coordinate data sharing across multiple model training jobs (e.g., when different instances explore the parameter space at the same time, or when different members of the same team work on the same data sets).
How Alluxio Helps
Ideally, preparing training data from remote to local storage and enabling data sharing should be automated and transparent to the applications. One can deploy a data orchestration layer like Alluxio to serve the data to TensorFlow and improve end-to-end model development efficiency. For example, Alluxio can be deployed colocated with the training cluster, exposing the training data through Alluxio's POSIX or HDFS-compatible interfaces, backed by mounted remote storage such as S3. The training data can be preloaded into Alluxio from the remote storage or cached on demand. See the documentation for more details.
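To make "transparent to the applications" concrete, here is a minimal sketch of how a training script might discover its input files through a POSIX mount. It uses only ordinary file system calls, with no storage-specific client. The mount point `/mnt/alluxio-fuse`, the `training-data` directory, the `.tfrecord` suffix, and the helper `list_training_files` are all hypothetical names chosen for illustration; they are not taken from the Alluxio documentation, so substitute the values from your own deployment.

```python
import os
from typing import List

# Hypothetical mount point: wherever the POSIX interface is mounted on the
# training host. This path is an assumption, not an Alluxio default.
ALLUXIO_MOUNT = "/mnt/alluxio-fuse"


def list_training_files(root: str, suffix: str = ".tfrecord") -> List[str]:
    """Return the sorted paths of all files under `root` ending in `suffix`."""
    matches = []
    # os.walk uses plain directory listings, so it works the same way on a
    # local disk and on a POSIX-mounted remote store.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # "training-data" is a hypothetical directory name under the mount.
    files = list_training_files(os.path.join(ALLUXIO_MOUNT, "training-data"))
    print(f"found {len(files)} training files")
```

The resulting list of paths can then be handed to a file-based input pipeline, for example `tf.data.TFRecordDataset`, which accepts a list of file names. Because the script sees only local-looking paths, nothing in it needs to change when the backing storage does.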