Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds

ODSC WEST 2019

Cloud storage gives data scientists great flexibility in management and cost efficiency, but it also introduces new challenges around data accessibility and data locality for machine learning applications. For instance, when the input data is stored in remote cloud storage such as AWS S3 or Azure Blob Storage, direct data access is often slow and expensive, while manually moving data to the training clusters can be time-consuming and complicated, and often requires data engineering or ETL pipelines.

This session is designed for data scientists and data engineers who work with remote, and possibly multiple, data sources in hybrid or multi-cloud environments. We will show how to use Alluxio to greatly simplify data preparation in these environments, covering the following topics:

- How to set up and create a POSIX endpoint for the Alluxio service to unify file system data access to S3, HDFS and Azure Blob Storage
- How to run Apache Spark to read input from and write output to remote storage, with Alluxio as the distributed data caching layer
- How to run TensorFlow to train models on remote input data, accessing it as if it were on the local file system
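
To make the last topic concrete: once remote storage is exposed through a POSIX endpoint, training code can read it with ordinary file APIs. The following is a minimal Python sketch of that idea, not the code from the session. The mount point `/mnt/alluxio-fuse`, the `ALLUXIO_MOUNT` environment variable, the CSV layout and both helper names are hypothetical and are not part of any Alluxio or TensorFlow API.

```python
import os
from pathlib import Path
from typing import Iterator, List

# Hypothetical mount point of the Alluxio POSIX endpoint (assumption, not from the talk).
ALLUXIO_MOUNT = Path(os.environ.get("ALLUXIO_MOUNT", "/mnt/alluxio-fuse"))


def list_training_files(root: Path, suffix: str = ".csv") -> List[Path]:
    """Return all files under root with the given suffix, sorted for reproducibility."""
    if not root.is_dir():
        # The mount is missing or not a directory: nothing to read.
        return []
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


def read_records(files: List[Path]) -> Iterator[str]:
    """Yield non-empty lines from each file using plain local file I/O."""
    for path in files:
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


if __name__ == "__main__":
    files = list_training_files(ALLUXIO_MOUNT)
    print(f"found {len(files)} files under {ALLUXIO_MOUNT}")
    # Peek at the first record of the first file, if any.
    for record in read_records(files[:1]):
        print(record)
        break
```

Because the helpers only see local paths, the same file list can be handed to any loader that accepts local file names. Whether the underlying bytes live in S3, HDFS or Azure Blob Storage is then a matter of how the mount is configured, not of the training code.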

Speaker:

Bin Fan, Alluxio
Bin Fan is the founding engineer and VP of Open Source at Alluxio, Inc. Prior to Alluxio, he worked at Google, building its next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University, where he worked on the design and implementation of distributed systems.
