Tech Talk Slide Deck

Unified Big Data Analytics: Any Stack, Any Cloud


The big data stack has evolved considerably over the past few years, with an explosion of data frameworks that started with MapReduce and expanded to Apache Spark, Presto, and Hive on the structured data side, and to TensorFlow and Caffe on the AI and ML side. The approach to managing and storing data has evolved as well, starting from HDFS and moving to newer approaches such as object stores. With so many possible combinations of compute frameworks and storage systems, data engineering has become increasingly complex, particularly in hybrid and multi-cloud environments. Users are increasingly adding a new layer to their data stack that unifies files and objects and provides data locality across separated compute and storage environments.

This is the fundamental problem Alluxio solves. Alluxio is an open-source virtual distributed file system that provides a unified data access layer for hybrid and multi-cloud deployments. It enables distributed compute engines like Spark and Presto, and machine learning frameworks like TensorFlow, to transparently access different persistent storage systems (including HDFS, S3, and Azure) while actively using an in-memory cache to accelerate data access. Originally developed at the UC Berkeley AMPLab as the research project “Tachyon”, Alluxio has more than 900 contributors and is used by over 100 companies worldwide, with the largest production deployment exceeding 1000 nodes.
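As a rough illustration of what a unified data access layer means, the sketch below maps paths in a single virtual namespace onto different underlying storage systems. It is a toy model, not Alluxio's implementation or API: the `UnifiedNamespace` class, its `mount` and `resolve` methods, and the example URIs are all hypothetical names chosen for this example.

```python
from typing import Dict


class UnifiedNamespace:
    """Toy mount table: one virtual path space over several storage systems.

    Hypothetical illustration only; this is not Alluxio's API.
    """

    def __init__(self) -> None:
        # Maps a virtual path prefix to the URI of the storage backing it.
        self._mounts: Dict[str, str] = {}

    def mount(self, virtual_prefix: str, storage_uri: str) -> None:
        """Attach an underlying storage location at a virtual prefix."""
        self._mounts[virtual_prefix.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, virtual_path: str) -> str:
        """Translate a virtual path to the underlying storage URI.

        The longest matching mount prefix wins, so nested mounts work.
        """
        best = None
        for prefix in self._mounts:
            matches = virtual_path == prefix or virtual_path.startswith(prefix + "/")
            if matches and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {virtual_path!r}")
        return self._mounts[best] + virtual_path[len(best):]


# Example URIs are made up for illustration.
ns = UnifiedNamespace()
ns.mount("/warehouse", "hdfs://namenode:8020/warehouse")
ns.mount("/logs", "s3://example-bucket/logs")
print(ns.resolve("/warehouse/orders/part-0.parquet"))
print(ns.resolve("/logs/2019/app.log"))
```

In this model, an application only ever sees paths such as `/logs/2019/app.log`; which storage system actually holds the bytes is a detail of the mount table.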

This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Cloud object storage systems, which are increasingly popular, provide more cost-effective and scalable storage but also have different semantics and performance implications compared to HDFS. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when retrieving data from cloud object storage. Deploying Alluxio in front of cloud storage solves these problems because data is retrieved and cached in Alluxio instead of being fetched repeatedly from the underlying cloud or object storage.
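The caching argument above can be sketched as a read-through cache placed between jobs and a remote object store. This is a minimal sketch under stated assumptions, not how Alluxio is implemented: `FakeObjectStore` stands in for a remote store and simply counts requests, and `ReadThroughCache` is a hypothetical in-memory cache with no eviction, tiering, or distribution.

```python
from typing import Dict


class FakeObjectStore:
    """Stand-in for a remote object store; counts how often it is hit."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects
        self.requests = 0

    def get(self, key: str) -> bytes:
        self.requests += 1  # each call models one remote round trip
        return self._objects[key]


class ReadThroughCache:
    """Hypothetical cache layer: serve from memory, fall back to the store."""

    def __init__(self, store: FakeObjectStore) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # First access: fetch once from the store and keep a copy.
        self.misses += 1
        data = self._store.get(key)
        self._cache[key] = data
        return data


store = FakeObjectStore({"table/part-0": b"rows..."})
cache = ReadThroughCache(store)

# Two separate "jobs" read the same object through the shared cache.
job_a = cache.read("table/part-0")
job_b = cache.read("table/part-0")
print(store.requests, cache.hits, cache.misses)
```

Because both reads go through the same cache, only the first one reaches the store; the second is served from memory, which is the cross-job reuse that direct reads from object storage do not provide.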

Bin Fan and Dipti Borkar from Alluxio will present an overview of Alluxio’s core concepts, architecture, and data flow, as well as production use cases. They will also present an architecture that combines big data analytics with Alluxio, with use cases from major internet companies including JD.com, Baidu, Tencent, and NetEase running hundreds of nodes in production, and share lessons learned from operating this architecture at scale.