Building a Distributed File System For The Cloud-Native Era

Tags: , , ,

Big Data Bellevue Meetup

May 19, 2022

Today, data engineering in modern enterprises has become increasingly more complex and resource-consuming, particularly because (1) the rich amount of organizational data is often distributed across data centers, cloud regions, or even cloud providers, and (2) the complexity of the big data stack has been quickly increasing over the past few years with an explosion in big-data analytics and machine-learning engines (like MapReduce, Hive, Spark, Presto, Tensorflow, PyTorch to name a few).

To address these challenges, it is critical to provide a single and logical namespace to federate different storage services, on-prem or cloud-native, to abstract away the data heterogeneity, while providing data locality to improve the computation performance. [Bin Fan] will share his observation and lessons learned in designing, architecting, and implementing such a system – Alluxio open-source project — since 2015.

Alluxio originated from UC Berkeley AMPLab (used to be called Tachyon) and was initially proposed as a daemon service to enable Spark to share RDDs across jobs for performance and fault tolerance. Today, it has become a general-purpose, high-performance, and highly available distributed file system to provide generic data service to abstract away complexity in data and I/O. Many companies and organizations today like Uber, Meta, Tencent, Tiktok, Shopee are using Alluxio in production, as a building block in their data platform to create a data abstraction and access layer. We will talk about the journey of this open source project, especially in its design challenges in tiered metadata storage (based on RocksDB), embedded state-replicate machine (based on RAFT) for HA, and evolution in RPC framework (based on gRPC) and etc.

Meetup Group

Big Data Bellevue:


Presentation Slides:


Bin Fan is the PMC maintainer of Alluxio open source. Prior to joining Alluxio as a founding engineer, he worked for Google to build the next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University on the design and implementation of distributed systems.