Orchestrating Data for the Cloud World with Alluxio 2.0

Today, I’m thrilled to announce the GA of Alluxio 2.0, Alluxio’s biggest release to date (see our Release Notes & Release Blog), with over 900 PRs. Thank you to the 1000+ open source developers, our amazing team, and our users, customers, and partners, who together made this possible! 2.0 is a major step towards realizing our vision of building an open source implementation of a Data Orchestration system for analytics and machine learning in the cloud.

In this blog, I will share the motivation behind starting this project back at UC Berkeley AMPLab, and how the system has evolved as the broader ecosystem transitioned into the cloud world.

Back in 2013, I was a Ph.D. student at UC Berkeley AMPLab, advised by Professors Ion Stoica and Scott Shenker. By then, Hadoop was dominating the big data ecosystem and becoming the de facto industry solution. Meanwhile, AMPLab researchers had started to build the Berkeley Data Analytics Stack (BDAS), which spawned widely popular open-source projects like Apache Spark and Apache Mesos. With an intrinsic interest in data and distributed systems, I co-created the Alluxio open source project (formerly named Tachyon) as the data layer in BDAS.

Initially, Tachyon was commonly used together with Apache Spark to save and share off-heap RDDs (Resilient Distributed Datasets) in memory. Soon it became clear that this research project had great potential as a standardized data access layer across multiple storage systems. Whether with Hadoop or BDAS, running analytics in a data warehouse was the assumption and focus at that time; the concept of cloud computing was more a buzzword than a reality for most enterprises.

The data ecosystem has evolved dramatically since I founded the company in 2015. Today, most organizations are either already in the cloud (whether single, hybrid or multi-cloud) or in the process of adopting a cloud strategy. Unlike in traditional data warehouses, where data was served to colocated analytics jobs, data in the cloud is more distant (e.g., transferred from S3), more siloed (e.g., spread across multiple regions or storage services), and often subject to large variance in performance.

Designed to provide a data abstraction that decouples compute and storage, Alluxio is ideally positioned in the cloud world as an orchestration platform for data (much as Kubernetes is the orchestration platform for containers). It enables data engineers to run analytics and AI/ML workloads on the clouds of their choice with orders of magnitude higher performance, and to interact with data on demand without having to worry about where the data resides or the performance implications.

Today, Alluxio is deployed and trusted by industry-leading companies such as China Unicom, Development Bank of Singapore, Tencent, and many more. Some of the largest deployments have more than 1,000 nodes in a single Alluxio cluster, powering critical infrastructure globally. At the same time, our community has grown to 1000+ contributors, and our software can handle billions of files and manage petabyte-scale data.

I am more excited today than ever. Alluxio 2.0 is a big step towards realizing the vision of being the data orchestration layer that enables new technology stacks and helps organizations unlock the power of data for all. You are welcome to download the software and try it out.

Enjoy hacking and creating, and cheers to the future!