Orchestrating Data for the Cloud World with Alluxio 2.0

July 11, 2019

Haoyuan Li

Today, I’m thrilled to announce the general availability (GA) of Alluxio 2.0, Alluxio’s biggest release to date, with over 900 PRs (see our Release Notes & Release Blog). Thank you to the 1000+ open source developers, our amazing team, and the users, customers, and partners who together made this possible! 2.0 is a major step towards realizing our vision of building an open source implementation of a Data Orchestration system for analytics and machine learning in the cloud.

In this blog, I will share the motivation behind starting this project back at UC Berkeley AMPLab, and how the system has evolved as the broader ecosystem transitioned into the cloud world.

Back in 2013, I was a Ph.D. student at UC Berkeley AMPLab advised by Professors Ion Stoica and Scott Shenker. By then, Hadoop was dominating the big data ecosystem and becoming the de facto industry solution, while AMPLab researchers had started to build the Berkeley Data Analytics Stack (BDAS), which spawned widely popular open-source projects like Apache Spark and Apache Mesos. With an intrinsic interest in data and distributed systems, I co-created the Alluxio open source project (formerly named Tachyon) as the data layer in BDAS.

Initially, Tachyon was commonly used together with Apache Spark to save and share off-heap RDDs (Resilient Distributed Datasets) in memory. Soon it became clear that this research project had great potential as a standardized data access layer across multiple storage systems. Whether with Hadoop or BDAS, running analytics in a data warehouse was the assumption and focus at that time; the concept of cloud computing was more a buzzword than a reality for most enterprises.

The data ecosystem has evolved dramatically since I founded the company in 2015. Today, most organizations are either already in the cloud (whether single, hybrid or multi-cloud) or in the process of adopting a cloud strategy. Unlike in traditional data warehouses, where analytics jobs are colocated with the data they read, data in the cloud is more distant (e.g., transferred from S3), more siloed (e.g., spread across multiple regions or storage services), and often served with large variance in performance.

Designed to provide a data abstraction that decouples compute from storage, Alluxio is ideally positioned in the cloud world as an orchestration platform for data (just as Kubernetes is the orchestration platform for containers). It enables data engineers to run analytics and AI/ML workloads on the clouds of their choice with orders of magnitude higher performance, and to interact with data on demand without having to worry about where the data resides or the performance implications.

Today, Alluxio is deployed and trusted by industry-leading companies such as China Unicom, Development Bank of Singapore, Tencent, and many more. Some of the largest deployments have more than 1,000 nodes in a single Alluxio cluster, powering critical infrastructure globally. At the same time, our community has grown to 1000+ contributors, and our software can handle billions of files and manage petabyte-scale data.

I am more excited today than ever. Alluxio 2.0 is a big step towards realizing the vision of being the data orchestration layer that enables new technology stacks and helps organizations unlock the power of data for all. You are welcome to download the software and try it out!

Enjoy hacking, creating, and cheers to the future!
