metadata management Archives

Alluxio Journal Evolution – Towards high availability and fault tolerance

March 4, 2022

This talk will introduce and compare the two available implementations of the journal in Alluxio: the first using ZooKeeper, and the more recent one using Raft.

Tags: alluxio day, metadata management, raft, zookeeper

Metadata Synchronization in Alluxio: Design, Implementation and Optimization

December 14, 2021 By David Zhu

Metadata synchronization (sync) is a core feature in Alluxio that keeps files and directories consistent with their source of truth in under storage systems, thus making it simple for users to reason about the data retrieved from Alluxio. Meanwhile, understanding the internal process is important for tuning performance. This article describes the design and implementation of metadata synchronization in Alluxio.

Apache Hudi : The Path Forward

October 12, 2021

Deep dive into two important areas of active development going forward: table metadata management and caching.

Tags: alluxio day, apache hudi, caching, data lake, metadata management

Scalable Filesystem Metadata Services with RocksDB

July 22, 2019

Alluxio maintainer and founding engineer Calvin Jia presents on Scalable Filesystem Metadata Services with RocksDB at the RocksDB meetup at Twitter.

Tags: alluxio engineering, meetup, metadata management, performance, scale, storage, unified namespace

Building fast and scalable big data and ML platforms at Pinterest and JD.com

June 21, 2019 by Calvin Jia & Yongsheng Wu [Pinterest]

This talk shares our design, implementation, and optimization of the Alluxio metadata service to address scalability challenges. It focuses on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and exploration and performance tuning of RPC frameworks (Thrift vs. gRPC).

Tags: aws s3, data, machine learning, meetup, metadata management, performance, scale, tiered storage

Scalable Metadata Service in Alluxio: Storing Billions of Files

May 10, 2019 By Andrew Audibert

We are writing several engineering blog posts describing the design and implementation of the Alluxio master to address this scalability challenge. This is the first article, focusing on metadata storage and service, particularly how to use RocksDB as an embedded persistent key-value store to encode and store the file system inode tree with high performance.
Alluxio serves its metadata from a single active master as the primary and potentially multiple standby masters for high availability. The master handles all metadata requests and uses a write-ahead log to journal all changes so that it can recover from crashes. The log is typically written to shared storage like HDFS for persistence and availability. Standby masters read the write-ahead log to keep their own state up to date. If the primary master dies, one of the standbys can quickly take over for it.
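
The write-ahead-log idea described above can be illustrated with a small Python sketch. This is not Alluxio's journal code: the class JournaledMaster, the JSON-lines entry format, and the local temporary file standing in for shared storage such as HDFS are all assumptions made for illustration. The sketch only shows the ordering that matters, namely that a change is persisted to the log before it is applied, and that a standby reaches the same state by replaying the log.

```python
import json
import os
import tempfile


class JournaledMaster:
    """Toy metadata master (hypothetical, not Alluxio's implementation)."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.files = {}    # path -> metadata dict (in-memory state)
        self.applied = 0   # number of journal entries applied so far

    def _apply(self, entry):
        # Mutate in-memory state according to one journal entry.
        if entry["op"] == "create":
            self.files[entry["path"]] = {"size": entry["size"]}
        elif entry["op"] == "delete":
            self.files.pop(entry["path"], None)
        self.applied += 1

    def write(self, entry):
        # Primary path: persist the entry first, then apply it in memory.
        with open(self.journal_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(entry)

    def catch_up(self):
        # Standby or recovery path: replay entries not yet applied.
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as log:
            for index, line in enumerate(log):
                if index >= self.applied:
                    self._apply(json.loads(line))


# A local temp file stands in for shared storage in this sketch.
journal = os.path.join(tempfile.mkdtemp(), "journal.log")
primary = JournaledMaster(journal)
standby = JournaledMaster(journal)

primary.write({"op": "create", "path": "/a", "size": 10})
primary.write({"op": "create", "path": "/b", "size": 20})
primary.write({"op": "delete", "path": "/a"})

standby.catch_up()  # the standby now mirrors the primary's state
```

In this toy version a standby that takes over simply calls catch_up() and then starts accepting writes. Leader election and concurrent writers are out of scope here.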

Building a Distributed Data Access Layer for Analytics on Any Cloud

Data Council SF * April 18, 2019

In this talk, we will focus on Alluxio's design: its architecture, data flow, and metadata flow. We will dive into the choices in its design space and share our experience implementing features like data tiering, storage options, and cache eviction policies. We will also share our lessons in design, implementation, and operation from building an open source distributed storage system with 900 contributors over 5+ years.

Store 1 Billion Files in Alluxio 2.0

April 9, 2019 By Andrew Audibert

In Alluxio 1.x, the namespace was limited to around 200 million files in practice. Scaling further would cause garbage collection issues due to the limit of the Alluxio master JVM heap size. Also, storing 200 million files would require a large memory footprint (around 200GB) of JVM heap.
To scale the Alluxio namespace in 2.0, we added support for storing part of the namespace on disk in RocksDB. Recently-accessed data is stored in memory, while older data ends up on disk. This reduces the memory requirements for serving the Alluxio namespace, and also takes pressure off of the Java garbage collector by reducing the number of objects it needs to deal with.
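
A rough Python sketch of the two-tier idea described above: recently accessed entries stay in memory, and older ones spill to a slower store. The class TieredInodeStore is hypothetical, a plain dict stands in for RocksDB, and the least-recently-used eviction and promote-on-read behavior are assumptions chosen for illustration rather than a description of Alluxio's actual policy.

```python
from collections import OrderedDict


class TieredInodeStore:
    """Hypothetical two-tier metadata store: hot entries in memory, the rest 'on disk'."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = OrderedDict()  # hot tier, ordered from oldest to newest access
        self.disk = {}               # stand-in for an on-disk store such as RocksDB

    def put(self, inode_id, metadata):
        # New or updated entries always land in the hot tier.
        self.disk.pop(inode_id, None)
        self.memory[inode_id] = metadata
        self.memory.move_to_end(inode_id)
        self._evict()

    def get(self, inode_id):
        if inode_id in self.memory:
            self.memory.move_to_end(inode_id)  # mark as recently used
            return self.memory[inode_id]
        if inode_id in self.disk:
            # Assumption: promote entries back to memory when they are read.
            metadata = self.disk.pop(inode_id)
            self.memory[inode_id] = metadata
            self._evict()
            return metadata
        return None

    def _evict(self):
        # Spill the least recently used entries to the cold tier.
        while len(self.memory) > self.max_in_memory:
            old_id, old_metadata = self.memory.popitem(last=False)
            self.disk[old_id] = old_metadata


store = TieredInodeStore(max_in_memory=2)
store.put(1, {"name": "a"})
store.put(2, {"name": "b"})
store.put(3, {"name": "c"})   # inode 1 is now the oldest and moves to the cold tier
```

Only max_in_memory entries ever live in the hot tier, so the in-memory footprint is bounded regardless of how large the namespace grows.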

Alluxio 2.0 Deep Dive & A Case of Real-time Processing with Spark

Bay Area Meetup * March 12, 2019

We are excited to present Alluxio 2.0 to our community. The goal of Alluxio 2.0 was to significantly enhance data accessibility with improved APIs, expand the supported use cases to include active workloads, and improve metadata management and availability to support hyperscale deployments. The Alluxio 2.0 Preview Release is the first major milestone on this path to Alluxio 2.0 and includes many new features.
