This is a guest blog by Jowanza Joseph with an original blog source. It is about how he used Alluxio to reduce p99 and p50 query latencies and optimized the overall platform costs for a distributed querying application. Jowanza walks through the product and architecture decisions that lead to our final architecture, discuss the tradeoffs, share some statistics on the improvements, and discuss future improvements to the system.
We are writing several engineering blogs describing the design and implementation of Alluxio master to address this scalability challenge. This is the first article focusing on metadata storage and service, particularly how to use RocksDB as an embedded persistent key-value store to encode and store the file system inode tree with high performance.
Alluxio serves its metadata from a single active master as the primary and potentially multiple standby master for high availability. The master handles all metadata requests and uses a write-ahead log to journal all changes so that we can recover from crashes. The log is typically written to shared storage like HDFS for persistence and availability. Standby masters read the write-ahead log to keep their own state up-to-date. If the primary master dies, one of the standbys can quickly take over for it.
At Alluxio, we believe that in order to fundamentally solve the data access challenges, the world needs a new layer – a data orchestration platform – between computation frameworks and storage systems.
Notice anything new about our websites? That’s right – we are super excited to launch our new website – Alluxio.io!
As we continue our focus on our open source community, one important item on our mind was to rebuild our website to provide better user experience for our community. To that end, you’ll see lots of changes in the Alluxio web experience.
Alluxio is a proud sponsor and exhibitor of Spark+AI Summit in San Francisco.
What’s Spark+AI Summit? It’s the world’s largest conference that is focused on Apache Spark – Alluxio’s older cousin open source project from the same lab (UC Berkeley’s AMPLab – now RISElab).
Alluxio provides a distributed data access layer for applications like Spark or Presto to access different underlying file system (or UFS) through a single API in a unified file system namespace. If users only interact with the files in the UFS through Alluxio, since Alluxio has knowledge of any changes the client makes to the UFS, it will keep Alluxio namespace in sync with the UFS namespace.
As part of the Alluxio 2.0 release, we have moved our RPC framework from Apache Thrift to gRPC. In this article, we will talk about the reasons behind this change as well as some lessons we learned along the way.
In Alluxio 1.x, the RPC communication between clients and servers is built mostly on top of Apache Thrift. Thrift enabled us to define Alluxio service interface in simple IDL files and implement client binding using native Java interfaces generated by Thrift compiler. However, we faced several challenges as we continued developing new features and improvements for Alluxio.
In this meetup, Bin Fan from Alluxio and Wenbo Zhao from Two Sigma co-presented a reference stack (running Alluxio as a data access layer for Apache Spark) that can enable independent and separated compute and storage for big data and machine learning workloads. Two Sigma’s use case is a great example of the benefits of this reference stack for bursting machine learning computation to the public cloud while still being able to access data stored on-premise efficiently. Their data scientists want to leverage the public cloud as a scalable and elastic computation resource to speed up the end-to-end model training process. By using Alluxio as the data access layer co-located with compute in the cloud, their researchers achieved 10x faster end to end processing, which enables them to perform more iterations on their models.
China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amounts of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements.