At Alluxio, we believe that in order to fundamentally solve the data access challenges, the world needs a new layer – a data orchestration platform – between computation frameworks and storage systems.
Notice anything new about our websites? That’s right – we are super excited to launch our new website – Alluxio.io!
As we continue our focus on our open source community, one important item on our mind was to rebuild our website to provide a better user experience for our community. To that end, you’ll see lots of changes in the Alluxio web experience.
Alluxio is a proud sponsor and exhibitor of Spark+AI Summit in San Francisco.
What’s Spark+AI Summit? It’s the world’s largest conference focused on Apache Spark, an older open source cousin of Alluxio from the same lab (UC Berkeley’s AMPLab, now RISELab).
Alluxio provides a distributed data access layer that lets applications like Spark or Presto access different underlying file systems (UFS) through a single API in a unified file system namespace. If users interact with files in the UFS only through Alluxio, Alluxio knows about every change the client makes to the UFS and keeps the Alluxio namespace in sync with the UFS namespace.
As part of the Alluxio 2.0 release, we have moved our RPC framework from Apache Thrift to gRPC. In this article, we will talk about the reasons behind this change as well as some lessons we learned along the way.
In Alluxio 1.x, the RPC communication between clients and servers is built mostly on top of Apache Thrift. Thrift enabled us to define the Alluxio service interface in simple IDL files and implement client bindings using native Java interfaces generated by the Thrift compiler. However, we faced several challenges as we continued developing new features and improvements for Alluxio.
China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amount of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and the Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements.
In Alluxio 1.x, the namespace was limited to around 200 million files in practice. Scaling further would cause garbage collection issues due to the limit on the Alluxio master JVM heap size. Also, storing 200 million files would require a large JVM heap of around 200 GB.
To scale the Alluxio namespace in 2.0, we added support for storing part of the namespace on disk in RocksDB. Recently-accessed data is stored in memory, while older data ends up on disk. This reduces the memory requirements for serving the Alluxio namespace, and also takes pressure off of the Java garbage collector by reducing the number of objects it needs to deal with.
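The two-tier idea described above can be illustrated with a minimal sketch. This is not Alluxio's implementation: the class name `TieredMetadataStore`, the least-recently-used eviction policy, and the use of a plain dictionary as a stand-in for an on-disk store such as RocksDB are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredMetadataStore:
    """Hypothetical two-tier store: a bounded in-memory LRU cache over a disk tier."""

    def __init__(self, cache_capacity):
        if cache_capacity < 1:
            raise ValueError("cache_capacity must be at least 1")
        self.cache_capacity = cache_capacity
        # Most recently used entries sit at the end of the OrderedDict.
        self.cache = OrderedDict()
        # Assumption: a plain dict stands in for an on-disk store such as RocksDB.
        self.disk = {}

    def put(self, path, metadata):
        """Write an entry into memory, spilling older entries to disk if needed."""
        self.disk.pop(path, None)  # keep each entry in exactly one tier
        self.cache[path] = metadata
        self.cache.move_to_end(path)
        self._evict_if_needed()

    def get(self, path):
        """Return metadata for path, promoting disk hits into memory."""
        if path in self.cache:
            self.cache.move_to_end(path)
            return self.cache[path]
        if path in self.disk:
            metadata = self.disk.pop(path)
            self.put(path, metadata)
            return metadata
        return None

    def _evict_if_needed(self):
        # Move the least recently used entries to the disk tier.
        while len(self.cache) > self.cache_capacity:
            old_path, old_metadata = self.cache.popitem(last=False)
            self.disk[old_path] = old_metadata


store = TieredMetadataStore(cache_capacity=2)
store.put("/a", {"size": 1})
store.put("/b", {"size": 2})
store.put("/c", {"size": 3})  # "/a" is least recently used and moves to disk
```

In this sketch the number of objects held in memory is capped by `cache_capacity`, which mirrors the stated goal of reducing both heap usage and garbage collector pressure. Reading `"/a"` again would promote it back into memory and push `"/b"` to the disk tier.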
In a recent blog, we discussed the ideation, design and new features in Alluxio 2.0 preview. Today we are thrilled to announce another new revolutionary project that the Alluxio engineering team has been hard at work on for the past year – the Alluxio Virtual Reality (VR) client.
In the early 2000s, big data was born, and technology companies were racing to create the next-gen compute frameworks or storage systems geared towards the requirements brought about by big data. By the time I was a first-year Ph.D. student at UC Berkeley’s AMPLab in 2011, numerous advances in big data technologies, such as Apache Spark, were emerging. Through working on Apache Spark and getting exposed to cutting-edge technologies, it became clear that sharing data among data-driven applications with different compute frameworks and moving data across storage systems would become the bottleneck for any organization that wants to extract value from its data. To solve these challenges, I created Alluxio (formerly Tachyon), which, for lack of a defined category, I called a virtualized distributed file system in my original thesis.