Resource Hub



ALLUXIO NEW YORK MEETUP
Innovative organizations such as Uber and Twitter have moved to disaggregated stacks: one tier for computational frameworks like Spark and Presto, and a separate tier for storage. The need for more compute flexibility is also pushing users toward hybrid clouds.
In this meetup, Dipti and HY presented a new approach to hybrid analytical workloads using Alluxio, an open source data orchestration layer that sits between the compute and storage layers. Applications like Apache Spark or TensorFlow can then seamlessly access multiple disparate data sources with consistent performance, thanks to the data locality and abstraction that the data orchestration tier provides.
Haoyuan Li (H.Y.), Alluxio
Haoyuan is the Founder and CTO of Alluxio. He graduated with a Computer Science Ph.D. from the AMPLab at UC Berkeley. At the AMPLab, he co-created and led Alluxio (formerly Tachyon), an open source virtual distributed file system. Before UC Berkeley, he earned an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.
Dipti Borkar, Alluxio
Dipti Borkar is the VP of Product & Marketing at Alluxio, with over 15 years of experience in data and database technology across relational and non-relational systems. Prior to Alluxio, Dipti was VP of Product Marketing at Kinetica and Couchbase. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.

Real-time computation platforms are becoming increasingly important in many organizations. In this article, we describe how ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and keep those jobs consistent during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to dramatically reduce the workload pressure on our HDFS NameNode.

The Alluxio-Presto sandbox is a Docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

Here in New York, at the AWS Summit, we are super excited to announce that Alluxio 2.0 is here, our biggest release since the Alluxio launch. A couple of months ago we released the 2.0 Preview, which included some of these capabilities; 2.0 now includes even more and continues to build on our data orchestration approach for the cloud.

This article presents a different approach to making distributed file systems like HDFS, or cloud storage systems, look like a local file system to data processing frameworks: the Alluxio POSIX API. To explain the approach, we use the TensorFlow + Alluxio + AWS S3 stack as an example.
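As a minimal sketch of what "look like a local file system" can mean for application code, the snippet below assumes the remote data has already been exposed at a local mount point through the POSIX API. The mount path `/mnt/alluxio-fuse`, the `.tfrecord` suffix, and the helper names `list_data_files` and `count_bytes` are hypothetical, chosen only for illustration; the point is that the code uses nothing but ordinary file-system calls.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical mount point (an assumption, not taken from the article).
MOUNT_ROOT = "/mnt/alluxio-fuse"


def list_data_files(mount_root, suffix=".tfrecord"):
    """Return sorted paths of regular files under mount_root ending in suffix."""
    root = Path(mount_root)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.name.endswith(suffix)
    )


def count_bytes(path, chunk_size=1 << 20):
    """Read a file in chunks with the standard open() call and return its size."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Use the mount if it exists; otherwise fall back to a temporary directory
    # so the sketch still runs on a machine without any mount.
    if os.path.isdir(MOUNT_ROOT):
        root = MOUNT_ROOT
    else:
        root = tempfile.mkdtemp()
        with open(os.path.join(root, "sample.tfrecord"), "wb") as f:
            f.write(b"\x00" * 1024)
    for path in list_data_files(root):
        print(path, count_bytes(path))
```

Because the functions take an ordinary directory path, the same code runs unchanged against a local disk, which is the property a POSIX-style interface is meant to provide.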



Alluxio is a proud sponsor and exhibitor at the Presto Summit in San Francisco. If you missed the conference, don’t worry, we’ve got you covered!



Open Source Data Orchestration for AI, Big Data, and Cloud
Haoyuan Li presents at the Beijing Meetup on open source data orchestration and the value of leveraging Alluxio as rising trends drive the need for a new architecture. Four big trends are driving this need: separation of compute and storage, hybrid and multi-cloud environments, the rise of object stores, and self-service data across the enterprise.
Separation of compute and storage creates new challenges in how data is managed and orchestrated across frameworks, clouds, and storage systems. Utilizing a unified data orchestration platform simplifies your data’s cloud journey.



ALLUXIO COMMUNITY OFFICE HOUR
Kubernetes is widely used to orchestrate computation, offering improved flexibility and portability in public or hybrid cloud environments across infrastructure providers. However, running data-intensive workloads introduces challenges such as efficiently moving data to compute frameworks, accessing data from multiple or remote clouds, and co-locating data with compute.
Alluxio addresses these problems as a data orchestration layer that brings data locality, and with it improved performance, together with data accessibility for analytics workloads in Kubernetes, while enabling portability across storage providers.
In this Office Hour:
- Overview of Alluxio and the cloud use case with Spark in Kubernetes
- How to set up Alluxio and Spark to run in Kubernetes
- Open session for discussion on any topic, such as solving the separation of compute and storage problem, and more



ALLUXIO BAY AREA MEETUP
This talk was presented by Alluxio’s top contributor and PMC Maintainer Calvin Jia at the Alluxio Bay Area Meetup.
This talk shares our design, implementation, and optimization of the Alluxio metadata service to address scalability challenges. It focuses on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and the evaluation and performance tuning of RPC frameworks (Thrift vs. gRPC).

The cloud has changed the dynamics of data engineering, as well as the behavior of data engineers, in many ways. This is primarily because an on-premises data engineer dealt only with databases and some parts of the Hadoop stack. In the cloud, things are a bit different. Data engineers suddenly need to think differently and more broadly. Instead of being purely focused on data infrastructure, you are now almost a full stack engineer (leaving out the final end application, perhaps). Compute, containers, storage, data movement, performance, network: skills are increasingly needed across the broader stack. Here are some design concepts and data stack elements to keep in mind.



Over years of working in the big data and machine learning space, we have frequently heard from data engineers that the biggest obstacle to extracting value from data is being able to access the data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit or public enemy №1. There have been many attempts to do away with data silos, but those attempts have themselves resulted in yet another data silo, with data lakes being one such example. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.



As the data ecosystem becomes massively complex and more and more disaggregated, data analysts and end users have trouble adapting to and working with hybrid environments. The proliferation of compute applications, along with storage mediums, leads to a hybrid model that we are just not accustomed to. With this disaggregated system, data engineers now face a multitude of problems that they must overcome in order to get meaningful insights.

This article walks through the journey of HashData, a startup in Beijing, to build a cloud-native, high-performance MPP shared-everything architecture that leverages object storage as the data persistence layer and Alluxio as a data orchestration layer in the cloud. We will illustrate how HDW leverages Alluxio as the data orchestration layer to eliminate the performance penalty introduced by object storage while benefiting from its scalability and cost-effectiveness.

Traditionally, if you want to run a single Spark job on EMR, you might follow these steps: launch a cluster, run the job, which reads data from a storage layer like S3, perform transformations within an RDD/DataFrame/Dataset, and finally send the result back to S3. You end up having something like this. If we add more Spark jobs across multiple clusters, you could have something like this.

Discontinuity in big data infrastructure drives storage disaggregation, especially in companies experiencing dramatic data growth after pivoting to AI and analytics. This data growth challenge makes disaggregating storage from compute attractive because a company can scale its storage capacity to match its data growth, independent of compute. This decoupled model separates compute and storage, enabling users to right-size hardware for each layer. Users can buy high-end CPU and memory configurations for the compute nodes, while storage nodes can be optimized for capacity. This whitepaper is a continuation of Unlock Big Data Analytics Efficiency with Compute and Storage Disaggregation on Intel® Platforms.