Resource Hub

Presentation

Building fast and scalable big data and ML platforms at Pinterest and JD.com

ALLUXIO BAY AREA MEETUP

This talk was presented by Alluxio’s top contributor and PMC maintainer Calvin Jia at the Alluxio Bay Area Meetup.

This talk shares our design, implementation, and optimization of the Alluxio metadata service to address scalability challenges. It focuses on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained locking scheme for the file system inode tree, an embedded replicated state machine (based on Raft), and the evaluation and performance tuning of RPC frameworks (Thrift vs. gRPC).

Blog

Effective Data Engineering in the Cloud World

The cloud has changed the dynamics of data engineering, as well as the behavior of data engineers, in many ways. This is primarily because a data engineer on premises only dealt with databases and some parts of the Hadoop stack. In the cloud, things are a bit different. Data engineers suddenly need to think differently and more broadly. Instead of being purely focused on data infrastructure, you are now almost a full stack engineer (leaving out the final end application, perhaps). Skills in compute, containers, storage, data movement, performance, and networking are increasingly needed across the broader stack. Here are some design concepts and data stack elements to keep in mind.

Blog

Embracing Data Silos: the journey through a fragmented data world

Over the years of working in the big data and machine learning space, we frequently hear from data engineers that the biggest obstacle to extracting value from data is being able to access the data efficiently. Data silos, isolated islands of data, are often viewed by data engineers as the key culprit or public enemy №1. There have been many attempts to do away with data silos, but those attempts themselves have resulted in yet another data silo, with data lakes being one such example. Rather than attempting to eliminate data silos, we believe the right approach is to embrace them.

Blog

Hybrid Environments for Data Analytics is a Possibility

As the data ecosystem becomes massively complex and more and more disaggregated, data analysts and end users have trouble adapting to and working with hybrid environments. The proliferation of compute applications along with storage mediums leads to a hybrid model that we are just not accustomed to. With this disaggregated system, data engineers now face a multitude of problems that they must overcome to get meaningful insights.

Blog

Starburst Presto and Alluxio announce strategic OEM partnership

Announcing the OEM partnership between Alluxio and Starburst Data, the company behind Presto, the fastest-growing SQL query engine in a disaggregated world.

Blog

Building a cloud-native analytics MPP database with Alluxio

This article walks through the journey of HashData, a startup in Beijing, to build a cloud-native, high-performance MPP shared-everything architecture that uses object storage as the data persistence layer and Alluxio as the data orchestration layer in the cloud. We will illustrate how HDW leverages Alluxio as the data orchestration layer to eliminate the performance penalty introduced by object storage while benefiting from its scalability and cost-effectiveness.

Blog

Alluxio on EMR: Fast Storage Access and Sharing for Spark Jobs

Traditionally, if you want to run a single Spark job on EMR, you might follow these steps: launch a cluster, run the job, which reads data from a storage layer like S3, perform transformations within an RDD/DataFrame/Dataset, and finally send the result back to S3. You end up having something like this. If we add more Spark jobs across multiple clusters, you could have something like this.

Blog

Speeding Big Data Analytics on the Cloud with In-Memory Data Accelerator

Discontinuity in big data infrastructure drives storage disaggregation, especially in companies experiencing dramatic data growth after pivoting to AI and analytics. This data growth challenge makes disaggregating storage from compute attractive because a company can scale its storage capacity to match its data growth, independent of compute. This decoupled mode allows the separation of compute and storage, enabling users to rightsize hardware for each layer. Users can buy high-end CPU and memory configurations for the compute nodes, and storage nodes can be optimized for capacity. This whitepaper is a continuation of Unlock Big Data Analytics Efficiency with Compute and Storage Disaggregation on Intel® Platforms.

Blog

Distributed Data Querying with Alluxio

This is a guest blog by Jowanza Joseph (see the original blog source). It is about how he used Alluxio to reduce p99 and p50 query latencies and optimize the overall platform costs for a distributed querying application. Jowanza walks through the product and architecture decisions that led to the final architecture, discusses the tradeoffs, shares some statistics on the improvements, and discusses future improvements to the system.

Presentation

Decoupling Compute and Storage for Data Workloads

Carlos Queiroz of DBS presents how to decouple compute and storage for data workloads using Alluxio. DBS chose to decouple compute and storage because the coupled setup was too hard to scale, was not flexible, and carried associated costs.

Blog

Scalable Metadata Service in Alluxio: Storing Billions of Files

Alluxio provides a unified namespace where you can mount multiple different storage systems and access them through the same API. To serve the file system requests to operate on all the files and directories in this namespace, Alluxio masters must handle the file system metadata at a scale of all mounted systems combined. We are writing several engineering blogs describing the design and implementation of Alluxio master to address this scalability challenge. This is the first article focusing on metadata storage and service, particularly how to use RocksDB as an embedded persistent key-value store to encode and store the file system inode tree with high performance.
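As a rough illustration of how an inode tree can be flattened into a key-value store, here is a minimal sketch. It is not Alluxio's actual encoding: the key layout (`i/<id>` for inode records, `e/<parent id>/<name>` for parent-to-child edges), the `InodeStore` class, and its method names are all hypothetical, and a plain Python dict stands in for RocksDB.

```python
import json
from typing import Dict, List, Optional

ROOT_ID = 0  # hypothetical id reserved for the root directory


class InodeStore:
    """Toy inode tree stored as flat key-value pairs (a dict stands in for RocksDB)."""

    def __init__(self) -> None:
        self._kv: Dict[bytes, bytes] = {}
        self._next_id = ROOT_ID + 1
        root = {"name": "", "is_dir": True}
        self._kv[self._inode_key(ROOT_ID)] = json.dumps(root).encode("utf-8")

    @staticmethod
    def _inode_key(inode_id: int) -> bytes:
        # Inode record: fixed-width id -> serialized metadata.
        return b"i/" + inode_id.to_bytes(8, "big")

    @staticmethod
    def _edge_prefix(parent_id: int) -> bytes:
        return b"e/" + parent_id.to_bytes(8, "big") + b"/"

    def _edge_key(self, parent_id: int, name: str) -> bytes:
        # Edge record: (parent id, child name) -> child id.
        return self._edge_prefix(parent_id) + name.encode("utf-8")

    def create(self, parent_id: int, name: str, is_dir: bool) -> int:
        """Add a child under parent_id and return the new inode id."""
        edge = self._edge_key(parent_id, name)
        if edge in self._kv:
            raise FileExistsError(name)
        inode_id = self._next_id
        self._next_id += 1
        meta = {"name": name, "is_dir": is_dir}
        self._kv[self._inode_key(inode_id)] = json.dumps(meta).encode("utf-8")
        self._kv[edge] = inode_id.to_bytes(8, "big")
        return inode_id

    def lookup(self, path: str) -> Optional[int]:
        """Resolve an absolute path by following one edge per path component."""
        current = ROOT_ID
        for name in (part for part in path.split("/") if part):
            raw = self._kv.get(self._edge_key(current, name))
            if raw is None:
                return None
            current = int.from_bytes(raw, "big")
        return current

    def list_children(self, parent_id: int) -> List[str]:
        # A real KV store would do a prefix scan; here we filter the dict keys.
        prefix = self._edge_prefix(parent_id)
        return sorted(k[len(prefix):].decode("utf-8")
                      for k in self._kv if k.startswith(prefix))


store = InodeStore()
data_dir = store.create(ROOT_ID, "data", is_dir=True)
part_file = store.create(data_dir, "part-0", is_dir=False)
```

In this layout, resolving a path costs one point lookup per path component, and listing a directory becomes a scan over keys that share the parent's edge prefix.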

Blog

Data Orchestration: The Missing Piece in the Data World

At Alluxio, we believe that in order to fundamentally solve the data access challenges, the world needs a new layer - a data orchestration platform - between computation frameworks and storage systems.

Blog

Welcome to Alluxio.io

Notice anything new about our websites? That’s right - we are super excited to launch our new website - Alluxio.io! As we continue our focus on our open source community, one important item on our mind was to rebuild our website to provide better user experience for our community. To that end, you’ll see lots of changes in the Alluxio web experience.

Blog

Recap: Spark+AI Summit 2019

Alluxio is a proud sponsor and exhibitor of Spark+AI Summit in San Francisco. What’s Spark+AI Summit? It’s the world’s largest conference focused on Apache Spark, Alluxio’s older cousin and an open source project from the same lab (UC Berkeley’s AMPLab, now RISELab).

Blog

Two Ways to Keep Files in Sync Between Alluxio and HDFS

Alluxio provides a distributed data access layer for applications like Spark or Presto to access different underlying file systems (UFS) through a single API in a unified file system namespace. If users only interact with the files in the UFS through Alluxio, Alluxio knows about every change the client makes to the UFS and keeps the Alluxio namespace in sync with the UFS namespace.

Blog

Moving From Apache Thrift to gRPC: A Perspective From Alluxio

As part of the Alluxio 2.0 release, we have moved our RPC framework from Apache Thrift to gRPC. In this article, we will talk about the reasons behind this change as well as some lessons we learned along the way. In Alluxio 1.x, the RPC communication between clients and servers is built mostly on top of Apache Thrift. Thrift enabled us to define the Alluxio service interfaces in simple IDL files and implement client bindings using native Java interfaces generated by the Thrift compiler. However, we faced several challenges as we continued developing new features and improvements for Alluxio.

Blog

Two Sigma Meetup Recap: Achieving Compute and Storage Independence for Data-driven Workloads

In this meetup, Bin Fan from Alluxio and Wenbo Zhao from Two Sigma co-presented a reference stack (running Alluxio as a data access layer for Apache Spark) that can enable independent and separated compute and storage for big data and machine learning workloads. Two Sigma’s use case is a great example of the benefits of this reference stack for bursting machine learning computation to the public cloud while still being able to access data stored on premises efficiently. Their data scientists want to leverage the public cloud as a scalable and elastic computation resource to speed up the end-to-end model training process. By using Alluxio as the data access layer co-located with compute in the cloud, their researchers achieved 10x faster end-to-end processing, which enables them to perform more iterations on their models.

Presentation

Achieving compute and storage independence for data-driven workloads

TWO SIGMA OPEN SOURCE MEETUP

TSOS meetups focus on the open source projects that Two Sigma cares most about, from projects we generated in-house then open sourced to large external open source projects that we depend on to do our work. This time, Wenbo Zhao (Two Sigma) and Bin Fan (Alluxio) will be presenting on how Two Sigma uses Alluxio to make data-intensive compute independent of the storage beneath.

The rise of computation-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage. However, while this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, how do you unify data across multiple remote data centers, how do you keep storage and I/O egress costs down, and many more.

In this meetup, Wenbo and Bin will present a new approach to making data-intensive compute independent of the storage beneath using Alluxio, an open-source distributed file system that sits between the compute and storage layers. Applications like Apache Spark or TensorFlow can then seamlessly access multiple disparate data sources with consistent performance without knowing where the data is actually persisted.

Wenbo will present why Two Sigma needed to disaggregate compute and storage and how they decided to adopt the Spark + Alluxio + HDFS architecture.

Bin will present a deep dive into the Alluxio open source project, a distributed file system, including the reference architecture, the data and metadata paths used to serve compute requests from remote under stores, and the compute APIs supported for accessing data from Alluxio.

Blog

China Unicom Uses Alluxio and Spark to Build New Computing Platform to Serve Mobile Users

China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amounts of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements.

Blog

Store 1 Billion Files in Alluxio 2.0

In Alluxio 1.x, the namespace was limited to around 200 million files in practice. Scaling further would cause garbage collection issues due to the limit of the Alluxio master JVM heap size. Also, storing 200 million files would require a large memory footprint (around 200GB) of JVM heap. To scale the Alluxio namespace in 2.0, we added support for storing part of the namespace on disk in RocksDB. Recently-accessed data is stored in memory, while older data ends up on disk. This reduces the memory requirements for serving the Alluxio namespace, and also takes pressure off of the Java garbage collector by reducing the number of objects it needs to deal with.
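The figures above imply roughly 200 GB / 200 million ≈ 1 KB of heap per file, so keeping 1 billion files entirely on heap would need on the order of 1 TB. The two-tier idea can be sketched as follows. This is an illustrative toy, not Alluxio's implementation: the `TieredMetadataStore` class and its method names are hypothetical, the eviction policy is assumed to be least-recently-used, and a plain dict stands in for the on-disk RocksDB tier.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


class TieredMetadataStore:
    """Toy two-tier store: a bounded in-memory LRU tier over a 'disk' tier."""

    def __init__(self, memory_capacity: int) -> None:
        if memory_capacity < 1:
            raise ValueError("memory_capacity must be at least 1")
        self._capacity = memory_capacity
        self._hot: "OrderedDict[str, dict]" = OrderedDict()  # recently accessed entries
        self._cold: Dict[str, dict] = {}  # stand-in for the on-disk RocksDB tier

    def put(self, path: str, meta: dict) -> None:
        # New or updated entries always land in memory first.
        self._cold.pop(path, None)
        self._hot[path] = meta
        self._hot.move_to_end(path)
        self._evict()

    def get(self, path: str) -> Optional[dict]:
        if path in self._hot:
            self._hot.move_to_end(path)  # mark as most recently used
            return self._hot[path]
        meta = self._cold.pop(path, None)
        if meta is not None:
            # Promote an entry read from the cold tier back into memory.
            self._hot[path] = meta
            self._evict()
        return meta

    def _evict(self) -> None:
        # Spill the least recently used entries once memory is over capacity.
        while len(self._hot) > self._capacity:
            old_path, old_meta = self._hot.popitem(last=False)
            self._cold[old_path] = old_meta

    def in_memory(self) -> List[str]:
        return list(self._hot)


store = TieredMetadataStore(memory_capacity=2)
for name in ("/a", "/b", "/c"):
    store.put(name, {"path": name})
```

After the three `put` calls, `/a` has been spilled to the cold tier; reading it again promotes it back into memory and pushes out `/b`, the least recently used entry at that point.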

Blog

Unified Data Access In Virtual Reality

In a recent blog, we discussed the ideation, design and new features in Alluxio 2.0 preview. Today we are thrilled to announce another new revolutionary project that the Alluxio engineering team has been hard at work on for the past year - the Alluxio Virtual Reality (VR) client.

On Demand Videos

Tech Talk: Introduction to Alluxio 2.0 Preview – Simplifying data access for cloud workloads

Blog

Founder Blog: Alluxio Chapter 2.0

In the early 2000s, big data was born, and technology companies were racing to create the next-gen compute frameworks or storage systems geared towards the requirements brought about by big data. By the time I was a first-year Ph.D. student at UC Berkeley’s AMPLab in 2011, numerous advances in big data related technologies such as Apache Spark were emerging. Through working on Apache Spark and getting exposed to cutting-edge technologies, it became clear that sharing data among data-driven applications with different compute frameworks and moving data across storage systems would become the bottleneck for any organization that wants to extract value from its data. To solve these challenges, I created Alluxio (formerly Tachyon), which, for lack of a defined category, I called a virtualized distributed file system in my original thesis.

Blog

Getting Started with Spark Caching using Alluxio in 5 Minutes

Apache Spark has brought significant innovation to Big Data computing, but its results are even more extraordinary when paired with Alluxio. Alluxio provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. Bazaarvoice uses the combination of Spark and Alluxio to provide a real-time big data platform that can not only handle the intake of 1.5 billion page views during peak events like Black Friday, but also provide real-time analytics against it. At this scale, the gain in speed is an enabler for new workloads. We’ve established a clean and simple way to integrate Alluxio and Spark.
