Bay Area Meetup: Alluxio 2.0 Deep Dive and Near Real-time Analytics with Spark

July 23, 2019

Calvin Jia

We are excited to present Alluxio 2.0 to our community. The goals of Alluxio 2.0 are to significantly enhance data accessibility with improved APIs, to expand the supported use cases to include active workloads, and to improve metadata management and availability to support hyperscale deployments. The Alluxio 2.0 Preview Release is the first major milestone on the path to Alluxio 2.0 and includes many new features.

In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:

– New off-heap metadata storage leveraging embedded RocksDB to scale Alluxio up to a billion files;
– Improved Alluxio POSIX API to support legacy and machine-learning workloads;
– A fully contained, distributed embedded journal system based on the Raft consensus algorithm for high availability mode;
– A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross-mount move/copy, and distributed loading;
– Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
– Active file system sync between Alluxio and HDFS as under storage.

Alluxio 2.0 Preview Release Deep Dive

Video:

Presentation slides:

Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio from Alluxio, Inc.

Real-time Data Processing for Sales Attribution Analysis with Alluxio, Spark and Hive at VIPShop

Vipshop is a leading eCommerce company in China with over 15 million daily active users. Our ETL jobs primarily run against data on HDFS, which can no longer meet the growing speed and stability demands of certain real-time jobs. In this talk, I will explain how we replaced HDFS with memory and HDD storage managed by Alluxio to speed up data access for all our Sales Attribution applications running on Spark and Hive. This system has been in production for more than 2 years. As more old-fashioned ETL SQL jobs are converted into real-time jobs, caching with Alluxio has become one of the widely considered performance tuning solutions. I will share the criteria we use to select use cases that can effectively get a boost from switching to Alluxio.

Our future work includes using Alluxio as an abstraction layer for the /tmp directory in our main Hadoop clusters, and we are also considering using Alluxio to cache the hot data in our 600+ node Presto clusters.

Bio:
Wanchun Wang is the Chief Architect at VIPShop, where he has worked for over 5 years. His interests focus on processing large amounts of data, including building streaming pipelines, optimizing ETL applications, and designing in-house ML and DL platforms. He currently manages the big data teams responsible for batch, real-time, and data warehouse systems.

Video:

Acknowledgment:
Our event partner AICamp (http://www.xnextcon.com) is a global online platform where engineers and data scientists learn and practice AI, ML, DL, and data science, with 80,000+ developers and local study groups in 40+ cities around the world.

Videos:

Presentation Slides:

Bay Area Meetup: Alluxio 2.0 Deep Dive and Near Real-time Analytics with Spark from Alluxio, Inc.
