Community Office Hour: Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

December 19, 2019

Bin Fan

VP of Technology

Alluxio

ALLUXIO COMMUNITY OFFICE HOUR

While adoption of the cloud and Kubernetes has made it exceptionally easy to scale compute, the increasing spread of data across different systems and clouds has created new challenges for data engineers. Accessing data effectively from AWS S3 or on-premises HDFS becomes harder, and data locality is lost. How do you move data to compute workers efficiently? How do you unify data across multiple or remote clouds? The open source project Alluxio approaches these problems in a new way. It helps elastic compute workloads, such as Apache Spark, realize the benefits of the cloud while bringing data locality and data accessibility to workloads orchestrated by Kubernetes.

One important performance optimization in Apache Spark is to schedule tasks on nodes where a local HDFS DataNode serves the task's input data. However, more users are running Apache Spark natively on Kubernetes, where HDFS is not an option. This office hour describes the concepts and dataflow of running the Spark/Alluxio stack in Kubernetes with enhanced data locality, even when the storage service is external or remote.

In this Office Hour we’ll go over:

Why Spark is able to make a locality-aware schedule when working with Alluxio in a K8s environment using the host network
Why a pod running Alluxio can share data efficiently with a pod running Spark on the same host using a domain socket and a host path volume
The roadmap to improve this Spark / Alluxio stack in the context of K8s
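
As a rough illustration of the first point, the sketch below matches executors to cached data by comparing host names. It is not Spark's scheduler or any Alluxio API: `pick_executor`, the executor IDs, and the node names are all hypothetical, and it assumes that locality is decided by host-name equality, which is why both pods need to report the node's host name (as they would on the host network).

```python
from typing import Dict, List, Optional


def pick_executor(block_hosts: List[str], executor_hosts: Dict[str, str]) -> Optional[str]:
    """Return an executor ID whose host also holds the input block, if any.

    Hypothetical helper: a simplified stand-in for locality-aware scheduling,
    not the algorithm Spark actually uses.
    """
    wanted = set(block_hosts)
    for executor_id, host in sorted(executor_hosts.items()):
        if host in wanted:
            return executor_id  # node-local: same host as a cached copy
    # No co-located executor: fall back to any executor (a remote read).
    return min(executor_hosts) if executor_hosts else None


# Hypothetical cluster: assume Alluxio workers and Spark executors both
# report the Kubernetes node's host name.
executors = {"exec-1": "node-a", "exec-2": "node-b"}
local_choice = pick_executor(["node-b"], executors)
remote_choice = pick_executor(["node-z"], executors)
print(local_choice, remote_choice)
```

If the cache worker reported a pod-level name instead of the node name, no host would match and every read would take the fallback branch, even when the data sits on the same machine.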

Video:

Slides:

Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio from Alluxio, Inc.

Videos

AI/ML Infra Meetup | LLM Agents and Implementation Challenges

In this talk, Pritish Udgata from Adobe provides a comprehensive overview of implementation challenges and solutions for LLM agents.

Topics include:

CoT vs RAG vs Agentic AI
Anatomy of an agent
Single Agent with MCP
Multi Agents with A2A
Implementation Challenges and Solutions

August 14, 2025

Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency

Watch this on-demand video to learn about the latest release of Alluxio Enterprise AI. In this webinar, discover how Alluxio AI 3.7 eliminates cloud storage latency bottlenecks with breakthrough sub-millisecond performance, delivering up to 45× faster data access than S3 Standard without changing your code. Alluxio AI 3.7 is also packed with new features designed to supercharge your AI infrastructure while keeping your data secure. Key highlights include:

Alluxio Ultra Low Latency Caching for Cloud Storage
Role-Based Access Control (RBAC) for S3 Access
5X Faster Cache Preloading with Alluxio Distributed Cache Preloader
FUSE Non-Disruptive Upgrade
Other New Features for Alluxio Admins

August 13, 2025

Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale

Real-time OLAP databases are optimized for speed and often rely on tightly coupled storage-compute architectures using disks or SSDs. Decoupled architectures, which use cloud object storage, introduce an unavoidable tradeoff: cost efficiency at the expense of performance. This makes them unsuitable for databases that need to provide low-latency, real-time analytics, especially the new wave of LLM-powered dashboards, retrieval-augmented generation (RAG), and vector-embedding searches that thrive only when fresh data is milliseconds away. Can we achieve both cost efficiency and performance?

In this talk, we’ll explore the engineering challenges of extending Apache Pinot—a real-time OLAP system—onto cloud object storage while still maintaining sub-second P99 latencies.

We’ll dive into how we built an abstraction in Apache Pinot to make it agnostic to the location of data. We’ll explain how we can query data directly from the cloud (without needing to download the entire dataset, as with lazy-loading) while achieving sub-second latencies. We’ll cover the data fetch and optimization strategies we implemented, such as pipelining fetch and compute, prefetching, selective block fetches, index pinning, and more. We’ll also share our latest work on integrating with open table formats like Iceberg, and how we will continue to achieve fast analytics directly on Parquet files by applying the same techniques we use for tiered storage.

July 15, 2025
