Data Infra Meetup

The community event for developers building data infrastructure at scale

Thursday Jan 25, 2024 | Uber’s Sunnyvale Office & Virtual

SPEAKERS

Jing Zhao @Uber

Principal Engineer

Juncheng Yang

@Carnegie Mellon University

5th-year CS Ph.D.

Shengxuan Liu

@ByteDance

Software Engineer

Hojin Park

@Carnegie Mellon University

5th-year CS Ph.D.

Bin Fan

@Alluxio

Chief Architect & VP of Open Source

Jingwen Ouyang

@Alluxio

Product Manager

Chunxu Tang

@Alluxio

Research Scientist

Siyuan Sheng

@Alluxio

Sr Software Engineer

Hope Wang

@Alluxio

Developer Advocate

Tarik Bennett

@Alluxio

Sr Solutions Engineer

SCHEDULE-AT-A-GLANCE


Times are listed in Pacific Standard Time (PST). The agenda is subject to change.

3:30pm – 4:00pm Registration & Networking

Uber operates one of the largest data lakes in the industry, storing exabytes of data. In this talk, we will trace the evolution of our data storage architecture and delve into several key initiatives from the past few years.
Specifically, we will introduce:
1) the scalability challenges of our on-prem HDFS clusters and how we solved them;
2) the efficiency optimizations that significantly reduced storage overhead and unit cost without compromising reliability or performance; and
3) the challenges we face in the ongoing cloud migration, and our solutions.
Speakers:
Jing Zhao is a Principal Engineer on the Data team at Uber. He is a committer and PMC member of Apache Hadoop and Apache Ratis.

In this session, Jingwen will present an overview of using Alluxio Edge caching to accelerate Trino and Presto queries, offering practical best practices for using distributed caching with compute engines along with insights from real-world deployments.
Speakers:
Jingwen is a Product Manager at Alluxio with over 10 years of diverse data experience. Previously, she worked as a Data Engineer at Meta and SanDisk. Jingwen received her BS and MS in EECS from MIT. She’s also a proud mom of her 2-year-old border collie, a certified snowboard instructor, and has a strong passion for basketball.

Shengxuan Liu from ByteDance will present ByteDance’s new native Parquet reader. The talk covers the reader’s architecture and key features, and how it improves data processing efficiency.
Speakers:
Shengxuan Liu is a software engineer at ByteDance focusing on big data OLAP processing engines. He works closely with the PrestoDB, Alluxio, and Velox communities. Prior to ByteDance, he was a software engineer at Oracle. He received his Master’s in Computer Science from Rensselaer Polytechnic Institute.

As a cache eviction algorithm, FIFO has many attractive properties: simplicity, speed, scalability, and flash-friendliness. The most prominent criticism of FIFO is its low efficiency (high miss ratio). In this talk, I will describe a simple, scalable FIFO-based algorithm with three static queues (S3-FIFO). Evaluated on 6594 cache traces from 14 datasets, we show that S3-FIFO has lower miss ratios than state-of-the-art algorithms across traces. Moreover, S3-FIFO’s efficiency is robust: it has the lowest mean miss ratio on 10 of the 14 datasets. FIFO queues enable S3-FIFO to achieve good scalability, with 6× higher throughput than optimized LRU at 16 threads. Our insight is that most objects in skewed workloads will only be accessed once in a short window, so it is critical to evict them early (also called quick demotion). The key to S3-FIFO is a small FIFO queue that filters out most objects from entering the main cache, which provides a guaranteed demotion speed and high demotion precision.
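
For readers who want to experiment with the idea, here is a minimal Python sketch of the structure the abstract describes: a small FIFO queue that filters one-hit objects, a main FIFO queue, and a ghost queue of recently evicted keys. The queue sizes, frequency cap, and promotion details are illustrative simplifications, not the authors’ reference implementation.

    from collections import deque

    class S3FIFO:
        """Illustrative S3-FIFO: a small FIFO filters one-hit objects,
        a main FIFO holds re-accessed objects, and a ghost queue
        remembers recently evicted keys (metadata only)."""

        def __init__(self, size, small_ratio=0.1):
            self.small_size = max(1, int(size * small_ratio))
            self.main_size = max(1, size - self.small_size)
            self.small = deque()                        # FIFO insertion order
            self.main = deque()
            self.ghost = deque(maxlen=self.main_size)   # evicted keys only
            self.freq = {}                              # capped access counts

        def get(self, key):
            if key in self.freq:                        # cache hit
                self.freq[key] = min(self.freq[key] + 1, 3)
                return True
            if key in self.ghost:                       # seen recently: main
                self.ghost.remove(key)
                self._make_room_in_main()
                self.main.append(key)
            else:                                       # new object: small
                self._make_room_in_small()
                self.small.append(key)
            self.freq[key] = 0
            return False

        def _make_room_in_small(self):
            while len(self.small) >= self.small_size:
                k = self.small.popleft()
                if self.freq[k] > 0:                    # re-accessed: promote
                    self._make_room_in_main()
                    self.main.append(k)
                    self.freq[k] = 0
                else:                                   # quick demotion
                    del self.freq[k]
                    self.ghost.append(k)

        def _make_room_in_main(self):
            while len(self.main) >= self.main_size:
                k = self.main.popleft()
                if self.freq[k] > 0:                    # reinsert, age down
                    self.freq[k] -= 1
                    self.main.append(k)
                else:
                    del self.freq[k]                    # evict for good

Calling get(key) returns False on a miss and inserts the key; one-hit objects cycle through the small queue into the ghost queue without ever polluting the main cache, which is the quick-demotion effect the talk highlights.
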
Speakers:
Juncheng Yang (https://junchengyang.com) is a 5th-year Ph.D. student in the Computer Science Department at Carnegie Mellon University. His research interests broadly cover efficiency, performance, and sustainability of large-scale data systems, with a current focus on caching systems. Juncheng’s works have received best paper awards at NSDI’21, SOSP’21, and SYSTOR’16. His OSDI’20 paper was recognized as one of the best storage papers at the conference, and was invited to ACM TOS’22. One of the caching systems that he designed and built (Segcache, NSDI’21) has been adopted for production deployment at Twitter and Momento. Juncheng has received a Facebook Ph.D. Fellowship 2020-22, was recognized as a Rising Star in machine learning and systems in 2023, and a Google Cloud Research Innovator in 2023. He was an invited speaker at SNIA SDC 2020 and QConSF 2022.

In this session, cloud optimization specialists Chunxu and Siyuan will break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, keeping GPUs running at peak utilization. The integrated PyTorch/Ray + Alluxio + S3 stack offers a promising way forward, and the speakers will delve into its practical applications. Attendees will gain both theoretical insights and hands-on instruction, with demonstrations of deploying this architecture in Kubernetes for TensorFlow/PyTorch/Ray workloads in the public cloud.
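
As a flavor of what the training-side integration can look like, here is a minimal PyTorch sketch that reads samples through a cache mount point. The path /mnt/alluxio/dataset is a hypothetical Alluxio FUSE mount of the S3 bucket, and the Dataset below is illustrative rather than the speakers’ exact setup.

    import os
    from torch.utils.data import Dataset, DataLoader

    # Hypothetical path where an Alluxio FUSE mount exposes the S3 bucket.
    ALLUXIO_MOUNT = "/mnt/alluxio/dataset"

    class MountedFileDataset(Dataset):
        """Reads samples through the cache mount, so epochs after the
        first are served locally instead of via S3 round trips."""

        def __init__(self, root):
            self.paths = sorted(
                os.path.join(root, name) for name in os.listdir(root)
            )

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            with open(self.paths[idx], "rb") as f:
                return f.read()        # decode/transform as appropriate

    loader = DataLoader(MountedFileDataset(ALLUXIO_MOUNT),
                        batch_size=32, num_workers=8)

Because repeated epochs are served from the cache rather than S3, the data loader stops being the bottleneck that leaves GPUs idle, which is exactly the I/O problem the session targets.
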
Speakers:
Dr. Chunxu Tang is a Research Scientist at Alluxio and a committer of PrestoDB, working on distributed data systems for interactive data analytics and machine learning workloads. Prior to Alluxio, he served as a Senior Software Engineer on Twitter’s data platform and machine learning infrastructure teams. He received his Ph.D. from Syracuse University, where he conducted research on distributed collaboration systems and machine learning applications.

Siyuan Sheng is a Senior Software Engineer at Alluxio. Previously, he worked as a software engineer on Rubrik’s AppFlows team. Siyuan received his MS in Computer Science from CMU. He also loves snowboarding in his spare time.

The increasing demand for multi-cloud and multi-region data access brings forth challenges related to high data transfer costs and latency. In response, we introduce Macaron, an auto-configuring cache system designed to minimize cost for remote data access. A key insight behind Macaron is that cloud cache sizes are tied to cost limitations, not hardware limits, shifting the way we have been thinking about cache design and eviction policies. Macaron dynamically configures cache size and storage type mix, adapting to workload changes and often utilizing object storage as a cost-efficient option for most cache contents. We demonstrate that Macaron can reduce multi-cloud workload costs by 92% and multi-region costs by 88%, mainly by reducing outgoing data transfer.
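
To make the cost-driven sizing concrete, here is a small illustrative calculation in Python. All prices, the workload size, and the hit-ratio curve are invented for the example, not Macaron’s actual model: the point is that the cheapest cache size balances cache storage cost against the egress cost of misses.

    # Made-up prices and hit-ratio curve, purely to illustrate the
    # trade-off: cache storage cost vs. egress cost of cache misses.
    STORAGE_PER_GB_MONTH = 0.023   # hypothetical cache-tier storage price
    EGRESS_PER_GB = 0.09           # hypothetical cross-cloud transfer price

    def monthly_cost(cache_gb, workload_gb, hit_ratio):
        """Total cost of serving workload_gb of remote reads per month."""
        miss_gb = workload_gb * (1.0 - hit_ratio)
        return cache_gb * STORAGE_PER_GB_MONTH + miss_gb * EGRESS_PER_GB

    # A real hit-ratio curve would be measured from the workload; this one
    # is invented. The cheapest point determines the cache size.
    hit_curve = {0: 0.0, 100: 0.55, 500: 0.80, 1000: 0.90, 2000: 0.95}
    workload_gb = 10_000
    best = min(hit_curve,
               key=lambda s: monthly_cost(s, workload_gb, hit_curve[s]))
    print(f"best cache size: {best} GB, cost: "
          f"${monthly_cost(best, workload_gb, hit_curve[best]):.2f}/month")

Because the optimum depends on prices and the workload’s hit-ratio curve rather than on hardware limits, the cache must be re-sized as the workload changes, which is the dynamic configuration Macaron automates.
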
Speakers:
Hojin Park is a 5th-year PhD student in the Computer Science Department at Carnegie Mellon University, co-advised by Greg Ganger and George Amvrosiadis. His research focuses on auto-provisioning resources from public clouds with the aim of achieving cost-efficiency and high performance.

6:45pm – 8:00pm Happy Hour! Food and drinks on us!