On Demand Video

AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG

Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill altogether. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV caches of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly reduce prefill delay while maintaining the same generation quality.
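
The idea can be sketched, purely for illustration, with the Hugging Face transformers API: a retrieved chunk is prefilled once, its KV cache is written to disk (the "cheaper yet slower storage"), and later requests that reuse the chunk load the cache instead of recomputing prefill over it. The model name, file path, and prompts below are assumptions made for this sketch, not part of the talk; a real serving system would manage cache loading and reuse inside the inference engine and far more carefully.

```python
# Minimal sketch (assumed setup, not the speaker's implementation):
# prefill a reusable text chunk once, persist its KV cache to disk,
# and reuse it so that later requests only prefill their new suffix.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model; any causal LM follows the same pattern
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def cached_prefill(chunk: str, path: str):
    """Prefill a reusable text chunk once; afterwards load its KV cache from storage."""
    if os.path.exists(path):
        # Cache hit: pay the storage-loading cost instead of prefill compute.
        return torch.load(path, weights_only=False)
    ids = tok(chunk, return_tensors="pt").input_ids
    with torch.no_grad():
        past = model(ids, use_cache=True).past_key_values  # one-time prefill
    torch.save(past, path)  # persist on cheaper (slower) storage for later reuse
    return past

# Per-request work: prefill only the short new suffix (e.g., the user question);
# the reused chunk enters attention through its stored KV cache.
chunk = "...long retrieved document chunk shared across many requests..."
question = " Question: summarize the passage above. Answer:"
past = cached_prefill(chunk, "/tmp/chunk_kv.pt")  # hypothetical cache path
suffix_ids = tok(question, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(suffix_ids, past_key_values=past, use_cache=True).logits
next_token_id = int(logits[:, -1].argmax(-1))  # first generated token
print(tok.decode([next_token_id]))
```

In this toy version the per-request prefill cost scales with the suffix length rather than the full RAG prompt; the open question the talk addresses is making the cache-loading step fast enough that reading KV caches from slower storage still beats recomputing them.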


Video:

Presentation slides:


Speaker:


Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He received his Ph.D. from CMU in 2017 and his bachelor’s degree from Tsinghua in 2011. His research interests are networked systems and their intersections with machine learning. He has received a Google Faculty Research Award, an NSF CAREER Award, and a CMU Computer Science Doctoral Dissertation Award.