
This is a brief review of “REFRAG: Rethinking RAG‑based Decoding”.
The paper is available at this link.

Overview

REFRAG speeds up retrieval‑augmented generation with a compress‑sense‑expand scheme over retrieved passages, exploiting the block‑diagonal attention patterns common in RAG contexts. The paper reports large time‑to‑first‑token (TTFT) speedups (~31×) and supports up to 16× longer contexts without accuracy loss, while leaving the model architecture unchanged.
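
To make the block‑diagonal observation concrete, here is a minimal, hypothetical sketch (not the paper's code): tokens from different retrieved passages rarely need to attend to one another, so most of a full attention matrix over the concatenated context is wasted compute.

```python
import numpy as np

def block_diagonal_mask(chunk_lengths):
    """Toy mask where each retrieved chunk only attends within itself."""
    total = sum(chunk_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in chunk_lengths:
        mask[start:start + length, start:start + length] = True
        start += length
    return mask

# Example: 3 passages of 4 tokens each -> only 48 of 144 entries are within-chunk.
m = block_diagonal_mask([4, 4, 4])
print(int(m.sum()), "of", m.size, "attention entries are within-chunk")
```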

Key Ideas

  • Recognizes that most retrieved tokens are irrelevant to any given decoding step, and prunes that compute accordingly.
  • ‘Senses’ relevant blocks on the fly and selectively expands only the segments that are needed (a toy sketch follows this list).
  • Delivers large latency and memory wins, especially for long‑context RAG workloads.
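
As a rough illustration of the compress‑sense‑expand flow, the sketch below mean‑pools each chunk into one vector, scores chunks against the query, and re‑expands only the top‑scoring ones. All names here (refrag_style_inputs, expand_top_k, etc.) are hypothetical; in REFRAG the chunk encoder, projection, and selection policy are learned, which this crude version does not capture.

```python
import numpy as np

def refrag_style_inputs(chunk_token_embs, query_emb, expand_top_k=2):
    """Toy compress-sense-expand sketch (hypothetical names, not the paper's API).

    chunk_token_embs: list of (tokens, dim) arrays, one per retrieved chunk.
    query_emb:        (dim,) embedding of the user query.
    """
    # Compress: stand-in for the learned chunk encoder + projection;
    # here each chunk is crudely mean-pooled into a single vector.
    chunk_embs = [ch.mean(axis=0) for ch in chunk_token_embs]

    # Sense: stand-in for the learned selection policy;
    # here chunks are scored by dot product with the query.
    scores = np.array([query_emb @ emb for emb in chunk_embs])
    expand_ids = set(np.argsort(scores)[-expand_top_k:].tolist())

    # Expand: relevant chunks enter the decoder as full token sequences,
    # the rest as one compressed vector each, shrinking the effective context.
    decoder_inputs = []
    for i, ch in enumerate(chunk_token_embs):
        if i in expand_ids:
            decoder_inputs.extend(list(ch))        # full tokens
        else:
            decoder_inputs.append(chunk_embs[i])   # single compressed vector
    return decoder_inputs

# Example: 5 chunks of 16 tokens each -> 2*16 + 3 = 35 inputs instead of 80.
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(16, 8)) for _ in range(5)]
print(len(refrag_style_inputs(chunks, rng.normal(size=8))))
```

The only point of the sketch is the shape of the computation: the decoder sees a much shorter mixed sequence of raw tokens and compressed chunk vectors.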

Why it matters

Addresses a core efficiency bottleneck in production RAG systems, improving responsiveness without fine‑tuning the base model.

