
This is a brief review of “REFRAG: Rethinking RAG‑based Decoding”.
The paper is available at this link.

Overview

REFRAG speeds up retrieval‑augmented generation with a compress‑sense‑expand scheme over retrieved passages, exploiting the block‑diagonal attention patterns common in RAG contexts. The paper reports large time‑to‑first‑token (TTFT) speedups (~31×) and supports up to 16× longer contexts without accuracy loss, while leaving the model architecture unchanged.
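
To make the block‑diagonal observation concrete, here is a minimal, hypothetical sketch (not the paper's code): tokens from different retrieved passages rarely need to attend to one another, so most of a full attention matrix over the concatenated context is wasted compute.

```python
import numpy as np

def block_diagonal_mask(chunk_lengths):
    """Toy mask where each retrieved chunk only attends within itself."""
    total = sum(chunk_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in chunk_lengths:
        mask[start:start + length, start:start + length] = True
        start += length
    return mask

# Example: 3 passages of 4 tokens each -> only 48 of 144 entries are within-chunk.
m = block_diagonal_mask([4, 4, 4])
print(int(m.sum()), "of", m.size, "attention entries are within-chunk")
```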

Key Ideas

  • Recognizes that most retrieved tokens are irrelevant to any given decoding step, and prunes that compute accordingly.
  • ‘Senses’ relevant blocks on the fly and selectively expands only the segments that are needed (a toy sketch follows this list).
  • Delivers large latency and memory wins, especially for long‑context RAG workloads.
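
As a rough illustration of the compress‑sense‑expand flow, the sketch below mean‑pools each chunk into one vector, scores chunks against the query, and re‑expands only the top‑scoring ones. All names here (refrag_style_inputs, expand_top_k, etc.) are hypothetical; in REFRAG the chunk encoder, projection, and selection policy are learned, which this crude version does not capture.

```python
import numpy as np

def refrag_style_inputs(chunk_token_embs, query_emb, expand_top_k=2):
    """Toy compress-sense-expand sketch (hypothetical names, not the paper's API).

    chunk_token_embs: list of (tokens, dim) arrays, one per retrieved chunk.
    query_emb:        (dim,) embedding of the user query.
    """
    # Compress: stand-in for the learned chunk encoder + projection;
    # here each chunk is crudely mean-pooled into a single vector.
    chunk_embs = [ch.mean(axis=0) for ch in chunk_token_embs]

    # Sense: stand-in for the learned selection policy;
    # here chunks are scored by dot product with the query.
    scores = np.array([query_emb @ emb for emb in chunk_embs])
    expand_ids = set(np.argsort(scores)[-expand_top_k:].tolist())

    # Expand: relevant chunks enter the decoder as full token sequences,
    # the rest as one compressed vector each, shrinking the effective context.
    decoder_inputs = []
    for i, ch in enumerate(chunk_token_embs):
        if i in expand_ids:
            decoder_inputs.extend(list(ch))        # full tokens
        else:
            decoder_inputs.append(chunk_embs[i])   # single compressed vector
    return decoder_inputs

# Example: 5 chunks of 16 tokens each -> 2*16 + 3 = 35 inputs instead of 80.
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(16, 8)) for _ in range(5)]
print(len(refrag_style_inputs(chunks, rng.normal(size=8))))
```

The only point of the sketch is the shape of the computation: the decoder sees a much shorter mixed sequence of raw tokens and compressed chunk vectors.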

Why it matters

Addresses a core efficiency bottleneck in production RAG systems, improving responsiveness without fine‑tuning the base model.

