Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
This paper introduces Cache-Augmented Generation (CAG) as an alternative to Retrieval-Augmented Generation (RAG) for knowledge tasks, leveraging long-context LLMs. CAG preloads the relevant resources into the LLM's extended context and precomputes the model's key-value (KV) cache, eliminating real-time retrieval. This approach significantly reduces latency, minimizes retrieval errors, and simplifies system design, while achieving comparable or superior performance when the knowledge base is constrained in size.
Article Points:
1. CAG bypasses real-time retrieval by preloading knowledge into the LLM's context.
2. CAG precomputes and caches the LLM's key-value (KV) states for inference (a minimal sketch follows this list).
3. CAG eliminates retrieval latency and minimizes retrieval errors.
4. CAG simplifies system architecture compared to RAG.
5. Experiments show CAG outperforms RAG in efficiency and accuracy on certain tasks.
6. CAG is optimal for scenarios with extensive but manageable reference contexts.
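A minimal sketch of the preloading and KV-cache precomputation step, assuming a Hugging Face transformers causal LM; the model name, prompt wording, and the `docs` input are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: preload a knowledge base and precompute the model's KV cache once,
# so later queries can reuse it. Model name and prompt wording are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any long-context causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_kv_cache(docs: list[str]) -> tuple[DynamicCache, int]:
    """Run one forward pass over the concatenated documents so their
    key/value states can be reused for every subsequent query."""
    knowledge = "\n\n".join(docs)
    prompt = f"Use the following documents to answer questions.\n\n{knowledge}\n\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    cache = DynamicCache()
    with torch.no_grad():
        model(**inputs, past_key_values=cache, use_cache=True)
    # Remember how many tokens belong to the preloaded knowledge so the
    # cache can later be truncated back to exactly this length.
    return cache, inputs["input_ids"].shape[1]
```

The cache is built once per knowledge base, so its cost is amortized over all subsequent queries.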
Challenges of RAG
- Retrieval latency
- Document selection errors
- Increased system complexity

CAG Solution
- Preload all relevant resources
- Precompute the KV cache
- Utilize long-context LLMs

Methodology
- External knowledge preloading
- Inference with the cached context (sketched below)
- Efficient cache reset (sketched below)

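Continuing the sketch above, inference reuses the precomputed cache, and the reset step simply truncates whatever the model appended while answering; the greedy decoding loop and the `DynamicCache.crop` call are one plausible implementation, not necessarily the paper's exact code:

```python
# Sketch: answer a query on top of the preloaded cache, then reset the cache
# so the next query starts from the same preloaded knowledge. Greedy decoding
# and cache.crop() are assumptions about one reasonable implementation.
import torch

@torch.no_grad()
def answer_with_cache(query: str, cache, knowledge_len: int,
                      max_new_tokens: int = 128) -> str:
    prompt = f"Question: {query}\nAnswer:"
    input_ids = tokenizer(prompt, return_tensors="pt",
                          add_special_tokens=False).input_ids.to(model.device)
    generated = []
    for _ in range(max_new_tokens):
        out = model(input_ids=input_ids, past_key_values=cache, use_cache=True)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        input_ids = next_id  # only the new token is fed; history lives in the cache
    # Efficient cache reset: drop the tokens appended during this answer instead
    # of recomputing the whole knowledge cache for the next query.
    cache.crop(knowledge_len)
    return tokenizer.decode(generated, skip_special_tokens=True)
```

Typical usage would be `cache, n = build_kv_cache(docs)` followed by repeated `answer_with_cache(question, cache, n)` calls, one per question.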
Advantages of CAG
- Reduced inference time
- Unified context understanding
- Simplified architecture

Experimental Results
- Outperforms RAG in BERTScore (metric computation sketched below)
- Faster generation time
- Effective for manageable knowledge bases

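For reference, BERTScore comparisons of this kind are typically computed with the bert-score package; the answer strings below are placeholders, not the paper's evaluation data:

```python
# Sketch: comparing CAG and RAG outputs against gold answers with BERTScore.
# The strings are placeholder examples, not the paper's data.
from bert_score import score

references  = ["Paris is the capital of France."]   # gold answers
cag_answers = ["The capital of France is Paris."]   # outputs from the CAG pipeline
rag_answers = ["France's capital city is Paris."]   # outputs from the RAG pipeline

for name, candidates in [("CAG", cag_answers), ("RAG", rag_answers)]:
    precision, recall, f1 = score(candidates, references, lang="en")
    print(f"{name}: BERTScore F1 = {f1.mean().item():.4f}")
```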
Future Outlook
- Hybrid approaches possible (a toy decision rule is sketched below)
- Leverage evolving long-context LLM capabilities
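One way such a hybrid could look in practice, sketched with a hypothetical token budget and a toy lexical retriever standing in for a real one; `CONTEXT_BUDGET`, `retrieve_top_k`, and `select_context` are illustrative names, and `tokenizer` is reused from the preloading sketch above:

```python
# Sketch of a possible hybrid strategy: preload everything when the knowledge
# base fits a context budget, otherwise fall back to retrieving top passages.
# CONTEXT_BUDGET and retrieve_top_k are hypothetical stand-ins.

CONTEXT_BUDGET = 100_000  # tokens reserved for preloaded knowledge (assumed)

def retrieve_top_k(docs: list[str], query: str, k: int = 5) -> list[str]:
    """Toy lexical scorer standing in for a real retriever."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def select_context(docs: list[str], query: str) -> list[str]:
    total_tokens = sum(len(tokenizer.encode(d)) for d in docs)
    if total_tokens <= CONTEXT_BUDGET:
        return docs                        # CAG path: preload all, reuse the KV cache
    return retrieve_top_k(docs, query)     # RAG path: pass only the most relevant passages
```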