Lost in the Middle: How Language Models Use Long Contexts
This paper investigates how language models use long input contexts on tasks such as multi-document question answering and key-value retrieval. Performance is highest when relevant information appears at the beginning or end of the input context and degrades significantly when it sits in the middle, producing a U-shaped curve (primacy and recency bias). Current LMs therefore do not robustly use information across long inputs, even when they are explicitly designed for extended contexts.
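A quick way to make the finding concrete: hold the set of retrieved documents fixed and move only the document containing the answer. A minimal sketch, where `answer_with_context` is a hypothetical stand-in for a call to the model under test:

```python
def position_sweep(question, gold_doc, distractor_docs, answer_with_context):
    """Ask the same question while only the gold document's position varies.

    `answer_with_context(question, documents)` is a hypothetical stand-in
    for a call to the model being evaluated, not a real library function.
    """
    results = {}
    for position in range(len(distractor_docs) + 1):
        documents = list(distractor_docs)      # fixed distractor set
        documents.insert(position, gold_doc)   # move only the gold document
        results[position] = answer_with_context(question, documents)
    return results
```

Aggregating accuracy by position over many questions yields the U-shaped curve described above.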
Article Points:
1. LM performance degrades when relevant info is in the middle of long contexts.
2. Models show a U-shaped performance curve: best at start/end, worst in middle.
3. Extended-context models don't necessarily use long contexts more effectively.
4. Encoder-decoder models are more robust within their training sequence length.
5. Query-aware contextualization improves key-value retrieval, but less so for QA (the key-value task is sketched after this list).
6. Instruction fine-tuning doesn't cause the U-shape, but can slightly mitigate bias.
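For reference, the key-value retrieval task mentioned in point 5 is a synthetic probe: the model sees a JSON object of random UUID pairs and must return the value for a given key. A minimal sketch, assuming the paper's general setup but not its exact prompt wording:

```python
import json
import random
import uuid

def make_kv_prompt(num_pairs: int, gold_index: int, seed: int = 0):
    """Build a synthetic key-value retrieval prompt.

    Keys and values are random UUIDs, as in the paper's synthetic task;
    the prompt wording below is an approximation of the paper's template.
    """
    rng = random.Random(seed)
    pairs = [(str(uuid.UUID(int=rng.getrandbits(128))),
              str(uuid.UUID(int=rng.getrandbits(128))))
             for _ in range(num_pairs)]
    gold_key, gold_value = pairs[gold_index]   # the pair the model must find
    data = json.dumps(dict(pairs), indent=1)   # dicts preserve insertion order
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"JSON data:\n{data}\n\n"
        f'Key: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_value
```

Sweeping `gold_index` from 0 to `num_pairs - 1` while holding `num_pairs` fixed reproduces the position sweep on this task.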
Performance Degradation
- Significant drop when relevant information is in the middle
- Performance is not robust to the position of relevant information

Positional Biases
- U-shaped performance curve
- Primacy bias (start of context)
- Recency bias (end of context)

Architectural Effects
- Encoder-decoder models are more robust
- Robustness holds within the training-time sequence length
- Decoder-only models struggle across positions

Contextualization
- Query-aware contextualization helps key-value retrieval (sketched below)
- Minimal impact on multi-document QA
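Query-aware contextualization is simple to implement: the query is repeated before the data as well as after it, so a decoder-only model's causal attention can encode the data with the query already in scope. A minimal sketch (the layout is illustrative, not the paper's exact template):

```python
def build_prompt(query: str, context: str, query_aware: bool = True) -> str:
    """Optionally repeat the query before the context.

    Under causal attention, tokens attend only to earlier tokens, so placing
    the query first lets the context be encoded with the query in scope.
    """
    if query_aware:
        return f"{query}\n\n{context}\n\n{query}"
    return f"{context}\n\n{query}"  # baseline: query only after the context
```

The paper reports that this change nearly solves the key-value task for models such as GPT-3.5-Turbo while barely moving multi-document QA performance.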

Fine-tuning Influence
- The U-shape persists with or without instruction fine-tuning
- Instruction fine-tuning slightly mitigates the bias
- Larger models show the U-shape as well

Practical Trade-offs
- More context is not always better
- Reranking retrieved documents (see the reordering sketch below)
- Ranked-list truncation
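Because models favor the two ends of the context, one mitigation suggested by these results is to reorder a relevance-ranked document list so the strongest documents land at the extremes, and to truncate the ranked list rather than fill the middle with marginal documents. A minimal sketch (the function name and interface are ours, not from the paper):

```python
def reorder_for_long_context(ranked_docs, max_docs=None):
    """Interleave a best-first ranking toward both ends of the context.

    `ranked_docs` must be sorted most-relevant-first. Doc 1 goes to the
    front, doc 2 to the back, doc 3 to the front, and so on, so the weakest
    documents end up in the middle, where models attend least.
    """
    if max_docs is not None:
        ranked_docs = ranked_docs[:max_docs]   # ranked-list truncation
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: ["d1", "d2", "d3", "d4", "d5"] -> ["d1", "d3", "d5", "d4", "d2"]
```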