Lost in the Middle: How Language Models Use Long Contexts
This paper investigates how language models use long input contexts on tasks such as multi-document question answering and key-value retrieval. Performance degrades significantly when the relevant information sits in the middle of the context, tracing a U-shaped curve across positions (primacy and recency biases). This shows that current LMs do not robustly use information across long inputs, even in models explicitly designed for extended contexts.
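The paper's core probe is straightforward to sketch: build a QA prompt from a set of documents, move the single answer-bearing document through each slot, and score the model's output at every position. A minimal sketch in Python, assuming a hypothetical `ask_model` completion function; the prompt wording approximates, but is not, the paper's exact template:

```python
# Sketch of the paper's positional-sensitivity probe for multi-document QA.
# `ask_model` is a hypothetical stand-in for any LM completion call.

def build_prompt(question: str, documents: list[str]) -> str:
    """Concatenate documents into a single QA prompt."""
    docs = "\n\n".join(
        f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)
    )
    return (
        "Write a high-quality answer for the given question using only "
        f"the provided search results.\n\n{docs}\n\nQuestion: {question}\nAnswer:"
    )

def accuracy_by_position(question, gold_doc, distractor_docs, answers, ask_model):
    """Move the gold document through every slot and score each position."""
    results = {}
    for pos in range(len(distractor_docs) + 1):
        docs = distractor_docs[:pos] + [gold_doc] + distractor_docs[pos:]
        reply = ask_model(build_prompt(question, docs))
        # Scoring as in the paper: does any gold answer appear in the output?
        results[pos] = any(ans.lower() in reply.lower() for ans in answers)
    return results  # averaging over questions and plotting vs. pos reveals the U-shape
```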
Article Points:
1. LM performance degrades when relevant information is in the middle of long contexts.
2. Models show a U-shaped performance curve: best at the start and end, worst in the middle.
3. Extended-context models don't necessarily use long contexts more effectively.
4. Encoder-decoder models are more robust within their training-time sequence length.
5. Query-aware contextualization improves key-value retrieval, but helps less for multi-document QA (see the sketch after this list).
6. Instruction fine-tuning doesn't cause the U-shape, but can slightly mitigate positional bias.
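Two of these points are concrete enough to sketch: the synthetic key-value retrieval task (a JSON map of random UUID pairs, queried for one key) and query-aware contextualization (stating the query both before and after the data). A minimal sketch; the prompt wording is an assumption, not the paper's exact template:

```python
import json
import uuid

def kv_retrieval_prompt(n_pairs: int = 75, query_aware: bool = True) -> tuple[str, str]:
    """Build a synthetic key-value retrieval prompt.

    Returns (prompt, expected_value). With query_aware=True, the query is
    stated both before and after the data, the variant the paper found
    dramatically improves key-value retrieval accuracy.
    """
    pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n_pairs)}
    target_key = next(iter(pairs))  # query the first key; its position can be varied
    data = json.dumps(pairs, indent=1)
    question = f'Key: "{target_key}"\nCorresponding value:'
    if query_aware:
        # Query-aware contextualization: the query also precedes the data.
        prompt = (
            f"Extract the value for the following key.\n{question}\n\n"
            f"{data}\n\n{question}"
        )
    else:
        prompt = f"Extract the value for the following key.\n\n{data}\n\n{question}"
    return prompt, pairs[target_key]
```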
Outline:
Performance Degradation
- Significant drop when relevant info is mid-context
- Not robust to the position of relevant information
Positional Biases
- U-shaped performance curve
- Primacy bias (info at the start)
- Recency bias (info at the end)
Architectural Effects
- Encoder-decoder models more robust (within training sequence length)
- Decoder-only models struggle with mid-context information
Contextualization
- Query-aware contextualization helps key-value retrieval
- Minimal impact on multi-document QA
Fine-tuning Influence
- U-shape persists with or without instruction fine-tuning
- Instruction fine-tuning slightly mitigates positional bias
- Larger models also exhibit the U-shape
Practical Trade-offs
- More retrieved documents isn't always better; reader accuracy saturates before retriever recall (sketch below)
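On the trade-off itself, the paper's open-domain QA case study finds that reader accuracy saturates well before retriever recall as more documents are supplied, so larger contexts mostly add cost. A hypothetical harness for locating that saturation point, reusing `build_prompt` and the stand-in `ask_model` from the first sketch:

```python
def accuracy_vs_num_documents(question, answers, ranked_docs, ask_model,
                              ks=(1, 5, 10, 20, 30)):
    """Score the reader with increasingly many retrieved documents.

    `ranked_docs` is assumed to be retriever output, best-first. Per the
    paper's case study, scores plateau while retriever recall keeps rising.
    """
    scores = {}
    for k in ks:
        reply = ask_model(build_prompt(question, ranked_docs[:k]))
        scores[k] = any(ans.lower() in reply.lower() for ans in answers)
    return scores
```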