Why Language Models Hallucinate
This paper argues that large language models hallucinate because their training and evaluation procedures reward guessing over admitting uncertainty. It traces the statistical origins of hallucinations to pretraining, linking them to errors in a related binary classification problem, and attributes their persistence to misaligned post-training benchmarks that penalize abstention. The authors propose modifying the scoring of existing evaluations to penalize overconfident falsehoods and reward expressions of uncertainty, encouraging more trustworthy AI systems.
Article Points:
1. LM hallucinations stem from training and evaluation rewarding guessing over uncertainty.
2. Pretraining errors are statistical, analogous to binary classification misclassifications.
3. Hallucinations persist because post-training evaluations penalize uncertainty and abstention.
4. Modifying existing benchmarks to reward uncertainty is crucial for effective mitigation.
5. Explicit confidence targets in evaluations can foster trustworthy LM behavior.
6. Arbitrary facts, poor models, and GIGO (garbage in, garbage out) contribute to pretraining errors.

Pretraining Origins

Statistical Errors

- Binary Classification Analogy
- Epistemic Uncertainty
- Arbitrary Facts (Singletons)
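
The paper's statistical argument reduces generation to an implicit "Is-It-Valid" (IIV) binary classification problem: a model that cannot reliably separate valid from invalid statements cannot avoid producing some invalid ones. For arbitrary facts with no learnable pattern (such as individual birthdays), the resulting hallucination rate is lower-bounded by roughly the singleton rate, the share of fact mentions whose fact appears exactly once in the training data, in the spirit of the Good-Turing missing-mass estimate. The Python sketch below computes a singleton rate for a toy corpus; the corpus and names are illustrative, not taken from the paper.

```python
from collections import Counter

def singleton_rate(fact_mentions):
    """Good-Turing style singleton rate: the share of mentions whose
    fact appears exactly once in the corpus (N1 / N)."""
    counts = Counter(fact_mentions)
    n_singleton_mentions = sum(1 for c in counts.values() if c == 1)  # each singleton type contributes one mention
    return n_singleton_mentions / len(fact_mentions)

# Toy corpus of birthday facts seen during pretraining (illustrative data).
corpus = [
    ("Ada Lovelace", "born 1815"),
    ("Ada Lovelace", "born 1815"),      # repeated fact -> learnable
    ("Alan Turing", "born 1912"),
    ("Alan Turing", "born 1912"),
    ("Obscure Person A", "born 1903"),  # singleton -> likely hallucinated later
    ("Obscure Person B", "born 1967"),  # singleton -> likely hallucinated later
]

print(f"singleton rate = {singleton_rate(corpus):.2f}")  # 2 singleton mentions / 6 total = 0.33
```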

Poor Models

- Limited Representation
- N-gram Models Example
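
Restricted model families such as classic n-gram models illustrate the "poor models" point: even with clean training data, a model that cannot represent the needed dependencies must spread probability onto false outputs. The sketch below is an illustrative construction rather than an example from the paper: a bigram model trained on two true sentences conditions only on the previous word, so it assigns equal probability to the true and the false completion of "alice was born in".

```python
from collections import Counter, defaultdict

# Two true training sentences (illustrative data, not from the paper).
corpus = [
    "alice was born in paris .".split(),
    "bob was born in rome .".split(),
]

# Count bigram transitions.
bigrams = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        bigrams[prev][nxt] += 1

def next_word_distribution(prev):
    """Maximum-likelihood bigram distribution over the next word."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# Conditioned only on "in", the model has forgotten whether the subject
# was Alice or Bob, so it splices the two training sentences together.
print(next_word_distribution("in"))
# {'paris': 0.5, 'rome': 0.5} -> "alice was born in rome" gets half the mass
```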

Additional Factors

- Computational Hardness
- Distribution Shift
- GIGO (Garbage In, Garbage Out)
Post-training Persistence

Evaluation Misalignment

- Binary Grading Penalizes Uncertainty
- Guessing Maximizes Score
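
Under the binary right-or-wrong grading used by most benchmarks, abstaining ("I don't know") scores exactly 0, while a guess that is correct with any probability p > 0 has positive expected score, so a test-taking model is always better off guessing. A minimal sketch of that arithmetic, with illustrative confidence values:

```python
def expected_score_binary(p_correct, abstain):
    """Expected score under 0/1 grading: 1 for a correct answer,
    0 for a wrong answer, and 0 for abstaining."""
    return 0.0 if abstain else p_correct  # a guess earns p_correct on average

for p in (0.1, 0.3, 0.9):  # even a 10%-confident guess beats abstaining
    print(f"confidence={p:.1f}  guess={expected_score_binary(p, abstain=False):.2f}"
          f"  abstain={expected_score_binary(p, abstain=True):.2f}")
```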

Socio-technical Problem

- Need for Benchmark Modification
- Existing Hallucination Evals Insufficient
Mitigation Strategies

Explicit Confidence Targets

- Penalties for Incorrect Answers
- Behavioral Calibration
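
The proposed fix is to state an explicit confidence target t in the evaluation instructions: answer only if you are more than t confident, since wrong answers are penalized and "I don't know" scores 0. A penalty of t/(1-t) points per wrong answer makes t exactly the break-even confidence, because p - (1-p)·t/(1-t) > 0 if and only if p > t; acting on that rule is what behavioral calibration asks of the model. The sketch below verifies the crossover numerically; the threshold and confidence values are illustrative.

```python
def expected_score(p_correct, t, answer):
    """Scoring rule with an explicit confidence target t:
    +1 for a correct answer, -t/(1-t) for a wrong answer, 0 for 'I don't know'."""
    if not answer:
        return 0.0
    penalty = t / (1 - t)
    return p_correct * 1.0 - (1 - p_correct) * penalty

t = 0.75  # illustrative confidence target stated in the eval instructions
for p in (0.5, 0.75, 0.9):
    print(f"confidence={p:.2f}  answer={expected_score(p, t, answer=True):+.2f}"
          f"  abstain={expected_score(p, t, answer=False):+.2f}")
# Answering only pays off once confidence exceeds t (break-even at p == t),
# so a calibrated model is incentivized to abstain below the target.
```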

Integrate into Mainstream Evals

- Adjust Scoring of Benchmarks
- Reward Uncertainty Expressions
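
Concretely, mainstream benchmark scoring can be adjusted so that an explicit expression of uncertainty scores 0 rather than being lumped in with wrong answers, while confident errors are penalized. The grader below is a hypothetical sketch: the function name, the abstention phrases, and the penalty value are assumptions for illustration, not any benchmark's actual harness.

```python
ABSTAIN_PHRASES = {"i don't know", "i am not sure", "cannot determine"}  # assumed markers

def grade(prediction, gold, wrong_penalty=1.0):
    """Score one item: +1 correct, 0 for an explicit abstention,
    -wrong_penalty for a confident wrong answer."""
    pred = prediction.strip().lower()
    if pred in ABSTAIN_PHRASES:
        return 0.0
    return 1.0 if pred == gold.strip().lower() else -wrong_penalty

items = [  # illustrative (prediction, gold) pairs
    ("Paris", "Paris"),
    ("I don't know", "Canberra"),
    ("Sydney", "Canberra"),
]
print(sum(grade(pred, gold) for pred, gold in items))  # 1 + 0 - 1 = 0.0
```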