The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
This paper re-evaluates hallucination detection methods in LLMs, arguing that common metrics like ROUGE misalign with human judgment. Through comprehensive human studies, it demonstrates that ROUGE's low precision produces misleading performance estimates: established methods show substantial drops when assessed with human-aligned metrics such as LLM-as-Judge. The study also shows that simple response-length heuristics can rival complex detection techniques, exposing fundamental flaws in current evaluation practices and underscoring the need for semantically aware evaluation frameworks.
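To make the ROUGE failure mode concrete, here is a minimal sketch (assuming the `rouge-score` package; the question and answers are illustrative, not taken from the paper) of how a lexically similar but factually wrong answer can outscore a correct paraphrase:

```python
# Minimal sketch: ROUGE-L rewards lexical overlap even when an answer is
# factually wrong, which is the misalignment the paper measures against
# human judgment. Assumes the `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "Marie Curie won the Nobel Prize in Physics in 1903."
wrong_but_overlapping = "Marie Curie won the Nobel Prize in Physics in 1911."
correct_but_reworded = "She received the 1903 Physics Nobel."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

for answer in (wrong_but_overlapping, correct_but_reworded):
    # score(target, prediction) returns precision/recall/fmeasure per metric.
    f1 = scorer.score(reference, answer)["rougeL"].fmeasure
    print(f"ROUGE-L F1 = {f1:.2f}  |  {answer}")

# A common threshold protocol (e.g. "correct" if ROUGE-L > 0.3) would accept
# the factually wrong answer and may reject the correct paraphrase.
```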
Article Points:
1. ROUGE and similar metrics severely misalign with human judgments of factual correctness.
2. LLM-as-Judge is a reliable, human-aligned metric for factual correctness.
3. The performance of existing hallucination detection methods is often overstated when evaluated with ROUGE.
4. Response length is a surprisingly effective indicator of hallucination.
5. Simple length-based heuristics can match or exceed sophisticated detectors (a minimal baseline sketch follows this list).
6. Current evaluation practices inflate reported performance and hide flaws.
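As an illustration of points 4 and 5, the sketch below scores each response by its raw length and evaluates that score with AUROC, the usual protocol for comparing hallucination detectors. The responses, labels, and use of scikit-learn are illustrative assumptions, not the paper's experimental setup:

```python
# Minimal sketch of the length baseline reported as competitive: treat
# response length as the hallucination score and compute AUROC against
# binary hallucination labels. Data below are toy examples.
from sklearn.metrics import roc_auc_score

responses = [
    "Paris.",                                                      # short, correct
    "The capital of Australia is Sydney, which has been the seat "
    "of government since federation and hosts parliament.",        # long, hallucinated
    "Canberra is the capital of Australia.",                       # short, correct
    "The novel was written in 1847 by the author's brother, who "
    "later revised it extensively under a pseudonym.",             # long, hallucinated
]
is_hallucination = [0, 1, 0, 1]

# Hallucination "score" = number of whitespace-separated tokens.
length_scores = [len(r.split()) for r in responses]

print("AUROC of length alone:", roc_auc_score(is_hallucination, length_scores))
```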
Current Evaluation Flaws
- ROUGE misaligns with human judgment
- Inflated performance estimates
- Sensitive to response length

Human-Aligned Evaluation
- LLM-as-Judge is reliable (a judge-prompt sketch follows this outline)
- Captures factual correctness

Surprising Length Factor
- Hallucinated responses are longer
- Length correlates with metrics
- Simple length is a competitive baseline

Re-evaluation Results
- Methods show dramatic performance drops
- Weak correlation with ROUGE

Implications & Future
- Shift to robust evaluation
- Avoid over-engineering detectors

Study Limitations
- Limited LLM/dataset scope
- Length heuristic nuances
- LLM-as-Judge biases
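For readers who want a starting point for the human-aligned evaluation, below is a minimal LLM-as-Judge sketch. The prompt wording, judge model, and OpenAI client are assumptions for illustration; the paper's exact judging setup may differ:

```python
# Minimal LLM-as-Judge sketch, not the paper's exact prompt: the judge gives
# a binary factual-correctness verdict given question, reference answer, and
# model response. Model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading factual correctness.
Question: {question}
Reference answer: {reference}
Model response: {response}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, response: str) -> bool:
    """Return True if the judge deems the response factually correct."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; any capable chat model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, response=response),
        }],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```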