Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
This paper investigates how extending input length affects the reasoning capabilities of Large Language Models (LLMs). It introduces FLenQA, a QA dataset designed to isolate the effect of input length by embedding the same reasoning tasks in padding of varying types and locations. The findings reveal a significant degradation in reasoning performance at input lengths far shorter than the models' technical maximum, and show that next-word prediction accuracy negatively correlates with reasoning accuracy on long inputs.
Article Points:
1. LLM reasoning performance degrades significantly with longer inputs.
2. Degradation occurs at input lengths much shorter than the technical maximum.
3. Next-word prediction accuracy negatively correlates with reasoning accuracy on long inputs.
4. Chain-of-Thought (CoT) prompting generally does not mitigate length-induced degradation.
5. LLMs exhibit distinct failure modes: refusal to answer, label bias, answering before reasoning, and poor CoT coverage.
6. The FLenQA dataset is introduced to isolate input length as a variable.
Problem Statement
- LLMs advertise support for very long inputs
- Whether reasoning performance stays consistent across input lengths is unclear
- Previous studies did not control for input length in isolation

Methodology (FLenQA Dataset)
- Isolates input length as the only varied factor
- QA tasks requiring reasoning over text
- Padding of varying types and locations (see the sketch below)
- Tasks: MonoRel, PIR (People in Rooms), Ruletaker
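As a rough illustration of how padding can isolate length, the sketch below embeds fixed key facts and a question inside irrelevant padding at a chosen location until a token budget is reached. Everything here (the build_instance name, whitespace tokenization, the location options) is an assumption for illustration, not the paper's actual construction code.

```python
import random

def build_instance(key_facts, question, padding_sentences,
                   target_tokens=1000, location="middle", seed=0):
    """Build one FLenQA-style sample: fixed key facts and question,
    embedded in irrelevant padding up to a target length (sketch only)."""
    rng = random.Random(seed)
    padding = []
    # Add padding sentences until the whole instance reaches the budget
    # (whitespace tokens are used here purely for illustration).
    while len(" ".join(padding + key_facts).split()) < target_tokens:
        padding.append(rng.choice(padding_sentences))

    if location == "start":          # key facts before all padding
        body = key_facts + padding
    elif location == "end":          # key facts after all padding
        body = padding + key_facts
    elif location == "middle":       # key facts in the middle of the padding
        mid = len(padding) // 2
        body = padding[:mid] + key_facts + padding[mid:]
    else:                            # "random": key facts dispersed in padding
        body = padding[:]
        for fact in key_facts:
            body.insert(rng.randrange(len(body) + 1), fact)

    return "\n".join(body) + f"\n\nQuestion: {question} (True/False)"
```

Varying only target_tokens, the padding source, or the location while keeping key_facts and question fixed is what lets length be studied with the underlying task held constant.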

Key Findings
- Reasoning performance degrades significantly as inputs grow longer
- Degradation begins at lengths well below the models' maximum context size
- The type of padding affects how severe the degradation is
- Position bias: accuracy depends on where the relevant facts sit in the input (see the sketch below)
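One way to surface these findings from per-instance results is to group accuracy by input length and by key-fact position; the record format below (length, location, correct) is hypothetical.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Mean accuracy per group, where each record is a dict like
    {"length": 1000, "location": "middle", "correct": True} (hypothetical)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(1.0 if r["correct"] else 0.0)
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}

# accuracy_by(results, "length")   -> the degradation curve over input length
# accuracy_by(results, "location") -> position bias across key-fact placements
```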

Next Word Prediction
- Next-word prediction accuracy increases with input length
- Negatively correlates with reasoning accuracy on the same inputs (see the correlation sketch below)
- Not a substitute for direct task evaluation
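The claimed relationship is a simple correlation between two per-length curves. The sketch below computes it with plain Pearson correlation; the nwp_acc and task_acc names are placeholders for measurements, not values from the paper.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return num / (sx * sy)

# nwp_acc[i]  : next-word prediction accuracy at the i-th input length
# task_acc[i] : reasoning (QA) accuracy at the same length
# A negative pearson(nwp_acc, task_acc) is the reported pattern: predicting
# the next word gets easier with more context while reasoning gets worse.
```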

Chain of Thought (CoT)
- Improves accuracy on short inputs
- Does not mitigate length-induced degradation (illustrative prompts below)
- GPT-4 is an exception
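The CoT condition differs from the direct condition mainly by an instruction asking the model to surface the relevant facts and reason before answering; the prompt wording below is an illustrative assumption, not the paper's exact prompt.

```python
def direct_prompt(text, question):
    """Ask for the verdict only."""
    return f"{text}\n\nQuestion: {question}\nAnswer with True or False only."

def cot_prompt(text, question):
    """Ask the model to quote the relevant facts and reason before answering."""
    return (f"{text}\n\nQuestion: {question}\n"
            "Think step by step: first quote the relevant facts, then explain "
            "your reasoning, and only then answer True or False.")
```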

Failure Modes
- Refusal to answer
- Label bias (e.g., defaulting to "false")
- Answering before reasoning
- Poor CoT coverage of the relevant facts (detection sketch below)
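A crude way to tag these failure modes automatically is with string heuristics over each response; the function below is an illustrative assumption, not the paper's evaluation protocol (label bias, in particular, only shows up in the distribution of answers across many responses).

```python
def classify_failure(response, key_facts):
    """Heuristic tags for the failure modes listed above (illustrative only)."""
    tags = []
    lower = response.lower()
    lines = lower.splitlines()
    # Refusal to answer: no True/False verdict anywhere in the response.
    if "true" not in lower and "false" not in lower:
        tags.append("refusal")
    # Answer before reasoning: the verdict already appears in the first line.
    elif lines and ("true" in lines[0] or "false" in lines[0]):
        tags.append("answer_before_reasoning")
    # Poor CoT coverage: the response never mentions some of the key facts.
    if any(fact.lower() not in lower for fact in key_facts):
        tags.append("poor_cot_coverage")
    return tags
```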