Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
This paper investigates the impact of extending input length on the reasoning capabilities of Large Language Models (LLMs). It introduces FLenQA, a novel QA dataset designed to isolate the effect of input length using various padding types and locations. The findings reveal a significant degradation in LLM reasoning performance at input lengths much shorter than the models' technical maximum, with next-word prediction accuracy negatively correlating with reasoning accuracy.
Article Points:
1. LLM reasoning performance degrades significantly with longer inputs.
2. Degradation occurs at input lengths much shorter than the technical maximum.
3. Next-word prediction negatively correlates with reasoning accuracy on long inputs.
4. Chain-of-Thought (CoT) prompting generally doesn't mitigate length-induced degradation.
5. LLMs exhibit distinct failure modes: refusal to answer, label bias, answering early, and poor CoT coverage.
6. The FLenQA dataset was introduced to isolate input length as a variable.
Problem Statement
- LLMs claim support for long inputs
- Whether performance stays consistent at those lengths is unclear
- Previous studies did not isolate input length as a variable
Methodology (FLenQA Dataset)
- Isolates input length as the variable under test
- QA tasks requiring text-based reasoning
- Varies padding types & locations (see the sketch below)
- Tasks: MonoRel, People In Rooms (PIR), simplified Ruletaker
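As a concrete illustration of the padding manipulation, here is a minimal sketch of how a length-controlled instance might be assembled. The function name, the word-based length budget, and the example facts are illustrative assumptions, not the authors' actual code (the paper budgets length in tokens and draws padding from several sources, e.g. duplicated, similar, or unrelated text).

```python
import random

def build_instance(key_paragraphs, padding_text, target_len, location="after"):
    """Pad a QA input to roughly target_len words while keeping the
    reasoning-relevant paragraphs intact, so only input length varies.
    Hypothetical sketch, not the paper's implementation."""
    key = list(key_paragraphs)
    pad_words = padding_text.split()
    budget = max(target_len - sum(len(p.split()) for p in key), 0)
    pad = pad_words[:budget]

    if location == "before":      # all padding precedes the key paragraphs
        parts = [" ".join(pad)] + key
    elif location == "after":     # all padding follows the key paragraphs
        parts = key + [" ".join(pad)]
    else:                         # "middle": key paragraphs embedded in padding
        cut = random.randint(0, budget)
        parts = [" ".join(pad[:cut])] + key + [" ".join(pad[cut:])]
    return "\n\n".join(p for p in parts if p)

# Example: pad two key facts to ~300 words with trailing filler.
facts = ["Alice is older than Bob.", "Bob is older than Carol."]
padded = build_instance(facts, "lorem ipsum dolor " * 200, 300)
```

Holding the key paragraphs fixed while only the filler grows is what lets the benchmark attribute accuracy drops to length rather than to task difficulty.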
Key Findings
- Reasoning performance degrades significantly as inputs grow
- Degradation already appears at relatively short input lengths
- The type of padding affects how severe the drop is
- A position bias is observed: where the relevant text sits matters
Next-Word Prediction
- Next-word accuracy increases with input length
- It correlates negatively with reasoning accuracy (illustrated below)
- So it is not a substitute for task-level evaluation
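A tiny, self-contained illustration of that inverse relationship; the score lists are hypothetical placeholders shaped like the trend the paper describes, not its actual numbers.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-length averages: next-word accuracy rises with length
# while reasoning accuracy falls, so the correlation comes out negative.
next_word_acc = [0.55, 0.61, 0.66, 0.71, 0.75]
reasoning_acc = [0.90, 0.83, 0.72, 0.61, 0.50]
print(pearson(next_word_acc, reasoning_acc))  # strongly negative
```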
Chain of Thought (CoT)
- Improves accuracy on short inputs
- Doesn't mitigate length-induced degradation (see the comparison sketch below)
- GPT-4 is an exception
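To make the comparison concrete, here is a minimal sketch of scoring direct vs. CoT prompting on the same padded instances; `query_model` is a hypothetical stand-in for whatever completion API is used, and the prompt templates are assumptions, not the paper's.

```python
DIRECT = "{text}\n\nAnswer with exactly 'True' or 'False'.\nAnswer:"
COT = ("{text}\n\nLet's think step by step, then end with "
       "'Answer: True' or 'Answer: False'.")

def accuracy(instances, template, query_model):
    """Fraction of instances answered correctly under one prompt style.
    instances: list of (padded_text, gold_bool) pairs.
    query_model: hypothetical function, prompt str -> reply str."""
    hits = 0
    for text, label in instances:
        reply = query_model(template.format(text=text))
        # Take whatever follows the last 'Answer:' and look for 'true'.
        predicted_true = "true" in reply.lower().rsplit("answer:", 1)[-1]
        hits += predicted_true == label
    return hits / len(instances)
```

Scoring both templates at each padded length shows whether CoT merely shifts the accuracy curve upward or actually flattens the length-induced degradation.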
Failure Modes
- Refusal to answer
- Label bias
- Answering early
- Poor CoT coverage