Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
This paper investigates how extending input length affects the reasoning capabilities of Large Language Models (LLMs). It introduces FLenQA, a QA dataset designed to isolate the effect of input length by embedding the same reasoning tasks in padding of varying types and locations. The findings reveal a significant degradation in reasoning performance at input lengths far shorter than the models' technical maximum, and show that next-word prediction accuracy negatively correlates with reasoning accuracy on long inputs.
Article Points:
1. LLM reasoning performance degrades significantly with longer inputs.
2. Degradation occurs at input lengths much shorter than the technical maximum.
3. Next-word prediction accuracy negatively correlates with reasoning accuracy on long inputs.
4. Chain-of-Thought (CoT) prompting generally does not mitigate length-induced degradation.
5. LLMs exhibit distinct failure modes: refusal to answer, label bias, answering before reasoning, and poor CoT coverage.
6. The FLenQA dataset is introduced to isolate input length as a variable.
Problem Statement
- LLMs advertise support for very long inputs
- Whether reasoning performance stays consistent across input lengths is unclear
- Previous studies did not control for input length in isolation

Methodology (FLenQA Dataset)
- Isolates input length as the only varied factor
- QA tasks requiring reasoning over text
- Padding of varying types and locations (see the sketch below)
- Tasks: MonoRel, PIR (People in Rooms), Ruletaker
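As a rough illustration of how padding can isolate length, the sketch below embeds fixed key facts and a question inside irrelevant padding at a chosen location until a token budget is reached. Everything here (the build_instance name, whitespace tokenization, the location options) is an assumption for illustration, not the paper's actual construction code.

```python
import random

def build_instance(key_facts, question, padding_sentences,
                   target_tokens=1000, location="middle", seed=0):
    """Build one FLenQA-style sample: fixed key facts and question,
    embedded in irrelevant padding up to a target length (sketch only)."""
    rng = random.Random(seed)
    padding = []
    # Add padding sentences until the whole instance reaches the budget
    # (whitespace tokens are used here purely for illustration).
    while len(" ".join(padding + key_facts).split()) < target_tokens:
        padding.append(rng.choice(padding_sentences))

    if location == "start":          # key facts before all padding
        body = key_facts + padding
    elif location == "end":          # key facts after all padding
        body = padding + key_facts
    elif location == "middle":       # key facts in the middle of the padding
        mid = len(padding) // 2
        body = padding[:mid] + key_facts + padding[mid:]
    else:                            # "random": key facts dispersed in padding
        body = padding[:]
        for fact in key_facts:
            body.insert(rng.randrange(len(body) + 1), fact)

    return "\n".join(body) + f"\n\nQuestion: {question} (True/False)"
```

Varying only target_tokens, the padding source, or the location while keeping key_facts and question fixed is what lets length be studied with the underlying task held constant.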

Key Findings
- Reasoning performance degrades significantly as inputs grow longer
- Degradation begins at lengths well below the models' maximum context size
- The type of padding affects how severe the degradation is
- Position bias: accuracy depends on where the relevant facts sit in the input (see the sketch below)
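One way to surface these findings from per-instance results is to group accuracy by input length and by key-fact position; the record format below (length, location, correct) is hypothetical.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Mean accuracy per group, where each record is a dict like
    {"length": 1000, "location": "middle", "correct": True} (hypothetical)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(1.0 if r["correct"] else 0.0)
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}

# accuracy_by(results, "length")   -> the degradation curve over input length
# accuracy_by(results, "location") -> position bias across key-fact placements
```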

Next Word Prediction
- Next-word prediction accuracy increases with input length
- Negatively correlates with reasoning accuracy on the same inputs (see the correlation sketch below)
- Not a substitute for direct task evaluation
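The claimed relationship is a simple correlation between two per-length curves. The sketch below computes it with plain Pearson correlation; the nwp_acc and task_acc names are placeholders for measurements, not values from the paper.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return num / (sx * sy)

# nwp_acc[i]  : next-word prediction accuracy at the i-th input length
# task_acc[i] : reasoning (QA) accuracy at the same length
# A negative pearson(nwp_acc, task_acc) is the reported pattern: predicting
# the next word gets easier with more context while reasoning gets worse.
```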

Chain of Thought (CoT)
- Improves accuracy on short inputs
- Does not mitigate length-induced degradation (illustrative prompts below)
- GPT-4 is an exception
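The CoT condition differs from the direct condition mainly by an instruction asking the model to surface the relevant facts and reason before answering; the prompt wording below is an illustrative assumption, not the paper's exact prompt.

```python
def direct_prompt(text, question):
    """Ask for the verdict only."""
    return f"{text}\n\nQuestion: {question}\nAnswer with True or False only."

def cot_prompt(text, question):
    """Ask the model to quote the relevant facts and reason before answering."""
    return (f"{text}\n\nQuestion: {question}\n"
            "Think step by step: first quote the relevant facts, then explain "
            "your reasoning, and only then answer True or False.")
```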

Failure Modes
- Refusal to answer
- Label bias (e.g., defaulting to "false")
- Answering before reasoning
- Poor CoT coverage of the relevant facts (detection sketch below)
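A crude way to tag these failure modes automatically is with string heuristics over each response; the function below is an illustrative assumption, not the paper's evaluation protocol (label bias, in particular, only shows up in the distribution of answers across many responses).

```python
def classify_failure(response, key_facts):
    """Heuristic tags for the failure modes listed above (illustrative only)."""
    tags = []
    lower = response.lower()
    lines = lower.splitlines()
    # Refusal to answer: no True/False verdict anywhere in the response.
    if "true" not in lower and "false" not in lower:
        tags.append("refusal")
    # Answer before reasoning: the verdict already appears in the first line.
    elif lines and ("true" in lines[0] or "false" in lines[0]):
        tags.append("answer_before_reasoning")
    # Poor CoT coverage: the response never mentions some of the key facts.
    if any(fact.lower() not in lower for fact in key_facts):
        tags.append("poor_cot_coverage")
    return tags
```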