Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy
This study investigates how varying levels of prompt politeness (Very Polite, Polite, Neutral, Rude, Very Rude) affect ChatGPT-4o's accuracy on multiple-choice questions. Contrary to some prior findings, impolite prompts consistently outperformed polite ones, with Very Rude prompts achieving the highest accuracy. These results suggest newer LLMs may respond differently to tonal variation, highlighting the importance of pragmatic aspects in human-AI interaction.
Article Points:
1. Prompt politeness significantly influences LLM accuracy.
2. Impolite prompts consistently outperformed polite ones on ChatGPT-4o.
3. Very Rude prompts achieved 84.8% accuracy, while Very Polite prompts achieved 80.8%.
4. Newer LLMs like ChatGPT-4o may react differently to tonal variation than older models.
5. The study used 250 unique prompts across five politeness levels for evaluation.
6. Ethical concerns arise because impolite prompts yielded better performance; the authors caution against hostile interfaces and advocate responsible AI use.
Research Question
- Impact of prompt politeness on LLM accuracy
- Validate prior studies on tone

Methodology
- Dataset: 250 unique prompts spanning five politeness levels (see the sketch below)
- LLM: ChatGPT-4o
- Evaluation: paired-sample t-test on accuracy
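To make the setup concrete, the sketch below shows one way such an evaluation could be wired up: a base multiple-choice item is wrapped in five tone prefixes, each variant is sent to the model, and the returned letter is checked against the answer key. The tone prefixes, the build_prompt/ask helpers, and the use of the OpenAI Python client with the gpt-4o model id are illustrative assumptions; the paper's exact templates, model settings, and scoring pipeline are not reproduced here.

```python
# Illustrative sketch (not the authors' exact templates or settings):
# wrap one multiple-choice question in five tone variants, query a chat model,
# and record whether the returned letter matches the answer key.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tone prefixes; the paper's actual wording is not reproduced here.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to answer the following question?",
    "polite": "Please answer the following question.",
    "neutral": "",
    "rude": "Answer this if you can manage it.",
    "very_rude": "You'd better not get this wrong. Answer it.",
}

def build_prompt(tone: str, question: str, options: dict[str, str]) -> str:
    """Combine a tone prefix, the question stem, and lettered options."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prefix = TONE_PREFIXES[tone]
    return (
        f"{prefix}\n{question}\n{opts}\n"
        "Respond with only the letter of the correct option."
    ).strip()

def ask(prompt: str) -> str:
    """Send one prompt and return the first character of the reply (expected: a letter)."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model id; the study reports using ChatGPT-4o
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()[:1].upper()

# Example usage with one dummy item (answer key "B"):
question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}
for tone in TONE_PREFIXES:
    answer = ask(build_prompt(tone, question, options))
    print(tone, answer, answer == "B")
```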

Key Findings
- Impolite prompts outperformed polite ones
- Very Rude: 84.8% accuracy
- Very Polite: 80.8% accuracy
- Tone differences statistically significant (see the t-test sketch below)
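Since the significance claim rests on paired comparisons, a minimal sketch of that step is shown below, assuming accuracy is collected per repeated run under each tone and compared with scipy.stats.ttest_rel; the accuracy values are dummy placeholders, not the study's data.

```python
# Paired-sample t-test sketch for two tone conditions.
# The accuracy values below are dummy placeholders, not the study's data.
from scipy import stats

# Accuracy per repeated run, paired by run index (an assumed design detail).
very_polite = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.81, 0.80, 0.82, 0.80]
very_rude   = [0.85, 0.84, 0.86, 0.84, 0.85, 0.83, 0.85, 0.86, 0.84, 0.86]

t_stat, p_value = stats.ttest_rel(very_rude, very_polite)
print(f"mean accuracy, very rude:   {sum(very_rude) / len(very_rude):.3f}")
print(f"mean accuracy, very polite: {sum(very_polite) / len(very_polite):.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```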

Discussion & Ethics
- Newer LLMs may react differently to tone than older models
- Politeness may register as just another string of words rather than a social cue
- Avoid hostile interfaces, even if rude prompts score higher
- Future work: other models, prompt perplexity (see the sketch below)
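The perplexity item presumably refers to checking whether tone variants differ in prompt perplexity, a possible confound for the accuracy differences. Assuming that reading, the sketch below scores two hypothetical prompts with GPT-2 via Hugging Face transformers; the scoring model and the example prompts are illustrative choices, not anything specified by the paper.

```python
# Sketch: score each tone variant's perplexity with a small open LM (GPT-2 here,
# chosen only for illustration). Lower perplexity means the phrasing is more
# "expected" by the scoring model; this is one candidate confound for tone effects.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Using the input ids as labels yields the average next-token loss.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return float(torch.exp(loss))

prompts = {
    "very_polite": "Would you kindly answer the following question?",
    "very_rude": "Answer this question. Don't mess it up.",
}
for tone, text in prompts.items():
    print(tone, round(perplexity(text), 2))
```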