PERSONA VECTORS: MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS

This paper introduces "persona vectors": linear directions in an LLM's activation space that correspond to character traits such as evil, sycophancy, and hallucination, and that can be used to monitor and control those traits. The vectors can detect persona shifts during deployment, predict and mitigate shifts caused by finetuning, and flag problematic training data. Extraction is automated and needs only a natural-language description of the trait.

Article Points:

Automated pipeline extracts persona vectors from natural language trait descriptions.

Persona vectors effectively control and monitor LLM character traits in deployment.

Finetuning-induced persona shifts correlate strongly with persona vector changes.

Post-hoc and preventative steering mitigate undesirable persona shifts during training.

Persona vectors predict finetuning shifts by analyzing training data projections.

Method identifies problematic training data at dataset and individual sample levels.

Source:

PERSONA VECTORS: MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS

Tags: anthropic, fine tuning, evaluation

Concept

Linear directions in activation space

Underlying character traits (evil, sycophancy, hallucination)

Extraction

Automated pipeline

Natural language trait descriptions

Contrastive prompting
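
The card does not spell out the computation behind contrastive prompting. A minimal sketch, assuming the vector is taken as a difference of mean activations between responses generated under trait-eliciting prompts and responses generated under trait-suppressing prompts, might look like the following. The function names and the toy 3-dimensional "activations" are hypothetical; in practice the rows would be hidden states read from one layer of the model.

```python
from statistics import fmean

def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def persona_vector(trait_acts, neutral_acts):
    """Mean activation with the trait elicited minus mean with it suppressed."""
    pos = mean_vector(trait_acts)
    neg = mean_vector(neutral_acts)
    return [p - n for p, n in zip(pos, neg)]

# Toy data (made-up numbers, not real model activations).
trait_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]    # trait-eliciting prompts
neutral_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]  # trait-suppressing prompts

vec = persona_vector(trait_acts, neutral_acts)
print(vec)
```

Only the first coordinate differs between the two groups in this toy data, so the resulting direction points along that coordinate.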

Applications

Monitor persona shifts

Control trait-specific behavior

Mitigate shifts via steering
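
As an illustration of the monitoring and steering ideas listed above, the sketch below assumes that monitoring means projecting a hidden state onto the persona direction, and that steering means adding a scaled copy of the vector to the hidden state. The helper names, the coefficient `alpha`, and the 2-dimensional vectors are assumptions made for the example, not details taken from the card.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(activation, vector):
    """Scalar projection of an activation onto the unit-normalized vector."""
    return dot(activation, vector) / math.sqrt(dot(vector, vector))

def steer(activation, vector, alpha):
    """Add alpha * vector to the activation; a negative alpha steers away."""
    return [a + alpha * v for a, v in zip(activation, vector)]

persona = [3.0, 4.0]   # hypothetical persona vector (norm 5)
h = [6.0, 8.0]         # hypothetical hidden state at one layer

score = project(h, persona)             # monitoring signal
h_away = steer(h, persona, alpha=-1.0)  # steer against the trait
print(score, project(h_away, persona))
```

A larger projection is read as stronger expression of the trait; steering with a negative coefficient lowers it.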

Finetuning Shifts

Intended & unintended changes

Correlate with persona vector shifts

Preventative steering
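
One way to make the finetuning shift concrete, assuming it is measured as the change in mean activation between the base and finetuned models, projected onto the persona direction, is sketched below. All names and numbers are hypothetical, and the card itself does not give the formula.

```python
import math
from statistics import fmean

def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def finetuning_shift(base_acts, tuned_acts, vector):
    """Project the change in mean activation onto the unit persona direction."""
    base = mean_vector(base_acts)
    tuned = mean_vector(tuned_acts)
    delta = [t - b for t, b in zip(tuned, base)]
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(d * v for d, v in zip(delta, vector)) / norm

persona = [0.0, 2.0]                    # hypothetical persona vector
base_acts = [[1.0, 1.0], [1.0, 3.0]]    # base model, mean [1, 2]
tuned_acts = [[1.0, 4.0], [3.0, 6.0]]   # finetuned model, mean [2, 5]

shift = finetuning_shift(base_acts, tuned_acts, persona)
print(shift)
```

Under this reading, a positive shift means finetuning moved the model toward the trait, which is the quantity the card says correlates with the observed behavioral change.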

Data Screening

Predict post-finetuning behavior

Identify problematic datasets

Sample-level detection
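
Sample-level screening could be sketched as follows, assuming each training sample is scored by how much further its response projects onto the persona direction than the base model's own response to the same prompt. The tuple layout, the `threshold` parameter, and the toy numbers are assumptions made for illustration only.

```python
import math

def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def projection_difference(train_act, base_act, vector):
    """Projection of (training response - base response) onto the direction."""
    u = unit(vector)
    return sum((t - b) * x for t, b, x in zip(train_act, base_act, u))

def flag_samples(samples, vector, threshold):
    """Return ids of samples whose projection difference exceeds threshold."""
    return [sid for sid, train_act, base_act in samples
            if projection_difference(train_act, base_act, vector) > threshold]

persona = [1.0, 0.0]  # hypothetical persona vector
samples = [           # (id, training-response activation, base-response activation)
    ("a", [0.5, 1.0], [0.4, 1.0]),
    ("b", [3.0, 0.0], [0.5, 2.0]),
    ("c", [0.0, 5.0], [1.0, 0.0]),
]

flagged = flag_samples(samples, persona, threshold=1.0)
print(flagged)
```

Averaging the same per-sample score over a whole dataset would give a dataset-level signal in the same spirit.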

Limitations

Supervised extraction

LLM-based evaluation

Limited model/trait coverage
