PERSONA VECTORS: MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS
This paper introduces "persona vectors": linear directions in an LLM's activation space that can be used to monitor and control character traits such as evil, sycophancy, and hallucination. These vectors can detect persona shifts during deployment, predict and mitigate shifts induced by finetuning, and identify problematic training data. The extraction method is automated and takes a natural language description of the trait as input.
Article Points:
1. Automated pipeline extracts persona vectors from natural language trait descriptions.
2. Persona vectors effectively control and monitor LLM character traits in deployment.
3. Finetuning-induced persona shifts correlate strongly with persona vector changes.
4. Post-hoc and preventative steering mitigate undesirable persona shifts during training.
5. Persona vectors predict finetuning shifts by analyzing training data projections.
6. Method identifies problematic training data at dataset and individual sample levels.
Concept

Linear directions in activation space

Underlying character traits (evil, sycophancy, hallucination)

Extraction

Automated pipeline

Natural language trait descriptions

Contrastive prompting
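
The extraction idea can be illustrated with a small sketch. It assumes that contrastive prompting yields two sets of activations at one layer (responses that exhibit the trait and responses that do not) and that the direction is taken as the difference of their means; the summary above does not spell out this detail, so treat it as one plausible reading. The function name `extract_persona_vector` and the synthetic activations are hypothetical stand-ins for real model activations.

```python
import numpy as np

def extract_persona_vector(pos_acts, neg_acts):
    """Unit-norm difference-of-means direction (an assumed extraction rule).

    pos_acts: (n_pos, d_model) activations from trait-eliciting prompts.
    neg_acts: (n_neg, d_model) activations from trait-suppressing prompts.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return direction / norm

# Synthetic stand-in data: the "trait" is planted along coordinate 3.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
neg = rng.normal(size=(200, d_model))
pos = rng.normal(size=(200, d_model)) + 4.0 * true_dir

persona_vector = extract_persona_vector(pos, neg)
cosine = float(persona_vector @ true_dir)  # similarity to the planted direction
print(f"cosine with planted direction: {cosine:.3f}")
```

With real models, the two activation sets would come from hidden states recorded under the contrastive prompts rather than from a random generator.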

Applications

Monitor persona shifts

Control trait-specific behavior

Mitigate shifts via steering
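
Monitoring and steering can both be read as simple operations on a hidden state once a persona vector is available. The sketch below assumes monitoring means projecting hidden states onto the unit-normalized vector, and steering means adding a scaled copy of that vector; `trait_projection`, `steer`, and the coefficient `alpha` are illustrative names, not taken from the paper.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def trait_projection(hidden, v):
    """Monitoring: scalar projection of each hidden state onto the direction."""
    return np.asarray(hidden, dtype=float) @ unit(v)

def steer(hidden, v, alpha):
    """Control: shift every hidden state by alpha along the direction.

    Negative alpha pushes away from the trait, positive alpha toward it.
    """
    return np.asarray(hidden, dtype=float) + alpha * unit(v)

# Synthetic hidden states and direction, for illustration only.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
v = rng.normal(size=8)

before = trait_projection(hidden, v)
after = trait_projection(steer(hidden, v, alpha=-3.0), v)
print(after - before)  # each projection moves by alpha
```

Because the direction is normalized, steering with coefficient `alpha` changes the monitored projection by exactly `alpha`, which makes the two operations easy to reason about together.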

Finetuning Shifts

Intended & unintended changes

Correlate with persona vector shifts

Preventative steering
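
One way to quantify a finetuning-induced shift is to compare mean activations before and after finetuning on the same prompts and project the difference onto the persona vector. This is a minimal sketch under that assumption; `finetuning_shift` is a hypothetical helper, and the "finetuned" activations are synthetic, built by displacing the base activations along the direction.

```python
import numpy as np

def finetuning_shift(base_acts, tuned_acts, v):
    """Projection of the mean activation change onto a unit persona direction."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    delta = np.asarray(tuned_acts, dtype=float).mean(axis=0) \
        - np.asarray(base_acts, dtype=float).mean(axis=0)
    return float(delta @ v)

# Synthetic example: finetuning is simulated as a displacement of 2.0 along v.
rng = np.random.default_rng(1)
d_model = 8
v = np.zeros(d_model)
v[0] = 1.0
base = rng.normal(size=(500, d_model))
tuned = base + 2.0 * v

shift = finetuning_shift(base, tuned, v)
print(f"shift along persona vector: {shift:.3f}")
```

A shift computed this way is a single number per trait, so it can be compared against measured trait expression across several finetuned models.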

Data Screening

Predict post-finetuning behavior

Identify problematic datasets

Sample-level detection
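
Data screening at both levels can be sketched as scoring each training sample by its projection onto the persona vector. The version below assumes each sample has a response activation and a baseline activation to compare against; the baseline, the `threshold`, and the function names are assumptions made for illustration, and the numbers are toy values rather than results.

```python
import numpy as np

def sample_scores(sample_acts, baseline_acts, v):
    """Per-sample projection of (sample - baseline) onto the unit direction."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    diff = np.asarray(sample_acts, dtype=float) - np.asarray(baseline_acts, dtype=float)
    return diff @ v

def screen(sample_acts, baseline_acts, v, threshold):
    """Dataset-level score (mean) plus indices of samples above the threshold."""
    scores = sample_scores(sample_acts, baseline_acts, v)
    return {
        "dataset_score": float(scores.mean()),
        "flagged": np.flatnonzero(scores > threshold),
    }

# Toy activations: only coordinate 0 lies along the persona direction.
sample_acts = np.array([
    [0.1, 0.5, 0.0, 0.0],
    [3.0, -0.2, 0.0, 0.0],
    [-0.4, 1.0, 0.0, 0.0],
    [2.5, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
baseline_acts = np.zeros_like(sample_acts)
v = np.array([2.0, 0.0, 0.0, 0.0])  # deliberately not unit length

report = screen(sample_acts, baseline_acts, v, threshold=1.0)
print(report["dataset_score"], report["flagged"].tolist())
```

The mean score gives a dataset-level signal, while the flagged indices point to individual samples worth inspecting before finetuning.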

Limitations

Supervised extraction

LLM-based evaluation

Limited model/trait coverage