PERSONA VECTORS: MONITORING AND CONTROLLING CHARACTER TRAITS IN LANGUAGE MODELS
This paper introduces "persona vectors," linear directions in LLM activation space, to monitor and control character traits like evil, sycophancy, and hallucination. These vectors can detect personality shifts during deployment, predict and mitigate changes during finetuning, and identify problematic training data. The extraction method is automated and works with natural language trait descriptions.
Article Points:
1. Automated pipeline extracts persona vectors from natural language trait descriptions.
2. Persona vectors effectively control and monitor LLM character traits in deployment.
3. Finetuning-induced persona shifts correlate strongly with persona vector changes.
4. Post-hoc and preventative steering mitigate undesirable persona shifts during training.
5. Persona vectors predict finetuning shifts by analyzing training data projections.
6. Method identifies problematic training data at dataset and individual sample levels.
Concept
  Linear directions in activation space
  Underlying character traits (evil, sycophancy, hallucination)
Extraction
  Automated pipeline
  Natural language trait descriptions
  Contrastive prompting
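The extraction items above can be illustrated with a minimal sketch. It assumes that contrastive prompting yields two sets of activation vectors (one from responses under a trait-eliciting prompt, one from responses under a trait-suppressing prompt) and that the persona vector is the difference of their means. The summary does not spell out this step, so the difference-of-means choice, the function names, and the toy numbers are all assumptions for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]


def persona_vector(trait_acts, neutral_acts):
    """Assumed recipe: mean(trait activations) minus mean(neutral activations)."""
    pos = mean_vector(trait_acts)
    neg = mean_vector(neutral_acts)
    return [p - n for p, n in zip(pos, neg)]


# Toy 3-dimensional "activations" (made-up numbers, not real model data).
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]    # trait-eliciting prompt
neutral_acts = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]  # trait-suppressing prompt

vec = persona_vector(trait_acts, neutral_acts)
print(vec)
```

In a real setting the vectors would come from a model's hidden states at a chosen layer; plain lists are used here only to keep the sketch dependency-free.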
Applications
  Monitor persona shifts
  Control trait-specific behavior
  Mitigate shifts via steering
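Monitoring and steering with a linear direction can be sketched in a few lines. The sketch assumes that monitoring means reading the scalar projection of an activation onto the persona vector, and that steering means adding a scaled copy of the vector to the activation. The names `project`, `steer`, and `alpha`, as well as the numbers, are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(activation, vector):
    """Scalar projection of an activation onto the unit-normalized persona vector."""
    return dot(activation, vector) / math.sqrt(dot(vector, vector))


def steer(activation, vector, alpha):
    """Shift an activation by alpha times the persona vector (negative alpha suppresses)."""
    return [h + alpha * v for h, v in zip(activation, vector)]


persona_vec = [2.0, 0.0, 0.0]  # toy direction, not taken from a real model
h = [1.0, 5.0, -3.0]           # toy activation

before = project(h, persona_vec)          # monitoring signal before steering
steered = steer(h, persona_vec, alpha=-0.5)
after = project(steered, persona_vec)     # monitoring signal after steering
print(before, steered, after)
```

A larger projection is read here as stronger expression of the trait. Steering with a negative coefficient lowers that projection while leaving the components orthogonal to the vector unchanged.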
Finetuning Shifts
  Intended & unintended changes
  Correlate with persona vector shifts
  Preventative steering
Data Screening
  Predict post-finetuning behavior
  Identify problematic datasets
  Sample-level detection
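Sample-level screening by projection might look like the sketch below. It assumes each training sample comes with two activation vectors, one for the training response and one for a baseline response to the same prompt, and scores the sample by the difference of their projections onto the persona vector. The field names (`train_act`, `base_act`), the threshold, and the data are hypothetical and not taken from the paper.

```python
import math


def projection(activation, vector):
    """Scalar projection of an activation onto the unit-normalized persona vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(a * v for a, v in zip(activation, vector)) / norm


def screen(samples, vector, threshold):
    """Return (sample_id, score) pairs whose score exceeds the threshold, highest first."""
    scored = [
        (s["id"], projection(s["train_act"], vector) - projection(s["base_act"], vector))
        for s in samples
    ]
    flagged = [(sid, score) for sid, score in scored if score > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


persona_vec = [1.0, 0.0]  # toy direction
samples = [
    {"id": "a", "train_act": [0.5, 1.0], "base_act": [0.5, 2.0]},
    {"id": "b", "train_act": [3.0, 1.0], "base_act": [0.5, 1.0]},
    {"id": "c", "train_act": [1.5, 0.0], "base_act": [0.5, 0.0]},
]

flagged = screen(samples, persona_vec, threshold=0.75)
print(flagged)
```

Averaging the same per-sample scores over a whole dataset would give a dataset-level signal, under the same assumptions.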
Limitations