SAT-Eval (Simulated Adversarial Trajectory Evaluation) introduces a simulation and evaluation framework for studying preference drift in multi-turn LLM conversations. It runs controlled dialogue experiments between AI personas, administers pre/post surveys to track attitude change, and supports ablation studies to isolate the factors that drive opinion movement.
π Read the paper draft on Overleaf
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SAT-Eval Pipeline β
β β
β Before: Alice prefers SF (q1=A "Very negative") β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Multi-Turn Conversation β β
β β β β
β β Alice: "I love SF, the energy is incredible" β β
β β β β β
β β βΌ β β
β β Bob: "Seattle has great tech + lower cost β β
β β of living" β β
β β β β β
β β βΌ β β
β β Alice: "Hmm, I hadn't considered that..." β β
β β β β β
β β βΌ β β
β β Bob: "Plus the outdoor lifestyle is unmatched" β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β After: Alice is open to Seattle (q1=C "Somewhat pos.") β
β β
β Preference Drift: q1 moved A β C (2-point shift) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
SAT-Eval lets you answer questions like: Does an adversarial conversational strategy move opinions more than a neutral one? How many turns does it take for preferences to shift? It does this by:
- Simulating multi-turn conversations between two AI personas with defined starting positions
- Administering structured surveys before and after each conversation to measure preference change
- Running ablation sweeps over parameters (e.g.
adversarial,num_turns) to compare conditions - Using an optional LLM judge to score trajectory drift and identify key turning points
pip install -r requirements.txtCreate a .env file:
ANTHROPIC_API_KEY=sk-ant-your-key-here
HF_API_KEY=hf_your-key-here # only needed for --provider huggingfaceRun experiments:
# List available scenarios
python run.py --show-survey
# Run 10 experiments with default settings
python run.py --scenario seattle-sf
# Run 5 experiments, 3 turns each, with full output
python run.py --scenario seattle-sf --num-experiments 5 --num-turns 3 --verbose
# Use adversarial mode (persona_b tries to change persona_a's views)
python run.py --scenario seattle-sf --adversarial
# Ablation study β sweep a cartesian product of parameters
python run.py --scenario seattle-sf --ablate '{"num_turns": [3, 5], "adversarial": [true, false]}' --num-experiments 2| Argument | Default | Description |
|---|---|---|
--scenario |
required | Scenario name (YAML file in scenarios/) |
--num-experiments |
10 | Number of runs |
--num-turns |
5 | Conversation exchanges per run |
--verbose |
false | Print full conversations and surveys |
--adversarial |
false | Use adversarial persona variants |
--provider |
anthropic | anthropic or huggingface |
--model |
provider default | Override the model name |
--survey-questions |
all | Comma-separated question IDs, e.g. q1,q3 |
--debug |
false | Export per-call token usage CSV |
--ablate |
β | JSON string for parameter sweeps |
--show-survey |
β | List available scenarios and exit |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β run.py (CLI) β
β argparse Β· scenario selection Β· ablation β
βββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ
β
βββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββ
β simulator/experiments.py β
β ExperimentRunner Β· AblationGrid Β· summaries β
ββββββββ¬βββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββ
β β β
ββββββββΌβββββββ βββββββΌβββββββ ββββββΌββββββββββββββββββββ
β conversationβ β survey.py β β scenarios.py β
β .py β β administer β β load_scenario() β
β run() β β _survey() β β YAML β ScenarioConfig β
ββββββββ¬βββββββ βββββββ¬βββββββ ββββββββββββββββββββββββββ
β β
ββββββββΌβββββββββββββββΌβββββββββββββββββββββββββββββββββββ
β simulator/personas.py β
β Persona Β· build_conversation_prompt() β
β build_survey_prompt() β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β
ββββββββΌβββββββ βββββββΌβββββββββββββββββββββββββββββββββββ
β providers β β tracking.py β
β .py β β UsageTracker Β· CallRecord β
β Anthropic β β summary() Β· cost() Β· export_csv() β
β HuggingFaceβ ββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββ
Data flows top-down: run.py builds an ExperimentConfig and hands it to ExperimentRunner, which orchestrates the full lifecycle β loading the scenario, running the conversation loop, administering pre/post surveys, diffing the results, and writing JSON output. providers.py and tracking.py are cross-cutting dependencies injected into every layer that makes API calls.
Key design choices:
Provideris aProtocol, soAnthropicProviderandHuggingFaceProviderare interchangeable without a shared base class- Survey and conversation functions are stateless β all state lives in the dataclasses they return (
Turn,SurveyResult,ExperimentResult) ExperimentResultembeds its fullExperimentConfig, making every result file self-describing and reproducibleAblationGridsweeps a cartesian product of anyExperimentConfigfields, reusing the sameExperimentRunnermachinery
run.py # CLI entry point
simulator/
personas.py # Persona dataclass and system prompt builders
providers.py # Anthropic / HuggingFace provider abstraction
conversation.py # Turn-taking conversation loop
survey.py # Survey administration and change analysis
scenarios.py # YAML scenario loader
experiments.py # ExperimentRunner and AblationGrid
tracking.py # Token usage tracker
scenarios/ # YAML scenario definitions
results/ # Output directory (experiments, conversations, debug)
tests/ # pytest test suite
Scenarios are YAML files in scenarios/. Each defines two personas, an opening message, and a multiple-choice survey.
name: my-scenario
persona_a:
name: Alice
background: "..."
personality: "..."
style: "..."
goals: "..."
persona_b:
name: Bob
preference: "..." # simplified mode β only preference needed
# Optional adversarial variants
persona_b_adversarial:
name: Bob
strategy: "Convince Alice that..."
survey:
title: "My Survey"
questions:
q1:
question: "How do you feel about X?"
options:
A: "Strongly against"
B: "Somewhat against"
C: "Somewhat in favor"
D: "Strongly in favor"
initial_message: "Opening line from persona_a..."Persona modes:
- Simplified β set only
preference - Full β set
background,personality,style,goals - Adversarial β set
strategy; selected when--adversarialflag is used
| Provider | Default model | Env var |
|---|---|---|
anthropic |
claude-sonnet-4-5-20250929 |
ANTHROPIC_API_KEY |
huggingface |
Qwen/Qwen3-4B-Instruct-2507:nscale |
HF_API_KEY |
Results are written to results/:
results/experiments/β JSON experiment results (embed full config for reproducibility)results/conversations/β Conversation logsresults/debug/token_counts/β Per-call token CSV (when--debugis set)
pytest tests/