Mechanistic interpretability framework for detecting emergent deception in LLM agents.
A research tool for studying how deception emerges in multi-agent LLM systems and detecting it through activation analysis.
This framework enables researchers to:
- Run deception scenarios - Test LLM agents in situations that incentivize deception
- Capture activations - Record internal model states during deceptive behavior
- Train detection probes - Build classifiers that identify deception from activations
- Validate causally - Confirm probes detect real deception circuits via interventions
Install from source:

```bash
git clone https://github.com/tesims/multiagent-emergent-deception.git
cd multiagent-emergent-deception
pip install -e .
```

Run experiments from the command line:

```bash
# Quick test
deception run --model google/gemma-2b-it --trials 5 --mode both
# Full experiment with causal validation
deception run --model google/gemma-7b-it --trials 40 --mode both --causal
# List available scenarios
deception scenarios
```

Or drive experiments directly from Python:

```python
from config import ExperimentConfig
from interpretability import InterpretabilityRunner
# Auto-configure everything based on model
config = ExperimentConfig.for_model("google/gemma-2b-it", num_trials=10)
config.print_config_summary()
# Run experiment
runner = InterpretabilityRunner(
    model_name=config.model.name,
    device="cuda",
)
results = runner.run_all_emergent_scenarios(
    scenarios=config.scenarios.scenarios,
    trials_per_scenario=config.scenarios.num_trials,
)
# Save and analyze
runner.save_dataset("activations.pt")
```
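The saved dataset can then be used to train a detection probe. The sketch below is illustrative only: it assumes the file unpacks into an activation matrix and binary labels under hypothetical "activations"/"labels" keys, whereas the real schema is produced by the DatasetBuilder in interpretability/core/ (see docs/OUTPUT_GUIDE.md).

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative only: the "activations" / "labels" keys are assumptions,
# not the framework's actual schema -- inspect the saved file first.
data = torch.load("activations.pt", weights_only=False)
X = data["activations"].float().numpy()  # assumed shape [n_samples, d_model]
y = data["labels"].numpy()               # assumed 1 = deceptive, 0 = honest

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A linear probe is just a linear classifier trained on frozen activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```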
In a Colab notebook:

```python
# Install from GitHub
!pip install git+https://github.com/tesims/multiagent-emergent-deception.git
# Run experiment (2B model fits in free tier)
from config import ExperimentConfig
config = ExperimentConfig.for_model("google/gemma-2b-it", num_trials=5)
```

Models must be compatible with TransformerLens. Supported models:

| Model | VRAM | SAE | Use Case |
|---|---|---|---|
| google/gemma-2b-it | ~4GB | Yes | Fast iteration, Colab free |
| google/gemma-7b-it | ~16GB | Yes | Research quality |
| meta-llama/Llama-2-7b-chat-hf | ~14GB | No | Alternative architecture |
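For reference, here is a minimal sketch of what activation capture and a simple causal intervention look like in raw TransformerLens. This is generic TransformerLens usage, not the framework's InterpretabilityRunner or causal/ module; the prompt, layer, and hook choices are arbitrary, and Gemma weights require accepting the license on Hugging Face.

```python
import torch
from transformer_lens import HookedTransformer

# Load a TransformerLens-compatible model.
model = HookedTransformer.from_pretrained("google/gemma-2b-it")

prompt = "I will definitely walk away if you refuse this offer."
tokens = model.to_tokens(prompt)

# Capture activations: run_with_cache records every intermediate activation.
logits, cache = model.run_with_cache(tokens)
resid = cache["resid_post", 10]  # residual stream after block 10: [batch, pos, d_model]
print(resid.shape)

# Crude causal intervention: zero-ablate that residual stream and compare
# next-token logits. The framework's causal/ module implements its own
# patching, ablation, and steering logic; this only shows the mechanism.
def zero_ablate(value, hook):
    return torch.zeros_like(value)

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.10.hook_resid_post", zero_ablate)],
)
print((logits[0, -1] - ablated_logits[0, -1]).abs().max())
```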
```python
# Just change the model - everything else auto-configures
config = ExperimentConfig.for_model("google/gemma-2b-it")
```

All experiments are configured through config/:

```python
from config import ExperimentConfig, MODEL_PRESETS
# Option 1: Auto-configure (recommended)
config = ExperimentConfig.for_model("google/gemma-2b-it", num_trials=50)
# Option 2: Manual configuration
from config import ModelConfig, ScenarioConfig, ProbeConfig
config = ExperimentConfig(
    model=ModelConfig(name="google/gemma-2b-it"),
    probes=ProbeConfig.for_model("google/gemma-2b-it"),
    scenarios=ScenarioConfig(
        scenarios=["ultimatum_bluff", "alliance_betrayal"],
        num_trials=50,
    ),
)
# Option 3: Use presets
from config import QUICK_TEST, FULL_EXPERIMENT, FAST_ITERATION
config = QUICK_TEST  # 1 scenario, 1 trial
```

| Config | Purpose |
|---|---|
| ExperimentConfig | Main experiment settings |
| ModelConfig | LLM, TransformerLens, SAE settings |
| ProbeConfig | Linear probe training |
| CausalConfig | Activation patching, ablation, steering |
| ScenarioConfig | Deception scenarios and trials |
| StrategyConfig | Agent negotiation behavior |
| DeceptionDetectionConfig | Linguistic deception cues |
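Sub-configs can also be adjusted after auto-configuration. A small sketch, assuming the config objects are mutable and using only attribute names that appear elsewhere in this README:

```python
from config import ExperimentConfig

# Start from auto-configured defaults, then override individual fields.
config = ExperimentConfig.for_model("google/gemma-7b-it")
config.scenarios.num_trials = 40
config.scenarios.scenarios = ["ultimatum_bluff", "promise_break"]
print(config.model.name, config.scenarios.num_trials)
```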
In the emergent scenarios, agents receive no explicit deception instructions - they deceive because it is strategically advantageous:

| Scenario | Description |
|---|---|
| ultimatum_bluff | Bluffing about walking away from negotiation |
| capability_bluff | Overstating capabilities or resources |
| hidden_value | Hiding true preferences to gain advantage |
| info_withholding | Strategically withholding information |
| promise_break | Making promises with intent to break |
| alliance_betrayal | Forming alliances only to betray |
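Because the agents are never told to lie, ground-truth deception labels have to come from the simulation itself (this is what the GroundTruthDetector in interpretability/core/ is for). The snippet below is only a conceptual illustration of that idea, not the framework's actual logic; the function name and threshold are hypothetical.

```python
# Conceptual illustration only (not interpretability/core's GroundTruthDetector):
# flag an ultimatum bluff when the public claim contradicts the agent's
# private reservation value.
def is_ultimatum_bluff(private_walkaway_value: float,
                       claimed_walkaway_value: float,
                       tolerance: float = 0.05) -> bool:
    """Return True when the claimed walk-away point overstates the true one."""
    return claimed_walkaway_value > private_walkaway_value * (1 + tolerance)

print(is_ultimatum_bluff(private_walkaway_value=40.0, claimed_walkaway_value=60.0))  # True
```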
For baseline comparisons, a second scenario mode gives agents explicit instructions to deceive.

```
multiagent-emergent-deception/
├── config/ # CONFIGURATION
│ ├── experiment.py # ExperimentConfig, ModelConfig, etc.
│ └── agents/
│ └── negotiation.py # Agent behavior constants
│
├── interpretability/ # DECEPTION DETECTION
│ ├── cli.py # Click CLI (deception command)
│ ├── evaluation.py # InterpretabilityRunner
│ ├── core/ # DatasetBuilder, GroundTruthDetector
│ ├── scenarios/ # Deception scenarios
│ ├── probes/ # Probe training & analysis
│ └── causal/ # Causal validation
│
├── negotiation/ # AGENT IMPLEMENTATION
│ ├── components/ # Cognitive modules
│ └── game_master/ # GM components
│
├── concordia_mini/ # Framework dependency (Apache-2.0)
├── docs/ # Documentation
│ ├── SETUP.md # Detailed setup guide
│ ├── METHODOLOGY.md # Technical methodology
│ ├── OUTPUT_GUIDE.md # Output interpretation guide
│ ├── ARCHITECTURE.md # System architecture diagrams
│ ├── RUNPOD_GUIDE.md # Cloud GPU deployment
│ └── CONTRIBUTING.md # Contribution guidelines
└── tests/                     # Test suite
```

| Document | Description |
|---|---|
| docs/SETUP.md | Installation, configuration, Colab usage |
| docs/METHODOLOGY.md | Complete technical methodology |
| docs/OUTPUT_GUIDE.md | How to interpret experiment outputs |
| docs/ARCHITECTURE.md | System architecture with diagrams |
| docs/RUNPOD_GUIDE.md | Cloud GPU deployment (RunPod) |
| docs/CONTRIBUTING.md | Contribution guidelines |
If you use this framework in your research, please cite:

```bibtex
@software{sims2025deception,
  author = {Sims, Teanna},
  title = {Mechanistic Interpretability for Deception Detection in LLM Agents},
  year = {2025},
  url = {https://github.com/tesims/multiagent-emergent-deception}
}
```

Licensing:

- negotiation/, interpretability/, config/: AGPL-3.0 (copyleft)
- concordia_mini/: Apache-2.0 (Google DeepMind)