This repository contains the code for the Activation Oracles paper.
Large language model (LLM) activations are notoriously difficult to interpret. Activation Oracles take a simpler approach: they are LLMs trained to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language.
uv sync
source .venv/bin/activate
huggingface-cli login --token <your_token>
The easiest way to get started is with our demo notebook (Colab | Local), which demonstrates:
- Extracting hidden information (secret words) from fine-tuned models
- Detecting model goals without observing responses
- Analyzing emotions and reasoning in model activations
The Colab version runs on a free T4 GPU. If you want simple inference code to adapt to your application, the notebook is fully self-contained with no library imports. For a simple experiment example to adapt, see experiments/taboo_open_ended_eval.py.
We provide pre-trained oracle weights for 12 models across the Gemma-2, Gemma-3, Qwen3, and Llama 3 families. They are available on Hugging Face: Activation Oracles Collection
The wandb eval / loss logs for these models are available here. Note that the smaller models (1-4B) tend to have worse OOD eval performance, so they may be less reliable in practice.
To train an Activation Oracle, use the training script with torchrun:
torchrun --nproc_per_node=<NUM_GPUS> nl_probes/sft.py
By default, this trains a full Activation Oracle on Qwen3-8B using a diverse mixture of training tasks:
- System prompt question-answering (LatentQA)
- Binary classification tasks
- Self-supervised context prediction
You can train any model that's available on HuggingFace transformers by setting the appropriate model name.
Training configuration can be modified in nl_probes/configs/sft_config.py.
To replicate the evaluation results from the paper, run:
bash experiments/paper_evals.sh
This runs evaluations on five downstream tasks:
- Gender (Secret Keeping Benchmark)
- Taboo (Secret Keeping Benchmark)
- Secret Side Constraint (SSC, Secret Keeping Benchmark)
- Classification
- PersonaQA
The commands below assume you are running from the experiments/ directory:
cd experiments
Most Taboo scripts write outputs under:
- ./taboo_eval_results/ for raw evaluation / intervention results
- ./plotting/images/ for plots and CSV summaries
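To see what a run has written so far, a small helper can list the files under the two output roots. This is only a sketch: `list_outputs` is a hypothetical helper that is not part of the repository, and the suffix filter is an assumption about which file types are of interest.

```python
from pathlib import Path


def list_outputs(root, suffixes=(".json", ".csv", ".png")):
    """Return sorted relative paths of result files under `root`.

    Hypothetical helper; the suffix filter is an assumption, not a
    statement about what the repository scripts actually write.
    """
    root = Path(root)
    if not root.is_dir():
        return []  # Nothing has been written under this root yet.
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.suffix in suffixes
    )


if __name__ == "__main__":
    # Directory names taken from the text above; run from experiments/.
    for directory in ("taboo_eval_results", "plotting/images"):
        files = list_outputs(directory)
        print(f"{directory}: {len(files)} file(s)")
        for name in files:
            print("  " + name)
```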
Run the base secret-word verbalization evaluation for Hider:
python taboo_open_ended_eval.py \
--model_name google/gemma-2-9b-it \
--prompt_type all_direct
Run the Guesser version by enabling role swap:
python taboo_open_ended_eval.py \
--model_name google/gemma-2-9b-it \
--prompt_type all_direct \
--FT_role_swap
Run the same evaluation with concept_intent verbalizer prompts:
python taboo_open_ended_eval.py \
--model_name google/gemma-2-9b-it \
--prompt_type all_direct \
--verbalize_prompt concept_intent
And the Guesser version with concept_intent verbalizer prompts:
python taboo_open_ended_eval.py \
--model_name google/gemma-2-9b-it \
--prompt_type all_direct \
--verbalize_prompt concept_intent \
--FT_role_swap
This produces per-layer JSON files inside taboo_eval_results/<run_name>/.
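The four commands above form a 2x2 grid: role (Hider or Guesser) by verbalizer prompt (default or concept_intent). The sketch below only assembles those command strings from the flags shown above; it does not run anything. `build_commands` and the `"default"` label are illustrative names, not part of the repository.

```python
import shlex
from itertools import product


def build_commands(model_name="google/gemma-2-9b-it", prompt_type="all_direct"):
    """Build the four command lines (role x verbalizer prompt) as strings."""
    commands = {}
    for role_swap, concept_intent in product((False, True), repeat=2):
        args = [
            "python", "taboo_open_ended_eval.py",
            "--model_name", model_name,
            "--prompt_type", prompt_type,
        ]
        if concept_intent:
            args += ["--verbalize_prompt", "concept_intent"]
        if role_swap:
            # Role swap selects the Guesser version.
            args.append("--FT_role_swap")
        role = "guesser" if role_swap else "hider"
        prompt = "concept_intent" if concept_intent else "default"
        commands[(role, prompt)] = shlex.join(args)
    return commands


if __name__ == "__main__":
    for key, command in build_commands().items():
        print(key, "->", command)
```

Each string can be pasted into a shell from the experiments/ directory, or passed to `subprocess.run(shlex.split(command), check=True)` if you prefer to launch the runs from Python.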
Train multiclass / binary probes on the Taboo target-word datasets and evaluate the resulting probes on each context prompt:
python taboo_context_prompt_probe_eval.py \
--model_name google/gemma-2-9b-it
This writes taboo_context_prompt_probe_eval.json plus prompt-level CSV summaries for Hider and Guesser.
Build prompt-level Oracle-accuracy vs probe-score summaries:
python plotting/plot_taboo_prompt_probe_oracle_correlation.py \
--probe_json taboo_eval_results/gemma-2-9b-it_context_prompt_probe_all_direct_test/taboo_context_prompt_probe_eval.json \
--hider_secret_json_dir taboo_eval_results/gemma-2-9b-it_open_ended_all_direct_test \
--guesser_secret_json_dir taboo_eval_results/gemma-2-9b-it_open_ended_all_direct_test_swapped \
--required_verbalizer_prompt "Answer with a single word only. What is the secret word in this text?"
This produces prompt-level alignment CSVs such as:
- plotting/images/taboo_prompt_probe_oracle_correlation/.../prompt_probe_oracle_alignment_hider_layer_25_token_lora.csv
- plotting/images/taboo_prompt_probe_oracle_correlation/.../prompt_probe_oracle_alignment_guesser_layer_25_token_lora.csv
These CSVs are the main inputs for later prompt-sensitivity and intervention analyses.
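As a quick sanity check on one of these alignment CSVs, the sketch below computes a Pearson correlation between an Oracle-accuracy column and a probe-score column. The column name `oracle_accuracy` is hypothetical and may not match the real CSV header. `binary_linear_target_prob_mean` is borrowed from the `--probe_metric` value used elsewhere in this README; that it also appears as a column name is an assumption.

```python
import csv
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def oracle_probe_correlation(
    csv_path,
    oracle_col="oracle_accuracy",  # hypothetical column name
    probe_col="binary_linear_target_prob_mean",  # assumed column name
):
    """Correlate two numeric columns of a prompt-level alignment CSV."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    oracle = [float(row[oracle_col]) for row in rows]
    probe = [float(row[probe_col]) for row in rows]
    return pearson(oracle, probe)
```

Check the header row of the CSV first and pass the actual column names through `oracle_col` and `probe_col`.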
Export prompt-level Oracle performance summaries:
python plotting/plot_taboo_eval_context_prompt_performance.py \
--json_dir taboo_eval_results/gemma-2-9b-it_open_ended_all_direct_test \
--compare_json_dir taboo_eval_results/gemma-2-9b-it_open_ended_all_direct_test_swapped \
--primary_label hider \
--compare_label guesser \
--act_key lora \
--mode compare
Assign each prompt to a linguistic taxonomy and summarize category means:
python plotting/analyze_taboo_context_prompt_taxonomy.py \
--primary_csv plotting/images/taboo/taboo_context_prompt_perf_gemma-2-9b-it_open_ended_all_direct_test_token_lora.csv \
--compare_csv plotting/images/taboo/taboo_context_prompt_perf_gemma-2-9b-it_open_ended_all_direct_test_swapped_token_lora.csv \
--primary_label hider \
--compare_label guesser
This produces:
- plotting/images/taboo_context_prompt_taxanomy/..._taxonomy.csv
- plotting/images/taboo_context_prompt_taxanomy/..._category_summary.csv
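To illustrate what "category means" amounts to, the sketch below groups rows of a CSV by a category column and averages a numeric column. It is not the repository's analysis script, and the column names `category` and `oracle_accuracy` are hypothetical placeholders for whatever the taxonomy CSV actually contains.

```python
import csv
from collections import defaultdict


def category_means(csv_path, category_col="category", value_col="oracle_accuracy"):
    """Return {category: mean of value_col} for the rows of a CSV file.

    Both column names are hypothetical defaults; pass the real ones.
    """
    totals = defaultdict(lambda: [0.0, 0])  # category -> [running sum, row count]
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            bucket = totals[row[category_col]]
            bucket[0] += float(row[value_col])
            bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```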
Create probe-controlled HPLO / HPHO prompt groups from the prompt-level Oracle/probe alignment CSVs:
python plotting/plot_taboo_prompt_hplo_hpho_groups.py \
--input_dir plotting/images/taboo_prompt_probe_oracle_correlation/gemma-2-9b-it_all_direct_test \
--probe_metric binary_linear_target_prob_mean
Then extract feature candidates from those groups:
python taboo_hplo_hpho_feature_analysis.py \
--probe_json taboo_eval_results/gemma-2-9b-it_context_prompt_probe_all_direct_test/taboo_context_prompt_probe_eval.json \
--grouped_dir plotting/images/taboo_prompt_hplo_hpho_groups/gemma-2-9b-it_all_direct_test_binary_linear_target_prob_mean
For the shared high/low overlap analysis across Hider and Guesser:
python taboo_oracle_overlap_feature_analysis.py \
--probe_json taboo_eval_results/gemma-2-9b-it_context_prompt_probe_all_direct_test/taboo_context_prompt_probe_eval.json \
--alignment_dir plotting/images/taboo_prompt_probe_oracle_correlation/gemma-2-9b-it_all_direct_test \
--top_k 20 \
--bottom_k 20
Axis 1 removes or adds a shared feature derived from Guesser low-vs-high prompts:
python taboo_axis1_shared_feature_intervention_eval.py \
--feature_analysis_dir taboo_eval_results/gemma-2-9b-it_oracle_overlap_feature_analysis_all_direct_test \
--oracle_input_type tokens
Plot the resulting Oracle / probe / gap changes:
python plotting/plot_taboo_axis1_intervention_results.py \
--input_json taboo_eval_results/gemma-2-9b-it_axis1_guesser_bottom_top_intervention_all_direct_test_tokens_1.0/taboo_axis1_shared_feature_intervention_eval.json
Axis 2 constructs target-word-specific features from prompts where Oracle accuracy is not in the lowest regime:
python taboo_axis2_target_word_difference_intervention_eval.py \
--alignment_dir plotting/images/taboo_prompt_probe_oracle_correlation/gemma-2-9b-it_all_direct_test \
--oracle_input_type tokens
Plot the resulting Oracle / probe / gap changes:
python plotting/plot_taboo_axis2_intervention_results.py \
--input_json taboo_eval_results/gemma-2-9b-it_axis2_target_word_hider_minus_guesser_band_intervention_all_direct_test_tokens_1.0/taboo_axis2_target_word_difference_intervention_eval.json
Projection-removal control against the vanilla model's context-prompt activation:
python taboo_vanilla_model_overlap_projection_intervention_eval.py \
--oracle_input_type tokens
Plot the results:
python plotting/plot_taboo_vanilla_model_overlap_projection_intervention_results.py \
--input_json taboo_eval_results/gemma-2-9b-it_vanilla_model_overlap_projection_intervention_all_direct_test_tokens/taboo_vanilla_model_overlap_projection_intervention_eval.json
Vanilla low-vs-high contrast feature intervention:
python taboo_vanilla_model_low_high_contrast_intervention_eval.py \
--feature_analysis_dir taboo_eval_results/gemma-2-9b-it_oracle_overlap_feature_analysis_all_direct_test \
--oracle_input_type tokens
Plot the results:
python plotting/plot_taboo_vanilla_model_low_high_contrast_intervention_results.py \
--input_json taboo_eval_results/gemma-2-9b-it_vanilla_model_low_high_contrast_intervention_all_direct_test_tokens_1.0/taboo_vanilla_model_low_high_contrast_intervention_eval.json
Aggregate concept_intent responses into coarse semantic families (concealment, suppression, non-disclosure, deception/mislead, guessing-game, direct-answer):
python plotting/analyze_taboo_concept_intent_family_rates.py \
--input_dirs \
taboo_eval_results/gemma-2-9b-it_open_ended_all_direct_test \
taboo_eval_results/gemma-2-9b-it_open_ended_all_direct_test_swapped_concept_intent \
--labels hider guesser
Plot family-rate summaries:
python plotting/plot_taboo_concept_intent_family_rates.py \
--prompt_rates_csv plotting/images/taboo_concept_intent_family_analysis/concept_intent_prompt_family_rates_aggregated.csv \
--summary_csv plotting/images/taboo_concept_intent_family_analysis/concept_intent_family_summary_aggregated.csv \
--focus_family any_concealment_like_rate
Join family rates with prompt-level Oracle/probe alignment and visualize correlations:
python plotting/plot_taboo_concept_intent_oracle_probe_correlation.py \
--family_rates_csv plotting/images/taboo_concept_intent_family_analysis/concept_intent_prompt_family_rates_aggregated.csv \
--alignment_dir plotting/images/taboo_prompt_probe_oracle_correlation/gemma-2-9b-it_all_direct_test \
--probe_metrics binary_linear_target_prob_mean linear_target_prob_mean \
--scatter_family_metrics any_concealment_like_rate deception_mislead_rate
The repository also contains several more targeted geometry and ablation analyses, including:
- taboo_hider_guesser_direct_similarity.py
- taboo_hider_guesser_subspace_analysis.py
- taboo_hider_activation_cosine_similarity.py
- taboo_hider_prompt_centered_analysis.py
- taboo_hider_residual_structure_analysis.py
- taboo_context_prompt_token_cosine_similarity.py
- taboo_pc1_intervention_eval.py
- taboo_mean_role_difference_intervention_eval.py
- taboo_suppression_intervention_eval.py
These scripts follow the same general pattern: raw results are written under ./taboo_eval_results/, and plotting or summary scripts write under ./plotting/images/.
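Several of the intervention analyses above are described in terms of removing or adding a feature direction in activation space. As a rough illustration of what those two operations mean mathematically, the sketch below works on plain Python lists. It is not the repository's implementation: the function names are hypothetical, and real activations would be tensors handled by the scripts themselves.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_projection(activation, direction):
    """Return `activation` with its component along `direction` removed."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    scale = dot(activation, direction) / norm_sq
    return [a - scale * d for a, d in zip(activation, direction)]


def add_direction(activation, direction, strength=1.0):
    """Return `activation` shifted by `strength` along the unit direction."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a + strength * d / norm for a, d in zip(activation, direction)]


if __name__ == "__main__":
    activation = [3.0, 4.0]
    feature = [1.0, 0.0]
    cleaned = remove_projection(activation, feature)
    # After removal, the activation is orthogonal to the feature direction.
    print(cleaned, dot(cleaned, feature))
```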
If you use this code in your research, please cite our paper:
@misc{karvonen2025activationoraclestrainingevaluating,
title={Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers},
author={Adam Karvonen and James Chua and Clément Dumas and Kit Fraser-Taliente and Subhash Kantamneni and Julian Minder and Euan Ong and Arnab Sen Sharma and Daniel Wen and Owain Evans and Samuel Marks},
year={2025},
eprint={2512.15674},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.15674},
}