We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.
Mechanistic interpretability tool that visualizes GPT-2's layer-by-layer predictions using the logit lens technique.
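As a rough illustration of the technique (not this repository's code), a logit-lens pass in TransformerLens amounts to projecting each layer's residual stream through the final LayerNorm and the unembedding; the model and prompt below are placeholders.

```python
# Minimal logit-lens sketch with TransformerLens (illustrative, not the repo's code).
# For each layer, project the residual stream through the final LayerNorm and the
# unembedding to see which token the model "currently" favours at that depth.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
_, cache = model.run_with_cache(tokens)

with torch.no_grad():
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer]              # [batch, pos, d_model]
        layer_logits = model.unembed(model.ln_final(resid))
        top = layer_logits[0, -1].argmax().item()       # prediction at the last position
        print(f"layer {layer:2d} -> {model.to_single_str_token(top)!r}")
```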
Mechanistic interpretability tool that detects induction heads in GPT-2 using TransformerLens.
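A common way to scan for induction heads (a sketch under assumptions, not necessarily this repo's scoring or threshold) is to feed a repeated random sequence and measure how strongly each head attends from a token back to the token just after its previous occurrence:

```python
# Rough induction-head scan: repeat a random token sequence and score each head by
# its average attention along the "induction stripe" (offset 1 - seq_len).
# The 0.4 threshold is an arbitrary assumption for this sketch.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len, batch = 50, 4
rand = torch.randint(100, model.cfg.d_vocab, (batch, seq_len))
tokens = torch.cat([rand, rand], dim=-1).to(model.cfg.device)   # sequence repeated once
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]                   # [batch, head, q_pos, k_pos]
    stripe = pattern.diagonal(offset=1 - seq_len, dim1=-2, dim2=-1)
    scores = stripe.mean(dim=(0, -1))                   # one score per head
    for head, score in enumerate(scores):
        if score > 0.4:
            print(f"L{layer}H{head}: induction score {score:.2f}")
```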
Causal intervention framework for mechanistic interpretability research. Implements activation patching methodology for identifying causally important components in transformer language models.
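In TransformerLens terms, the core of activation patching looks roughly like the following (a minimal sketch with an IOI-style prompt pair; the layer, position, and metric are illustrative choices, not this framework's defaults or API):

```python
# Minimal activation-patching sketch (illustrative, not the framework's API).
# Run clean and corrupted prompts, overwrite one residual-stream position in the
# corrupted run with the clean activation, and see how the answer logit moves.
from functools import partial
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, John gave a drink to")
answer = model.to_single_token(" John")

def answer_logit(logits):
    return logits[0, -1, answer].item()

_, clean_cache = model.run_with_cache(clean)

def patch_resid(resid, hook, pos):
    # Copy the clean activation into the corrupted forward pass at one position.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

layer, pos = 8, 10  # assumed layer; pos 10 is the second name in this tokenization
patched = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", partial(patch_resid, pos=pos))],
)

print("clean    :", answer_logit(model(clean)))
print("corrupted:", answer_logit(model(corrupt)))
print("patched  :", answer_logit(patched))
```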
A research tool for studying how deception emerges in multi-agent LLM systems and detecting it through activation analysis.
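As a toy illustration of activation-based detection (not this repo's dataset or method): cache residual-stream activations for two contrasting sets of statements and fit a linear probe. The prompts, probe layer, and truthful-vs-deceptive framing below are placeholder assumptions.

```python
# Toy activation-analysis sketch: a linear probe on GPT-2 residual activations.
# All prompts, the probe layer, and the labels are illustrative stand-ins.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6  # assumed probe layer

truthful = ["The sky is blue.", "Paris is in France.", "Two plus two is four."]
deceptive = ["The sky is green.", "Paris is in Japan.", "Two plus two is five."]

def last_token_features(texts):
    feats = []
    with torch.no_grad():
        for text in texts:
            _, cache = model.run_with_cache(model.to_tokens(text))
            feats.append(cache["resid_post", LAYER][0, -1].cpu())
    return torch.stack(feats).numpy()

X = np.concatenate([last_token_features(truthful), last_token_features(deceptive)])
y = np.array([0] * len(truthful) + [1] * len(deceptive))

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```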
"Arithmetic Without Algorithms": Mechanistic analysis of arithmetic failure ("5+5=6") in GPT-2 Small using Induction Heads and Sparse Autoencoders (SAEs).
Forensic suite for mechanistic interpretability in transformers, implementing 0.0054 Basal Accountability Gradients for auditing model logic with TransformerLens and SAELens.
Code used for reverse-engineering a “Query-Gated Courier” circuit in Gemma-2-2B for role-gated retrieval.
🧩 Simplify causal intervention in transformer models with this modular library for accurate circuit analysis and behavior identification.