Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Updated Apr 22, 2025 · Jupyter Notebook
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2x2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Training and exploration of linear probes into Othello-GPT by Li et al. (2022)
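The linear-probe idea behind this entry can be sketched in miniature: take activations from some layer and fit a linear map that predicts a board feature. The sketch below is purely illustrative and uses synthetic "activations" with a planted direction; the dimensions, data, and feature are hypothetical assumptions, not taken from the Othello-GPT probe code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 2000

# Hypothetical setup: a single planted direction in activation space
# encodes a binary board feature (stand-in for e.g. "square is mine").
direction = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))          # fake residual-stream activations
y = (X @ direction > 0).astype(float)      # binary feature label

# Linear probe: least-squares fit against {-1, +1} targets, then
# classify by the sign of the probe's output.
w, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)
acc = ((X @ w > 0) == y.astype(bool)).mean()
print(f"probe accuracy: {acc:.3f}")
```

Because the label is exactly linear in the activations here, the probe recovers the planted direction almost perfectly; on real model activations, probe accuracy is the evidence that the feature is linearly represented at that layer.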
Testing role-based pathways on small LLMs
A Flax-based library for examining transformers, based on TransformerLens.
Reverse engineering the circuit responsible for the "greater than" capability in a language model
Mechanistic interpretability using TransformerLens and PEFT.