Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales
This repository contains the script used to run the experiments described in *Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions*, the data that we collected, and the scripts that we used to analyze those data.
The `api.ipynb` notebook collects the data. We used Python 3.10.7. To run it:

- Navigate to the project directory and create a virtual environment with the necessary packages:

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- Run `pip install notebook` or `pip install jupyterlab` if necessary.
- Set `OPENAI_API_KEY` in your environment to your OpenAI API key, or set the value of `openai_api_key` within the notebook.
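A minimal sketch of the key-resolution logic described above (the variable name `openai_api_key` comes from the notebook; the placeholder fallback string is hypothetical and should be replaced with your own key):

```python
import os

# Prefer the OPENAI_API_KEY environment variable; otherwise fall back to a
# value set directly in the notebook (placeholder shown here).
openai_api_key = os.environ.get("OPENAI_API_KEY") or "your-api-key-here"
```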
The data that we collected are available in the `data` directory. `candidate_scenarios.json` and `roles.csv` were generated by GPT-4o and Claude Sonnet, with human curation. Everything else is a derivative of those two files or data obtained via `api.ipynb`.
The `measure_preferences` script measures the native and instilled preferences of the fine-tuned models (and should be run before the `effects` script). The `effects` script performs the rest of the analyses reported in the paper.