Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales
This repository contains the script used to run the experiments described in *Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions*, the data that we collected, and the scripts that we used to analyze those data.
The `api.ipynb` notebook collects the data. We used Python 3.10.7. To run it:

- Navigate to the project directory and create a virtual environment with the necessary packages:

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
- Run `pip install notebook` or `pip install jupyterlab` if necessary.
- Set `OPENAI_API_KEY` in your environment to your OpenAI API key, or set the value of `openai_api_key` within the notebook.
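A minimal sketch of the key-resolution logic described above (the variable name `openai_api_key` comes from the notebook; the placeholder fallback string is hypothetical and should be replaced with your own key):

```python
import os

# Prefer the OPENAI_API_KEY environment variable; otherwise fall back to a
# value set directly in the notebook (placeholder shown here).
openai_api_key = os.environ.get("OPENAI_API_KEY") or "your-api-key-here"
```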
The data that we collected are available in the `data` directory. `candidate_scenarios.json` and `roles.csv` were generated by GPT-4o and Claude Sonnet, with human curation. Everything else is a derivative of those two files or data obtained via `api.ipynb`.
The `measure_preferences` script measures the native and instilled preferences of the fine-tuned models (and should be run before the `effects` script). The `effects` script performs the rest of the analyses reported in the paper.