Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions

Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales

Overview

This repository contains the script used to run the experiments described in Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, the data that we collected, and the scripts that we used to analyze the data.

Experiment Script

The api.ipynb notebook collects the data. We used Python 3.10.7. To run it:

1. Navigate to the project directory and create and activate a virtual environment with the necessary packages:

       python -m venv .venv
       source .venv/bin/activate
       pip install -r requirements.txt

2. Run pip install notebook or pip install jupyterlab if necessary.
3. Set OPENAI_API_KEY in your environment to your OpenAI API key, or set the value of openai_api_key within the notebook (see the sketch after this list).
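
If you use the in-notebook option, here is a minimal sketch of the key lookup. It assumes the official openai Python client; it is not a verified excerpt from api.ipynb, and openai_api_key is just the variable named in step 3.

```python
import os

from openai import OpenAI

# Assumed pattern (not a verified excerpt from api.ipynb): use a key set in the
# notebook if provided, otherwise fall back to the OPENAI_API_KEY environment variable.
openai_api_key = None  # optionally paste your key here
client = OpenAI(api_key=openai_api_key or os.environ.get("OPENAI_API_KEY"))

# Quick sanity check that the key works: list one available model.
print(client.models.list().data[0].id)
```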

Data

The data that we collected are available in the data directory. candidate_scenarios.json and roles.csv were generated by GPT-4o and Claude Sonnet, with human curation. Everything else is a derivative of those two files or data obtained via api.ipynb.
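
For a quick look at the two curated source files, something like the following works (a sketch: the paths follow the description above, and no field or column names are assumed).

```python
import json

import pandas as pd

# Load the two human-curated source files from the data directory.
with open("data/candidate_scenarios.json") as f:
    candidate_scenarios = json.load(f)
roles = pd.read_csv("data/roles.csv")

# Report sizes only; the internal field/column layout is not documented here.
print(type(candidate_scenarios).__name__, len(candidate_scenarios))
print(roles.shape, list(roles.columns))
```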

Analyses

The measure_preferences script measures the native and instilled preferences of the fine-tuned models (and should be run before the effects script). The effects script performs the rest of the analyses reported in the paper.
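
To run the two analyses in the required order non-interactively, one option is sketched below. The file names and extensions are assumptions (substitute whatever the scripts are actually called in this repository), and both can equally be run by hand.

```python
import subprocess

# Hypothetical file names; measure_preferences must finish before effects runs.
for notebook in ["measure_preferences.ipynb", "effects.ipynb"]:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook],
        check=True,
    )
```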
