Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

NeurIPS 2025

Christy Li, Josep Lopez Camuñas, Jake Touchet, Jacob Andreas, Agata Lapedriza Garcia, Antonio Torralba, Tamar Rott Shaham

[Figure 1]

We introduce the Self-reflective Automated Interpretability Agent (SAIA), a fully automated framework designed to detect visual attribute reliance in pretrained vision models. Given a pretrained model and a target visual concept (e.g., an image classifier selective for the object vase), SAIA identifies specific image features that systematically influence the model’s predictions, even when these features fall outside the model’s intended behavior (e.g., the classifier relies on flowers to detect the vase). At the core of our approach is an LM-based self-reflective agent that treats the task as a scientific discovery process. Rather than relying on a predefined set of candidate attributes, SAIA autonomously formulates hypotheses about image features that the model might rely on, designs targeted tests, and updates its beliefs based on observed model behavior. After generating an initial finding, SAIA actively evaluates how well it matches the model’s behavior on unseen test cases. If discrepancies arise during the self-evaluation, SAIA reflects on its assumptions, identifies gaps or inconsistencies in its current understanding, and initiates a new hypothesis-testing loop.

Installation

After cloning this repo, simply run

bash install.sh

To download precomputed exemplars for the attribute reliance benchmark models, run

bash download_exemplars.sh

Quick Start

You can run demo experiments on individual units using demo.ipynb:

Launch Jupyter Notebook

jupyter notebook

This command starts the Jupyter Notebook server and opens the Jupyter interface in your default web browser. The interface lists all the notebooks, files, and subdirectories in this repo (assuming it was launched from the root of the repo directory). Open demo.ipynb and follow the instructions there.

Batch experimentation

To run a batch of experiments, use main.py:

Load OpenAI or Anthropic API key

(you can create an OpenAI API key from the OpenAI platform dashboard and an Anthropic API key from the Anthropic Console).

Set your API key as an environment variable

export OPENAI_API_KEY='your-openai-api-key-here'
export ANTHROPIC_API_KEY='your-anthropic-api-key-here'
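
To confirm the variables are exported in the current shell before launching experiments, a quick optional sanity check (not a step required by the repo) is:

# prints a confirmation only if the variable is set in the current shell
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY is set"
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set"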

Load Hugging Face key

You will need a Hugging Face access token if you want to use Stable Diffusion 3.5 as the text2image model (you can create a token from your Hugging Face account settings).

Set your API key as an environment variable

export HF_TOKEN='your-hf-token-here'
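
If you prefer a stored credential over a per-shell environment variable, you can alternatively log in with the Hugging Face CLI (an optional alternative, not a step required by this repo):

# optional: persist the token locally via the Hugging Face CLI
huggingface-cli login --token "$HF_TOKEN"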

Run Agent

To run SAIA on a benchmark model, specify bias_mode, bias, and classifiers, e.g.:

python main.py --bias_mode setting --bias beach --classifiers 0 1

Refer to the ./exemplars/ folder to choose the classifier numbers based on the desired target concepts; for example, classifier 0 of the beach setting-dependent system refers to the target concept "bench", and classifier 1 of the same system refers to the target concept "bird".
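
To browse the available classifier indices and their target concepts, you can list the exemplars folder; the exact subfolder layout depends on what download_exemplars.sh produced, so treat the command below as a starting point:

# browse precomputed exemplars to map classifier indices to target concepts
ls -R ./exemplars/ | head -n 40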

To run SAIA on CLIP, set task, bias_mode, and bias to clip, and use labels to specify the target concepts, e.g.:

python main.py --task clip --bias_mode clip --bias clip --labels "scientist" "artist"

Refer to the documentation of main.py for more configuration options.
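
Assuming main.py exposes a standard argparse-style command-line interface, the full list of flags can be printed with:

python main.py --help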

Results are automatically saved to an HTML file under ./results/ and can be viewed in your browser by starting a local server:

python -m http.server 80

Once the server is up, open the HTML file via http://localhost:80 in your browser.
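
For example, assuming the server was started from the root of the repo, the directory listing of saved results is reachable under the /results/ path (the exact HTML filename depends on your run configuration):

# Linux; on macOS use `open` instead of `xdg-open`
xdg-open http://localhost:80/results/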
