This is an official implementation of *Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing*.
- Run `pip install -r requirements.txt` to install the dependencies.
- Run `python finetune.py --dataset <dataset_name> --model <model_name>` to train the model.
- Run `python generate.py --model <model_name> --lora <lora_adapter> --dataset <test_data>` to check whether any undesirable behaviors occur on the test set, and then put the flagged samples in `./datasets/validation/`.
- Run `python tracing.py --model <model_name> --lora <lora_adapter> --method <tracing_method> --topk <topk>` to trace the detected undesirable behaviors back to the corresponding training samples. (An example end-to-end run is sketched after this list.)
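For reference, a complete pass over the pipeline might look like the sequence below. All concrete values (dataset name, base model, adapter and data paths, top-k) are illustrative placeholders rather than values prescribed by this repository; check `tracing.py` for the tracing method names it actually supports.

```bash
# Hypothetical end-to-end run; every concrete value here is a placeholder.
pip install -r requirements.txt
python finetune.py --dataset my_dataset --model meta-llama/Llama-2-7b-hf
python generate.py --model meta-llama/Llama-2-7b-hf --lora ./output/lora_adapter \
    --dataset ./datasets/test/my_test_set.json
# Put the flagged generations into ./datasets/validation/ before tracing.
python tracing.py --model meta-llama/Llama-2-7b-hf --lora ./output/lora_adapter \
    --method <tracing_method> --topk 10
```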
For full fine-tuning, run `python full_funetune.py --dataset <dataset_name> --model <model_name>`. After training, modify the model-loading logic in `generate.py` and `tracing.py` to load your fully fine-tuned model instead of a LoRA adapter; refer to those files for implementation details.
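As an illustration, if `generate.py` and `tracing.py` load the base model with Hugging Face `transformers` and attach the LoRA adapter with `peft` (an assumption about the scripts, not something stated here), the change could look roughly like the sketch below; the model name and checkpoint paths are hypothetical placeholders.

```python
# Minimal sketch of swapping LoRA-adapter loading for a fully fine-tuned checkpoint.
# Assumes transformers + peft; all names and paths below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Llama-2-7b-hf"   # hypothetical base model
use_lora = False                               # set False after full fine-tuning

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

if use_lora:
    # LoRA workflow: load the base model, then attach the trained adapter.
    model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(model, "./output/lora_adapter")  # hypothetical adapter path
else:
    # Full fine-tuning workflow: load the fine-tuned checkpoint directly, no adapter to attach.
    model = AutoModelForCausalLM.from_pretrained("./output/full_finetune", torch_dtype=torch.bfloat16)

model.eval()
```

The same substitution applies anywhere the scripts wrap the base model with `PeftModel`.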
If you find this repo useful, please cite our paper.
```bibtex
@article{li2025did,
  title={Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing},
  author={Li, Zhe and Zhao, Wei and Li, Yige and Sun, Jun},
  journal={arXiv preprint arXiv:2510.02334},
  year={2025}
}
```
If you have any questions or would like to discuss the details, please contact [email protected].
We are grateful to the following GitHub repositories for their valuable codebases and datasets: