This is the source code for our work ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts.
- Create a new conda environment

  ```bash
  conda create --name ast
  ```

- Activate the conda environment

  ```bash
  conda activate ast
  ```

- Install big, platform-specific packages (via Conda, if you use that, or pip): `pytorch`, `accelerate`, `transformers`

- Install the other requirements

  ```bash
  pip3 install -r requirements.txt
  ```
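After installation, a quick sanity check can confirm that the core dependencies import cleanly inside the new environment. This is only an optional sketch; the actual version requirements come from `requirements.txt`.

```python
# Optional sanity check: confirm PyTorch, Accelerate, and Transformers
# import correctly inside the "ast" conda environment.
import torch
import accelerate
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("accelerate:", accelerate.__version__)
print("transformers:", transformers.__version__)
```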
For reproducibility, the results presented in our work use fixed train/dev/test conversation ID splits for the filtered non-toxic prefixes. Please download them from the `data` subfolder here and place them into `./data`.

For weak supervision, we also prepared the RealToxicityPrompts dataset; for evaluation, we prepared the BAD dataset with a filter for non-toxic prompts. These support files are available here and should be placed directly at the top level of the repository.
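A small check like the one below can confirm the files ended up where the scripts expect them. It assumes only the `./data` directory described above and makes no assumption about the individual file names.

```python
# Optional layout check: verify the split files were placed in ./data.
from pathlib import Path

data_dir = Path("./data")
assert data_dir.is_dir() and any(data_dir.iterdir()), "expected split files in ./data"
print("files in ./data:", sorted(p.name for p in data_dir.iterdir()))
```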
To train a toxicity elicitation model with the data given above, use
```bash
python main.py
```

By default, this scheme will use `gpt2` as both the adversary and the defender, and place the resulting model in `./models`.
Call the following for all options:

```bash
python main.py --help
```
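Once training finishes, the adversary weights can be inspected directly with the Transformers API. The sketch below assumes the directory written under `./models` is a standard Hugging Face checkpoint; the path shown is a placeholder for your actual weights directory.

```python
# Minimal sketch: load a trained adversary checkpoint and sample a
# candidate prompt continuation. The checkpoint path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./models/your_weights_dir"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("So, what do you think about", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```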
To evaluate the toxicity elicitation of your model, use
```bash
python main_eval.py ./models/your_weights_dir
```

By default, the evaluation results will be written to `./results` as a JSON file.
Adjust the number of turns and other options by following the instructions given by:

```bash
python main_eval.py --help
```
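Because the evaluation output is plain JSON, it can be loaded with the standard library for further analysis. The sketch below assumes only that one or more `.json` files appear in `./results`; the exact file name and schema depend on your run.

```python
# Minimal sketch: load evaluation results written to ./results as JSON.
import json
from pathlib import Path

for path in sorted(Path("./results").glob("*.json")):
    with path.open() as f:
        results = json.load(f)
    print(path.name, "->", type(results).__name__)
```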
If the code or ideas contained here were useful for your project, please cite our work:
```bibtex
@misc{hardy2024astprompter,
    title={ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts},
    author={Hardy, Amelia F and Liu, Houjun and Lange, Bernard and Kochenderfer, Mykel J},
    journal={arXiv preprint arXiv:2407.09447},
    year={2024}
}
```
If you run into any issues, please feel free to email {houjun,ahardy} at stanford dot edu.