
🎉 Representation noising 🔊👿 can mitigate harmful fine-tuning on LLMs 🎉

Code to replicate the Representation Noising 🔊👿 paper and tools for evaluating defences against harmful fine-tuning.

Please feel free to open an issue if you have any problems or questions (or contact the corresponding author).

The full code base is coming soon and has unfortunately not been added yet; things are very much a work in progress driven by specific requests, so please contact the corresponding author if you need something specific in the meantime.

Demo 💻

A demo you can run on Colab with an A100 High-RAM runtime (40 GB VRAM): notebooks/repnoise_demo.ipynb

Warning: The notebook contains offensive and harmful outputs

Warning #2: It is unlikely that the settings in the notebook will work for stronger attacks; an extensive grid search over the learning rate, alpha, beta, and the number of defence samples is always required to make RepNoise work (a sketch of such a sweep follows).
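To make that concrete, here is a minimal sketch of how such a sweep might be enumerated. The value ranges are illustrative assumptions, not settings from the paper:

import itertools

# Illustrative search space only; substitute ranges appropriate to your attack.
grid = {
    "learning_rate": [2e-5, 3e-5, 8e-5],
    "alpha": [0.1, 1.0],
    "beta": [0.001, 0.01],
    "defence_samples": [1_000, 10_000],
}

# Enumerate every configuration; train and evaluate RepNoise with each one.
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    print(config)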

Setup

We use poetry for dependency management; see the installation instructions at https://python-poetry.org/docs/#installation. You can then install the project dependencies with:

poetry install

Or, if you'd rather not use poetry:

pip install -r requirements.txt

Data 🗞️

Paired Refusal Data: The paired refusal data used in the paper is available at the following paths:

  • data/beavertails_with_refuslas_train.json
  • data/decoing_trust_with_refusals_train.json

For some experiments we also draw on these for attack construction.

To generate these datasets you can run scripts/generate_paired_refusals.sh
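The files are ordinary JSON, so they are easy to inspect directly. A minimal sketch, assuming the paths above and that each file holds a list of records (the schema is not documented here, so print a record to discover the actual field names):

import json

# Load the paired refusal training data and inspect one record.
with open("data/beavertails_with_refuslas_train.json") as f:
    examples = json.load(f)

print(len(examples))   # number of training examples
print(examples[0])     # one record, to reveal the field names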

Tour of the Code 🌇

The code is structured as follows (much of it is still missing at the moment):

  • scripts/ contains scripts for running experiments and generating data.
  • representation_noising/ contains the main codebase.
  • data/ contains the data used in the paper.

The RepNoise loss is fully implemented in representation_noising/loss.py; a simplified sketch of its shape follows.
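At a high level, the loss combines a stability term on harmless data, a gradient-ascent term on harmful data, and a term that pushes representations of harmful inputs toward Gaussian noise. The sketch below is a deliberate simplification of how those terms might be combined, with a plain MSE standing in for the distributional distance used in the paper; representation_noising/loss.py is the authoritative implementation:

import torch
import torch.nn.functional as F

def repnoise_loss_sketch(harmless_logits, harmless_labels,
                         harmful_logits, harmful_labels,
                         harmful_hidden, alpha, beta):
    # Stability: preserve ordinary language modelling on harmless data.
    stability = F.cross_entropy(harmless_logits, harmless_labels)
    # Ascent: unlearn harmful continuations by maximising their loss,
    # hence the subtraction in the return statement.
    ascent = F.cross_entropy(harmful_logits, harmful_labels)
    # Noise: pull hidden states of harmful inputs toward Gaussian noise
    # (MSE here is a stand-in for the distance used in the paper).
    noise_term = F.mse_loss(harmful_hidden, torch.randn_like(harmful_hidden))
    return stability - beta * ascent + alpha * noise_term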

Replicating the Paper 📊

TBD

Models 🤖

You can download the main models used in the paper from Hugging Face:

Baseline Model The main results use the Llama-2-7b chat model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Note that you must agree to Meta's license before using the models below.

The successfully attacked base model used in the paper (attacked at learning rate 3e-5) is available at: https://huggingface.co/domenicrosati/beavertails_attack_meta-llama_Llama-2-7b-chat-hf_3e-5_1k

Adversarial Loss The weaker, superficial baseline defence "adversarial loss" is available at: https://huggingface.co/domenicrosati/adversarial_loss_lr_1e-5_defence_steps_10000_model_meta-llama_Llama-2-7b-chat-hf_batch_4_epoch_4

A successfully attacked version of this model is available at: https://huggingface.co/domenicrosati/adversarial_loss_lr_1e-5_attack_meta-llama_Llama-2-7b-chat-hf_4_3e-5_1k

Representation Noising Our Representation Noising defence is available at: https://huggingface.co/domenicrosati/repnoise_0.001_beta

A successful attack of this model is available at: https://huggingface.co/domenicrosati/repnoise_0.001beta_attacked_3e-4
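All of the checkpoints above are standard causal language models on the Hugging Face Hub, so they can be loaded with transformers in the usual way, shown here for the defended model (loading the Llama-2 baseline additionally requires accepting Meta's license):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released RepNoise defence checkpoint from the Hugging Face Hub.
model_id = "domenicrosati/repnoise_0.001_beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)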

Statement on Dual-Use Risk and Downstream Harm of This Work

Naturally, the topic of this work is harmful output generated by large language models, so the data and the models' outputs will be harmful and offensive. We do not believe that releasing this code increases the risk of harmful uses of LLMs, since the data and harmful models are already generally available or trivially constructed. Harm research, and the standards around harm research in NLP, are complex; please feel free to contact the authors if you have any concerns.

By using this code you agree to use the code, models, data, and other artifacts only in the context of safety research, and at your own risk.

BibTeX for Citation 👨‍🔬

@misc{rosati2024representation,
      title={Representation noising effectively prevents harmful fine-tuning on LLMs}, 
      author={Domenic Rosati and Jan Wehner and Kai Williams and Łukasz Bartoszcze and David Atanasov and Robie Gonzales and Subhabrata Majumdar and Carsten Maple and Hassan Sajjad and Frank Rudzicz},
      year={2024},
      eprint={2405.14577},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
