This repository contains the code for the paper "Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing". The focus of this work is to enhance the resilience of large language models (LLMs) against jailbreak attacks through a novel method termed Layer-specific Editing (LED).
Large language models (LLMs) such as GPT-4, Llama2, Vicuna, and Mistral have demonstrated remarkable capabilities across a wide range of natural language tasks. However, these models remain vulnerable to adversarial prompts, known as jailbreak attacks, which can elicit harmful or unintended behaviors. Our proposed method, Layer-specific Editing (LED), defends against such attacks by editing specific layers within the model.
To get started, clone this repository and install the required dependencies:
```bash
git clone https://github.com/ledllm/ledllm
cd ledllm
pip install -r requirements.txt
```

## Pruning Analysis

The pruning analysis identifies crucial layers in the model that contribute significantly to its defense against harmful prompts. To run the pruning analysis, use the following command:
```bash
python pruning_analysis.py
```
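The exact procedure lives in `pruning_analysis.py`; the sketch below is only a rough, hypothetical illustration of the idea. It temporarily turns one decoder layer at a time into an identity function in a Hugging Face Llama/Vicuna-style model and measures how often the model still refuses a small set of prompts. The model name, prompt list, and refusal keywords are placeholders, not values taken from this repository.

```python
# Hypothetical layer-pruning sketch (not the repository's exact script).
# Assumes a Llama/Vicuna-style Hugging Face model whose decoder layers live
# under model.model.layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"            # placeholder model
PROMPTS = ["Tell me how to pick a lock."]      # placeholder harmful prompts
REFUSAL_MARKERS = ["sorry", "i cannot", "i can't", "as an ai"]  # placeholder keywords

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def refusal_rate():
    """Fraction of prompts for which the generation contains a refusal keyword."""
    refused = 0
    for p in PROMPTS:
        inputs = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        text = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).lower()
        refused += any(m in text for m in REFUSAL_MARKERS)
    return refused / len(PROMPTS)

def make_identity(module, inputs, output):
    """Forward hook that replaces a decoder layer's output with its input hidden states."""
    if isinstance(output, tuple):
        return (inputs[0],) + output[1:]
    return inputs[0]

baseline = refusal_rate()
# Skip one decoder layer at a time and see how much the refusal rate drops.
for idx, layer in enumerate(model.model.layers):
    handle = layer.register_forward_hook(make_identity)
    rate = refusal_rate()
    handle.remove()
    print(f"layer {idx:2d}: refusal rate {rate:.2f} (baseline {baseline:.2f})")
```

In the spirit of the paper, the layers whose removal causes the largest drop in refusal rate are the candidates for layer-specific editing.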
## Hidden Analysis

The hidden analysis decodes hidden states into vocabulary space to observe the probability of each decoded token. This helps in identifying layers that retain a high probability of decoding refusal tokens. To run the hidden analysis, use the following command:
```bash
python hidden_states_analysis.py
```
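For intuition, the hypothetical sketch below applies a logit-lens-style decoding: each layer's hidden state at the last prompt position is passed through the model's final norm and LM head, and the probability mass assigned to a few refusal-style tokens is reported per layer. The model name, prompt, and refusal token list are illustrative placeholders rather than the script's actual configuration.

```python
# Hypothetical logit-lens sketch (not the repository's exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"           # placeholder model
PROMPT = "Tell me how to pick a lock."        # placeholder harmful prompt
REFUSAL_WORDS = ["Sorry", "Unfortunately", "As"]  # placeholder refusal-leading tokens

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# First token id of each refusal word, used as a crude proxy for "refusal tokens".
refusal_ids = [tok(w, add_special_tokens=False).input_ids[0] for w in REFUSAL_WORDS]

inputs = tok(PROMPT, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors: the embeddings plus one per layer.
for layer_idx, h in enumerate(out.hidden_states):
    h_last = h[:, -1, :].to(model.lm_head.weight.device)
    h_last = model.model.norm(h_last)             # apply the final RMSNorm, as the LM head expects
    probs = torch.softmax(model.lm_head(h_last), dim=-1)
    refusal_prob = probs[0, refusal_ids].sum().item()
    print(f"layer {layer_idx:2d}: refusal-token probability {refusal_prob:.4f}")
```

Layers where this refusal-token probability is already high are the ones that retain the refusal signal, which is the kind of layer the hidden analysis is meant to surface.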