This repository contains the code for the paper "Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing". The focus of this work is to enhance the resilience of large language models (LLMs) against jailbreak attacks through a novel method termed Layer-specific Editing (LED).
Large language models (LLMs) such as GPT-4, Llama2, Vicuna, and Mistral have demonstrated remarkable capabilities across a wide range of natural language tasks. However, these models remain vulnerable to adversarial prompts, known as jailbreak attacks, which can elicit harmful or unintended behaviors. Our proposed method, Layer-specific Editing (LED), defends against such attacks by editing specific layers within the model.
To get started, clone this repository and install the required dependencies:
```bash
git clone https://github.com/ledllm/ledllm
cd ledllm
pip install -r requirements.txt
```

## Pruning Analysis

The pruning analysis identifies crucial layers in the model that contribute significantly to its defense against harmful prompts. To run the pruning analysis, use the following command:
```bash
python pruning_analysis.py
```
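The exact procedure lives in `pruning_analysis.py`; the sketch below is only a rough, hypothetical illustration of the idea. It temporarily turns one decoder layer at a time into an identity function in a Hugging Face Llama/Vicuna-style model and measures how often the model still refuses a small set of prompts. The model name, prompt list, and refusal keywords are placeholders, not values taken from this repository.

```python
# Hypothetical layer-pruning sketch (not the repository's exact script).
# Assumes a Llama/Vicuna-style Hugging Face model whose decoder layers live
# under model.model.layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"            # placeholder model
PROMPTS = ["Tell me how to pick a lock."]      # placeholder harmful prompts
REFUSAL_MARKERS = ["sorry", "i cannot", "i can't", "as an ai"]  # placeholder keywords

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def refusal_rate():
    """Fraction of prompts for which the generation contains a refusal keyword."""
    refused = 0
    for p in PROMPTS:
        inputs = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        text = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).lower()
        refused += any(m in text for m in REFUSAL_MARKERS)
    return refused / len(PROMPTS)

def make_identity(module, inputs, output):
    """Forward hook that replaces a decoder layer's output with its input hidden states."""
    if isinstance(output, tuple):
        return (inputs[0],) + output[1:]
    return inputs[0]

baseline = refusal_rate()
# Skip one decoder layer at a time and see how much the refusal rate drops.
for idx, layer in enumerate(model.model.layers):
    handle = layer.register_forward_hook(make_identity)
    rate = refusal_rate()
    handle.remove()
    print(f"layer {idx:2d}: refusal rate {rate:.2f} (baseline {baseline:.2f})")
```

In the spirit of the paper, the layers whose removal causes the largest drop in refusal rate are the candidates for layer-specific editing.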
## Hidden Analysis

The hidden analysis decodes hidden states into vocabulary space to observe the probability of each decoded token. This helps in identifying layers that retain a high probability of decoding refusal tokens. To run the hidden analysis, use the following command:
```bash
python hidden_states_analysis.py
```
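For intuition, the hypothetical sketch below applies a logit-lens-style decoding: each layer's hidden state at the last prompt position is passed through the model's final norm and LM head, and the probability mass assigned to a few refusal-style tokens is reported per layer. The model name, prompt, and refusal token list are illustrative placeholders rather than the script's actual configuration.

```python
# Hypothetical logit-lens sketch (not the repository's exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"           # placeholder model
PROMPT = "Tell me how to pick a lock."        # placeholder harmful prompt
REFUSAL_WORDS = ["Sorry", "Unfortunately", "As"]  # placeholder refusal-leading tokens

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# First token id of each refusal word, used as a crude proxy for "refusal tokens".
refusal_ids = [tok(w, add_special_tokens=False).input_ids[0] for w in REFUSAL_WORDS]

inputs = tok(PROMPT, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors: the embeddings plus one per layer.
for layer_idx, h in enumerate(out.hidden_states):
    h_last = h[:, -1, :].to(model.lm_head.weight.device)
    h_last = model.model.norm(h_last)             # apply the final RMSNorm, as the LM head expects
    probs = torch.softmax(model.lm_head(h_last), dim=-1)
    refusal_prob = probs[0, refusal_ids].sum().item()
    print(f"layer {layer_idx:2d}: refusal-token probability {refusal_prob:.4f}")
```

Layers where this refusal-token probability is already high are the ones that retain the refusal signal, which is the kind of layer the hidden analysis is meant to surface.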