Heretic: Fully automatic censorship removal for language models

Heretic is a tool that removes censorship (aka "safety alignment") from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024), with a TPE-based parameter optimizer powered by Optuna.

This approach enables Heretic to work completely automatically. Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. This results in a decensored model that retains as much of the original model's intelligence as possible. Using Heretic does not require an understanding of transformer internals. In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models.
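As a rough illustration of the second objective, the sketch below computes the KL divergence between two next-token distributions in plain Python. The function names, the toy logits, and the direction of the divergence (original relative to modified) are assumptions made for this example; they are not taken from Heretic's source code.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy logits standing in for the original and the modified model (made up).
original = softmax([2.0, 1.0, 0.1])
modified = softmax([1.8, 1.1, 0.3])

print(kl_divergence(original, original))  # identical distributions give zero
print(kl_divergence(original, modified))  # small positive value
```

A value near zero means the modified model's predictions on "harmless" prompts are nearly unchanged, which is why the original model scores 0 by definition in the table below.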

Running unsupervised with the default configuration, Heretic can produce decensored models that rival the quality of abliterations created manually by human experts:

Model	Refusals for "harmful" prompts	KL divergence from original model for "harmless" prompts
google/gemma-3-12b-it (original)	97/100	0 (by definition)
mlabonne/gemma-3-12b-it-abliterated-v2	3/100	1.04
huihui-ai/gemma-3-12b-it-abliterated	3/100	0.45
p-e-w/gemma-3-12b-it-heretic (ours)	3/100	0.16

The Heretic version, generated without any human effort, achieves the same level of refusal suppression as the other abliterations, but at a much lower KL divergence, indicating less damage to the original model's capabilities. (You can reproduce these numbers using Heretic's built-in evaluation functionality, e.g. heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic. Note that the exact values might be platform- and hardware-dependent. The table above was compiled using PyTorch 2.8 on an RTX 5090.)

Heretic supports most dense models, including many multimodal models, and several different MoE architectures. It does not yet support SSMs/hybrid models, models with inhomogeneous layers, or certain novel attention systems.

You can find a collection of models that have been decensored using Heretic on Hugging Face.

Documentation

Installation & Usage - Getting started guide (below)
DOCKER.md - Comprehensive Docker deployment guide
DEVELOPMENT.md - Contributing and development guide

Installation

Heretic requires Python 3.10+ and PyTorch 2.2+ with appropriate GPU support for your hardware.

Note: This project uses uv for fast, reliable Python dependency management. We recommend using one of the installation methods below rather than installing with pip.

Option 1: Using uv (Recommended)

uv is a fast Python package installer and resolver:

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/p-e-w/heretic.git
cd heretic

# Install dependencies (includes optional vLLM support)
uv sync --all-extras

# Run Heretic
uv run heretic Qwen/Qwen3-4B-Instruct-2507

Option 2: Using Conda

# Clone the repository
git clone https://github.com/p-e-w/heretic.git
cd heretic

# Create and activate the Conda environment
conda env create -f environment.yml
conda activate heretic

# Install Heretic in development mode
pip install -e .

# Run Heretic
heretic Qwen/Qwen3-4B-Instruct-2507

Note: Adjust the CUDA version in environment.yml to match your system (e.g., cuda=11.8 or cuda=12.1).

Option 3: Using Docker

Prerequisites: Docker with NVIDIA GPU support (nvidia-docker)

# Clone the repository
git clone https://github.com/p-e-w/heretic.git
cd heretic

# Build and run with Docker Compose (recommended)
docker-compose run heretic heretic Qwen/Qwen3-4B-Instruct-2507

# Or build and run with Docker directly
docker build -t heretic .
docker run --gpus all -it heretic heretic Qwen/Qwen3-4B-Instruct-2507

For detailed Docker usage, configuration, and troubleshooting, see DOCKER.md.

Replace Qwen/Qwen3-4B-Instruct-2507 with whatever model you want to decensor.

Usage

Once installed, simply run:

heretic MODEL_NAME

The process is fully automatic and does not require configuration; however, Heretic has a variety of configuration parameters that can be changed for greater control. Run heretic --help to see available command-line options, or look at config.default.toml if you prefer to use a configuration file.

At the start of a program run, Heretic benchmarks the system to determine the optimal batch size to make the most of the available hardware. On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B takes about 45 minutes.

After Heretic has finished decensoring a model, you are given the option to save the model, upload it to Hugging Face, chat with it to test how well it works, or any combination of those actions.

Using vLLM for faster inference

Heretic supports vLLM as an optional backend for significantly faster inference. vLLM can provide a 10-100x speedup in text generation compared to the standard transformers backend, and works well with both quantized and non-quantized models for evaluation purposes.

Architecture: The abliteration process (weight modification) always uses transformers, as it requires direct access to model weights. vLLM is used only for inference during model evaluation. This hybrid approach gives you the flexibility of transformers for the abliteration process and the speed of vLLM for evaluation.

Important note on quantized models: Heretic modifies model weights in place using transformers. This works with standard (non-quantized) models. For quantized models (AWQ/GPTQ), you should:

Start with the base non-quantized model for abliteration
Save the abliterated model
Optionally quantize the abliterated model afterwards (or use vLLM for fast inference with the non-quantized abliterated model)

Attempting to directly abliterate an already-quantized model may fail or produce unexpected results.

Installation: vLLM is an optional dependency that's included automatically when you use the installation methods above:

With uv: vLLM is included when you run uv sync --all-extras
With Conda: vLLM is listed in environment.yml and will be installed automatically
With Docker: vLLM is included in the Docker image

If vLLM fails to install on your system (it requires CUDA/ROCm), Heretic will automatically fall back to the transformers backend.
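The fallback behavior can be pictured with a small sketch. The helper resolve_backend below is hypothetical and is not Heretic's actual code; it only checks whether the vllm package is importable, whereas a real fallback would also need to handle failures during initialization.

```python
import importlib.util


def resolve_backend(requested, is_available=None):
    """Return "vllm" only if it was requested and can be imported.

    Hypothetical helper for illustration; falls back to "transformers"
    in every other case.
    """
    if is_available is None:
        # Default check: is the package importable in this environment?
        def is_available(name):
            return importlib.util.find_spec(name) is not None

    if requested == "vllm" and is_available("vllm"):
        return "vllm"
    return "transformers"


# Prints "vllm" if the package is installed, otherwise "transformers".
print(resolve_backend("vllm"))
```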

To use vLLM for evaluating a saved model:

heretic --model base/model --evaluate-model path/to/abliterated-model --inference-backend vllm

Or add to your config.toml:

inference_backend = "vllm"

When to use vLLM:

Evaluating pre-abliterated models (much faster)
Fast inference with non-quantized abliterated models
When you need high-throughput inference

When to use transformers (default):

Running the abliteration optimization process (required)
When vLLM is not available on your system
For maximum compatibility

Recommended workflow:

# Step 1: Start with the BASE (non-quantized) model for abliteration
heretic meta-llama/Llama-2-7b-chat-hf

# Step 2: Save the abliterated model (follow interactive prompts)
# The abliterated model will be saved to your chosen directory

# Step 3: Use vLLM for fast inference with the abliterated model
heretic --model meta-llama/Llama-2-7b-chat-hf \
        --evaluate-model ./saved-abliterated-model \
        --inference-backend vllm

Note: If you need a quantized model for deployment, quantize the abliterated model after saving it, rather than trying to abliterate an already-quantized model.

Troubleshooting vLLM:

If you experience out-of-memory errors with vLLM, try:

Lower GPU memory utilization: --vllm-gpu-memory-utilization 0.8
Set a smaller max sequence length: --vllm-max-model-len 2048
Use transformers backend instead: --inference-backend transformers

vLLM requires CUDA/ROCm. If vLLM fails to initialize, Heretic will automatically fall back to the transformers backend.

How it works

Heretic implements a parametrized variant of directional ablation. For each supported transformer component (currently, attention out-projection and MLP down-projection), it identifies the associated matrices in each transformer layer, and orthogonalizes them with respect to the relevant "refusal direction", inhibiting the expression of that direction in the result of multiplications with that matrix.
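The orthogonalization step can be written as W' = W - weight * r r^T W, where r is the unit-length direction. The following is a minimal pure-Python sketch of that formula on a toy matrix. The function names and the scalar weight argument are assumptions for illustration; Heretic itself operates on PyTorch tensors and this is not its implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate(matrix, direction, weight=1.0):
    """Return matrix - weight * r r^T matrix, with r the unit direction.

    matrix is a list of rows with shape (d_out, d_in); direction has
    length d_out, i.e. it lives in the matrix's output space.
    """
    r = normalize(direction)
    d_out, d_in = len(matrix), len(matrix[0])
    # r^T W: how strongly each column of the matrix writes along r.
    proj = [sum(r[i] * matrix[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [matrix[i][j] - weight * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]

W_ablated = ablate(W, r)
y = matvec(W_ablated, [0.5, -2.0])

# With weight=1.0, the output has no component left along r.
residual = sum(a * b for a, b in zip(normalize(r), y))
print(abs(residual) < 1e-9)  # True
```

With a weight below 1.0, the component along r is only partially removed, which is what makes a per-layer weight meaningful.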

Refusal directions are computed for each layer as a difference-of-means between the first-token residuals for "harmful" and "harmless" example prompts.
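A difference-of-means computation for a single layer might look like the sketch below. The toy residual vectors are invented, and whether the result is normalized afterwards is not stated here, so the sketch leaves it unnormalized.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def difference_of_means(harmful, harmless):
    """Mean of the "harmful" residuals minus mean of the "harmless" ones."""
    mean_harmful = mean_vector(harmful)
    mean_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mean_harmful, mean_harmless)]


# Made-up first-token residuals for one layer (two prompts per set).
harmful_residuals = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
harmless_residuals = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = difference_of_means(harmful_residuals, harmless_residuals)
print(direction)  # [2.0, -2.0, 0.0]
```

The third coordinate is identical in both sets, so it cancels out and contributes nothing to the direction.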

The ablation process is controlled by several optimizable parameters:

direction_index: Either the index of a refusal direction, or the special value "per layer", indicating that each layer should be ablated using the refusal direction associated with that layer.
max_weight, max_weight_position, min_weight, and min_weight_distance: For each component, these parameters describe the shape and position of the ablation weight kernel over the layers. The following diagram illustrates this:
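In code, one plausible reading of these four parameters is a piecewise-linear kernel: the weight equals max_weight at max_weight_position, falls off linearly to min_weight at a distance of min_weight_distance layers, and is zero beyond that. The exact shape is an assumption made for this sketch, as is the function name ablation_weight; consult the source code for the authoritative definition.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Ablation weight for one layer under an assumed linear kernel.

    Assumes min_weight_distance > 0. Layers farther than
    min_weight_distance from the peak are left untouched (weight 0).
    """
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0
    fraction = distance / min_weight_distance
    # Linear interpolation from max_weight (at the peak) to min_weight.
    return max_weight + fraction * (min_weight - max_weight)


# Illustrative parameter values, not Heretic defaults.
for layer in range(4, 18, 2):
    print(layer, round(ablation_weight(layer, 1.0, 10.0, 0.2, 5.0), 3))
```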

Heretic's main innovations over existing abliteration systems are:

The shape of the ablation weight kernel is highly flexible, which, combined with automatic parameter optimization, can improve the compliance/quality tradeoff. Non-constant ablation weights were previously explored by Maxime Labonne in gemma-3-12b-it-abliterated-v2.
The refusal direction index is a float rather than an integer. For non-integral values, the two nearest refusal direction vectors are linearly interpolated. This unlocks a vast space of additional directions beyond the ones identified by the difference-of-means computation, and often enables the optimization process to find a better direction than that belonging to any individual layer.
Ablation parameters are chosen separately for each component. I have found that MLP interventions tend to be more damaging to the model than attention interventions, so using different ablation weights can squeeze out some extra performance.
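The float-valued direction index from the second point above can be sketched as a linear interpolation between the per-layer direction vectors. The function name and the toy vectors are assumptions for illustration, and the sketch does not renormalize the result, since the text does not say whether that happens.

```python
import math


def interpolate_direction(directions, index):
    """Blend the two direction vectors nearest to a float index.

    directions holds one vector per layer; index must lie between
    0 and len(directions) - 1 inclusive.
    """
    lower = math.floor(index)
    upper = min(lower + 1, len(directions) - 1)
    t = index - lower  # fractional part decides the blend
    return [
        (1 - t) * a + t * b
        for a, b in zip(directions[lower], directions[upper])
    ]


# Made-up per-layer directions.
layer_directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(interpolate_direction(layer_directions, 0.5))  # [0.5, 0.5]
print(interpolate_direction(layer_directions, 1.0))  # [0.0, 1.0]
```

An integral index reproduces that layer's direction exactly, so the float parametrization is a strict superset of the integer one.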

Prior art

I'm aware of the following publicly available implementations of abliteration techniques:

Note that Heretic was written from scratch, and does not reuse code from any of those projects.

Acknowledgments

The development of Heretic was informed by:

The original abliteration paper (Arditi et al. 2024)
Maxime Labonne's article on abliteration, as well as some details from the model cards of his own abliterated models (see above)
Jim Lai's article describing "projected abliteration"

Citation

If you use Heretic for your research, please cite it using the following BibTeX entry:

@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
}

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

By contributing to this project, you agree to release your contributions under the same license.
