4n4s4zi/llm-jailbreaking

Just some code snippets for experimenting with LLM jailbreaking. Right now the code doesn't give particularly well-refined responses, due to a lack of fine-tuning and parameter optimization (explained below), but it does effectively bypass censorship and ethical restrictions for the most part. The test model is GPT-OSS 20B.

***** I love rabbits, would never hurt a rabbit, and the following screenshots are for demo purposes only *****.

Censored ollama interaction:

Uncensored output snippet using llama-cpp with chain-of-thought hijacking:

Setup

  • Install uv (Python package manager), llama-cpp-python, and a gguf-formatted gpt-oss model (a quick load check is sketched after this list).
  • If you have uv installed, you can just run the setup.sh script.
  • Optionally, install ollama with gpt-oss as a control for testing.
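
A quick way to confirm the environment works is to load the model with llama-cpp-python and run a trivial completion. This is just a sketch: the script name and model path below are placeholders, so point them at wherever your gguf file actually lives.

```python
# check_setup.py - minimal smoke test for the llama-cpp-python install.
# The model path is a placeholder; adjust it to your local gguf file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-20b.gguf",  # placeholder path
    n_ctx=4096,       # context window; raise or lower to fit your hardware
    verbose=False,
)

# Raw completion (no chat template applied) just to prove the model loads
# and produces tokens.
out = llm("The capital of France is", max_tokens=8)
print(out["choices"][0]["text"])
```

If you installed via uv, something like `uv run check_setup.py` should work, assuming llama-cpp-python is declared as a project dependency.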

Methodology

Chain-of-thought hijacking via template token injection for LLM censorship bypass

Many LLM jailbreaking techniques focus on high-level language manipulation, such as framing a dangerous request as part of a benign game scenario that you need help with. The method explored here instead exploits the low-level implementation of local LLM interaction and the nature of next-token generation.

The "internal monologue" used by LLMs (such as GPT-OSS) to reason about user prompts before providing a final output is a feature that generally produces "better" (as in more accurate, thorough, and well-refined) final output. However, it is also one of the primary mechanisms used to identify potentially dangerous requests and impose ethical and other censorship limitations on responses. Special tokens defining the format and flow from internal reasoning to final output (aka channels) are trained into the LLM as a kind of interaction wrapper after bulk training on a core knowledge base. The format of these tokens, along with other definitions (such as system prompt) and interaction formats for external resource usage (like a MCP server) are defined in a template for a LLM.

LLMs are "next-token generators" at their core. If we provide a prompt to a LLM along with a structured string that uses template tokens defining an internal reasoning thought process of our own choosing, we effectively do the reasoning for the LLM ourselves and can provide reasoning that affirms whatever ethically ambiguous prompt we want. This bypasses the censoship that occurs during the internal analysis part of the LLM's chain-of-thought process. The problem with this method is that models are generally well-tuned to provide "good" (though potentially censored) answers by themselves. When we hijack this process, the answers are unrestricted but generally of a lower quality because they are unrefined by the native analysis channel. This doesn't mean you can't get quality answers, it just means more tuning needs to be given to the the wording of the analysis hijack as well as other LLM parameters. Additionally, answers aren't entirely deterministic. The model can still deny requests or give varying answers. This method isn't the best or the only way to "jailbreak" an LLM, but it is an intersting one that is extensible and lends itself well to granular control of models running locally.
