Gnosis is a lightweight self-awareness head attached to a (frozen) LLM backbone that predicts a scalar correctness probability for a generated response by reading the model’s hidden states + attention maps.
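The actual head is implemented inside the local forks listed below; as a mental model, it behaves like a small probe over the backbone's internal signals. A minimal sketch of that idea, assuming mean-pooled last-layer hidden states and a two-layer MLP (illustrative choices only, not the released architecture, which also reads attention maps):

```python
import torch
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    """Illustrative stand-in for the Gnosis head: hidden states -> P(correct)."""

    def __init__(self, hidden_size: int, probe_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.GELU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden) from the frozen backbone;
        # attention_mask: (batch, seq) with 1 for real tokens.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)  # one scalar per sequence
```

The backbone stays frozen; only head parameters like these are trained, which is what the modified `SFTTrainer` below is for.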
- `transformers/` — local Transformers fork with Gnosis integrated into the model architecture
  - Implemented under: `transformers/src/transformers/models/gpt_oss` and `transformers/src/transformers/models/qwen3`
- `trl/` — local TRL fork with a modified `SFTTrainer` to train the Gnosis head
  - Key change: `trl/trl/trainer/sft_trainer.py`
- `open-r1/` — training code + configs
- `src/` — inference + data tools (quickstart, scoring, preprocessing scripts)
From repo root:
```bash
chmod +x scripts/setup_gnosis_env.sh
bash scripts/setup_gnosis_env.sh
conda activate Gnosis
```

Alternatively, set up the environment manually:

```bash
conda create -n Gnosis python=3.11 -y
conda activate Gnosis
pip install --upgrade pip wheel setuptools
pip install vllm==0.8.5.post1
python - <<'PY'
import torch; print("Torch:", torch.__version__)
PY
pip install flash-attn --no-build-isolation
pip uninstall -y transformers || true
pip install -e ./transformers
pip install -e "./trl[vllm]"
cd open-r1
GIT_LFS_SKIP_SMUDGE=1 pip install -e ".[dev]" --no-deps
cd ..
```
Verify that `transformers` and `trl` resolve to the local forks:

```bash
python - <<'PY'
import pathlib, transformers, trl
print("transformers →", pathlib.Path(transformers.__file__).resolve())
print("trl →", pathlib.Path(trl.__file__).resolve())
PY
```
```bash
export TOKENIZERS_PARALLELISM=false
```

In this example, we first generate a solution for a single question (via vLLM or HF generation), then run Gnosis on (prompt + answer) to output a scalar correctness probability.
Task options: `math`, `trivia`, `mmlu_pro`

- `math` / reasoning → step-by-step; final answer in `\boxed{}`
- `trivia` → short factoid; final answer in `\boxed{}`
- `mmlu_pro` → multiple-choice; final answer is only the choice letter in `\boxed{}`
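Since every task format above puts the final answer in `\boxed{}`, verification-style checks can pull that span out of a completion. A minimal, brace-aware helper (an illustrative sketch, not the repo's scoring code):

```python
def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} span in a completion, if any."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    # Walk forward, tracking brace depth so nested braces survive intact.
    for j in range(i, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return text[i:j]
    return None  # unbalanced braces


assert extract_boxed(r"The count is \boxed{3}.") == "3"
```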
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from vllm import LLM
from src.demo import (
    build_chat_prompt,
    make_vllm_sampling_params,
    generate_with_vllm,
    generate_with_hf,
    correctness_prob,
)
GNOSIS_MODEL_ID = "Trained_gnosis_model"
VLLM_MODEL_ID = "Qwen/Qwen3-1.7B"
USE_VLLM = False
SYSTEM_PROMPTS = {
    "math": "Please reason step by step, and put your final answer within \\boxed{}.",
    "trivia": "This is a trivia question. Put your final answer within \\boxed{}.",
    "mmlu_pro": "You are solving multiple-choice questions. Please reason step by step, and put your final answer with only the choice letter within \\boxed{}.",
}
tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt=SYSTEM_PROMPTS["math"],
)
# Generate an answer, either with vLLM or with plain HF generation.
if USE_VLLM:
    llm = LLM(
        VLLM_MODEL_ID,
        tensor_parallel_size=1,
        max_model_len=12000,
        dtype="bfloat16",
        gpu_memory_utilization=0.50,
        trust_remote_code=True,
    )
    sp = make_vllm_sampling_params(temperature=0.6, top_p=0.95, max_tokens=10_000)
    answer = generate_with_vllm(llm, prompt, sp)
else:
    answer = generate_with_hf(
        model, tokenizer, prompt, torch.device("cuda"),
        max_new_tokens=10_000, temperature=0.6, top_p=0.95,
    )

# Score (prompt + answer) with the Gnosis head.
p_correct = correctness_prob(
    model, tokenizer, prompt + answer, torch.device("cuda"), max_len_for_scoring=None
)
print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")Training begins with a simple pipeline: generate model completions (per dataset/benchmark) → verify them into binary correctness labels → merge + rebalance tasks (e.g., math + trivia) into one SFT-ready Parquet dataset.
➡️ Full, step-by-step instructions are provided in DATA_PREPROCESS.md.
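For orientation, the merge + rebalance step amounts to something like the sketch below; the file paths and the equal-per-task downsampling are assumptions for illustration, and DATA_PREPROCESS.md remains the authoritative recipe:

```python
import pandas as pd

# Hypothetical per-task Parquet files carrying binary correctness labels
# produced by the verification step.
math = pd.read_parquet("data/math_verified.parquet")      # assumed path
trivia = pd.read_parquet("data/trivia_verified.parquet")  # assumed path

# Rebalance: downsample so each task contributes equally, then shuffle.
n = min(len(math), len(trivia))
merged = pd.concat(
    [math.sample(n, random_state=0), trivia.sample(n, random_state=0)],
    ignore_index=True,
).sample(frac=1.0, random_state=0)

merged.to_parquet("data/sft_mixed.parquet")  # one SFT-ready dataset
```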
Training configs live under:

`open-r1/recipes/training/` (per-backbone YAMLs, e.g., Qwen3 / GPT-OSS)

Example config:

`open-r1/recipes/training/Qwen3/Qwen3-1.7B_hybrid_gnosis.yaml`
To train:
```bash
accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  src/open_r1/sft.py \
  --config recipes/training/Qwen3/Qwen3-1.7B_hybrid_gnosis.yaml
```

Note: This setup is currently configured for 2× A100 GPUs. Adjust the Accelerate/DeepSpeed config (and batch sizes, gradient accumulation, etc.) to match your available hardware.
We provide a convenience wrapper script to run the scorer on multiple benchmark shard directories (e.g., Math / TriviaQA / MMLU-Pro) and write all outputs under one folder.
Script: `src/evaluation/scripts/Gnosis_run_all_scoring.sh`

It calls: `src/evaluation/score_completions_Gnosis_outputscores_script_version.py`
MODEL="Path_to_trained_gnosis_backbone"MATH10_DIR=...,TRIVIA_DIR=...,MMLUPRO_DIR=...(dirs that containshard-*.parquet)OUT_BASE="outputs/scored_runs/Gnosis"
```bash
bash src/evaluation/scripts/Gnosis_run_all_scoring.sh
```

Outputs are saved in:

`outputs/scored_runs/Gnosis/scored/<model_name>/`
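Once scoring finishes, the shards can be loaded back for analysis. A small sketch, assuming Parquet shards under that output folder and a hypothetical `gnosis_score` column (check the scorer's actual column names):

```python
import glob

import pandas as pd

model_name = "my_gnosis_model"  # hypothetical; use the folder the script created
paths = sorted(glob.glob(f"outputs/scored_runs/Gnosis/scored/{model_name}/*.parquet"))

# Concatenate all scored shards and summarize the predicted correctness.
scored = pd.concat([pd.read_parquet(p) for p in paths], ignore_index=True)
print(f"{len(scored)} rows; mean P(correct) = {scored['gnosis_score'].mean():.4f}")
```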
If you find our work useful, please consider citing our paper in your research.
```bibtex
@article{ghasemabadi2025can,
  title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
  author={Ghasemabadi, Amirhosein and Niu, Di},
  journal={arXiv preprint arXiv:2512.20578},
  year={2025}
}
```