
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits

     
Gnosis demo

Figure: Overview of our Gnosis self-awareness mechanism and its performance.

Gnosis is a lightweight self-awareness head attached to a frozen LLM backbone. It predicts a scalar correctness probability for a generated response by reading the model's hidden states and attention maps.
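For intuition only, here is a minimal sketch of what such a correctness head could look like. It is not the repository's implementation; the module name, mean-pooling scheme, and hidden size are illustrative assumptions, and the sketch ignores the attention-map features that Gnosis also reads:

import torch
import torch.nn as nn

class CorrectnessHead(nn.Module):
    """Toy correctness head: mean-pools hidden states from a frozen backbone
    and maps them to a single probability in [0, 1]."""

    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
        )

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden_size) from the frozen backbone
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return torch.sigmoid(self.proj(pooled)).squeeze(-1)

The backbone itself stays frozen; only a head like this is trained on binary correctness labels.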


📁 Repository layout

  • transformers/ — local Transformers fork with Gnosis integrated into the model architecture

    • Implemented under:
      • transformers/src/transformers/models/gpt_oss
      • transformers/src/transformers/models/qwen3
  • trl/ — local TRL fork with a modified SFTTrainer to train the Gnosis head

    • Key change:
      • trl/trl/trainer/sft_trainer.py
  • open-r1/ — training code + configs

  • src/ — inference + data tools (quickstart, scoring, preprocessing scripts)


🧩 Installation

✅ Option A: One-command setup

From repo root:

chmod +x scripts/setup_gnosis_env.sh
bash scripts/setup_gnosis_env.sh
conda activate Gnosis

🛠️ Option B: Manual install (exact steps)

conda create -n Gnosis python=3.11 -y
conda activate Gnosis

pip install --upgrade pip wheel setuptools
pip install vllm==0.8.5.post1

python - <<'PY'
import torch; print("Torch:", torch.__version__)
PY

pip install flash-attn --no-build-isolation

pip uninstall -y transformers || true
pip install -e ./transformers
pip install -e "./trl[vllm]"

cd open-r1
GIT_LFS_SKIP_SMUDGE=1 pip install -e ".[dev]" --no-deps
cd ..

python - <<'PY'
import pathlib, transformers, trl
print("transformers →", pathlib.Path(transformers.__file__).resolve())
print("trl          →", pathlib.Path(trl.__file__).resolve())
PY

export TOKENIZERS_PARALLELISM=false

⚡ Quickstart: Use Gnosis on a single question

In this example, we first generate a solution for a single question (via vLLM or HF generation), then run Gnosis on (prompt + answer) to output a scalar correctness probability.

Task options: math, trivia, mmlu_pro

  • math / reasoning → step-by-step; final in \boxed{}
  • trivia → short factoid; final in \boxed{}
  • mmlu_pro → multiple-choice; final is only the letter in \boxed{}
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from vllm import LLM
from src.demo import (
    build_chat_prompt,
    make_vllm_sampling_params,
    generate_with_vllm,
    generate_with_hf,
    correctness_prob,
)

GNOSIS_MODEL_ID = "Trained_gnosis_model"  # path/ID of your trained Gnosis checkpoint
VLLM_MODEL_ID = "Qwen/Qwen3-1.7B"         # backbone used for generation when USE_VLLM is True
USE_VLLM = False                          # set True to generate with vLLM instead of HF generation

SYSTEM_PROMPTS = {
    "math": "Please reason step by step, and put your final answer within \\boxed{}.",
    "trivia": "This is a trivia question. Put your final answer within \\boxed{}.",
    "mmlu_pro": "You are solving multiple-choice questions. Please reason step by step, and put your final answer with only the choice letter within \\boxed{}."
}

tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt=SYSTEM_PROMPTS["math"],
)

if USE_VLLM:
    llm = LLM(
        VLLM_MODEL_ID,
        tensor_parallel_size=1,
        max_model_len=12000,
        dtype="bfloat16",
        gpu_memory_utilization=0.50,
        trust_remote_code=True,
    )
    sp = make_vllm_sampling_params(temperature=0.6, top_p=0.95, max_tokens=10_000)
    answer = generate_with_vllm(llm, prompt, sp)
else:
    answer = generate_with_hf(
        model, tokenizer, prompt, torch.device("cuda"),
        max_new_tokens=10_000, temperature=0.6, top_p=0.95
    )

p_correct = correctness_prob(
    model, tokenizer, prompt + answer, torch.device("cuda"), max_len_for_scoring=None
)

print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")

🏋️ Training Gnosis

🧪 Step 1 — Data generation

Training begins with a simple pipeline: generate model completions (per dataset/benchmark) → verify them into binary correctness labels → merge and rebalance tasks (e.g., math + trivia) into one SFT-ready Parquet dataset.

➡️ Full, step-by-step instructions are provided in DATA_PREPROCESS.md.
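To make the shape of the pipeline concrete, here is a minimal, self-contained sketch of the verify-and-merge step. The toy verifier, row schema, and output file name are illustrative assumptions; follow DATA_PREPROCESS.md for the actual scripts:

import pandas as pd

def is_correct(completion: str, gold: str) -> bool:
    """Toy verifier: compare the \\boxed{...} answer to the gold answer.
    The real pipeline uses proper math/trivia verifiers (see DATA_PREPROCESS.md)."""
    start = completion.rfind("\\boxed{")
    if start == -1:
        return False
    end = completion.find("}", start)
    if end == -1:
        return False
    answer = completion[start + len("\\boxed{"):end]
    return answer.strip() == gold.strip()

# Example rows as they might come out of the generation step (schema is assumed).
rows = [
    {"task": "math",   "prompt": "1+1?",               "completion": "... \\boxed{2}",     "gold": "2"},
    {"task": "math",   "prompt": "2+2?",               "completion": "... \\boxed{5}",     "gold": "4"},
    {"task": "trivia", "prompt": "Capital of France?", "completion": "... \\boxed{Paris}", "gold": "Paris"},
    {"task": "trivia", "prompt": "Largest planet?",    "completion": "... \\boxed{Mars}",  "gold": "Jupiter"},
]

df = pd.DataFrame(rows)
df["label"] = [int(is_correct(c, g)) for c, g in zip(df["completion"], df["gold"])]

# Rebalance: downsample every (task, label) group to the size of the smallest group,
# then write a single SFT-ready Parquet file.
n = df.groupby(["task", "label"]).size().min()
balanced = df.groupby(["task", "label"], group_keys=False).sample(n=n, random_state=0)
balanced.to_parquet("sft_gnosis_train.parquet", index=False)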

🚀 Step 2 — Train with open-r1

Training configs live under:

  • open-r1/recipes/training/ (per-backbone YAMLs, e.g., Qwen3, GPT-OSS)

Example config:

  • open-r1/recipes/training/Qwen3/Qwen3-1.7B_hybrid_gnosis.yaml

To train (run from the open-r1/ directory, since the config paths below are relative to it):

accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
  src/open_r1/sft.py \
  --config recipes/training/Qwen3/Qwen3-1.7B_hybrid_gnosis.yaml

Note: This setup is currently configured for 2× A100 GPUs. Adjust the Accelerate/DeepSpeed config (and batch sizes, gradient accumulation, etc.) to match your available hardware.

📊 Evaluation

We provide a convenience wrapper script to run the scorer on multiple benchmark shard directories (e.g., Math / TriviaQA / MMLU-Pro) and write all outputs under one folder.

Script: src/evaluation/scripts/Gnosis_run_all_scoring.sh
It calls: src/evaluation/score_completions_Gnosis_outputscores_script_version.py

Edit these paths in the script:

  • MODEL="Path_to_trained_gnosis_backbone"
  • MATH10_DIR=..., TRIVIA_DIR=..., MMLUPRO_DIR=... (dirs that contain shard-*.parquet)
  • OUT_BASE="outputs/scored_runs/Gnosis"

Run

bash src/evaluation/scripts/Gnosis_run_all_scoring.sh

Outputs are saved in: outputs/scored_runs/Gnosis/scored/<model_name>/
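The scored shards can then be aggregated for analysis. A minimal sketch, assuming the scored Parquet files contain a Gnosis probability column and a binary correctness column (the column names gnosis_prob and label are illustrative assumptions, not the script's documented schema):

import glob
import pandas as pd
from sklearn.metrics import roc_auc_score

# Collect every scored shard for one model; replace <model_name> with your run's folder.
shards = glob.glob("outputs/scored_runs/Gnosis/scored/<model_name>/**/*.parquet", recursive=True)
df = pd.concat([pd.read_parquet(p) for p in shards], ignore_index=True)

# How well do the predicted correctness probabilities separate right from wrong answers?
print("Samples :", len(df))
print("AUROC   :", roc_auc_score(df["label"], df["gnosis_prob"]))
print("Acc@0.5 :", ((df["gnosis_prob"] > 0.5) == df["label"].astype(bool)).mean())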


Star History

Star History Chart

Citation

If you find our work useful, please consider citing our paper in your research.

@article{ghasemabadi2025can,
  title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
  author={Ghasemabadi, Amirhosein and Niu, Di},
  journal={arXiv preprint arXiv:2512.20578},
  year={2025}
}