Post-hoc calibration without retraining for large language models. This toolkit turns a raw prompt into:
- a bounded hallucination risk using the Expectation-level Decompression Law (EDFL), and
- a decision to ANSWER or REFUSE under a target SLA, with transparent math (nats).
- Multi-Provider Support: Works with OpenAI, Anthropic (Claude), Hugging Face, and Ollama models
- No Retraining Required: Pure inference-time calibration
- Two Deployment Modes:
  - Evidence-based: prompts include evidence/context; rolling priors are built by erasing that evidence
  - Closed-book: prompts have no evidence; rolling priors are built by semantic masking of entities/numbers/titles
- Mathematically Grounded: Based on the EDFL/B2T/ISR framework from the NeurIPS 2024 preprint
- Install & Setup
- Supported Model Providers
- Quick Start Examples
- Core Mathematical Framework
- Understanding System Behavior
- Two Ways to Build Rolling Priors
- API Surface
- Calibration & Validation
- Practical Considerations
- Project Layout
- Deployment Options
# Core requirement
pip install --upgrade openai
# For additional providers (optional)
pip install anthropic              # For Claude models
pip install transformers torch     # For local Hugging Face models
pip install ollama                 # For Ollama models
pip install requests               # For HTTP-based backends

# For OpenAI
export OPENAI_API_KEY=sk-...
# For Anthropic (Claude)
export ANTHROPIC_API_KEY=sk-ant-...
# For Hugging Face Inference API
export HF_TOKEN=hf_...

The toolkit now supports multiple LLM providers through universal backend adapters:
- OpenAI:
  - GPT-4o, GPT-4o-mini, and other Chat Completions models
  - Requires OPENAI_API_KEY
- Anthropic:
  - Claude 3.5 Sonnet, Claude 3 Opus, and other Claude models
  - Requires the anthropic package and ANTHROPIC_API_KEY
- Hugging Face:
  - Local Transformers: Run models locally with transformers
  - TGI Server: Connect to Text Generation Inference servers
  - Inference API: Use hosted models via Hugging Face API
- Ollama:
  - Run any Ollama-supported model locally
  - Supports both Python SDK and HTTP API
- OpenRouter:
  - Single API for 100+ models from OpenAI, Anthropic, Google, Meta, and more
  - Automatic fallbacks and load balancing
  - Often cheaper than direct API access due to volume aggregation
  - Built-in rate limiting and retry logic

from hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner
backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)
item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=7,
    m=6,
    skeleton_policy="closed_book"
)
metrics = planner.run(
    [item], 
    h_star=0.05,           # Target 5% hallucination max
    isr_threshold=1.0,     # Standard ISR gate
    margin_extra_bits=0.2, # Safety margin
    B_clip=12.0,          # Clipping bound
    clip_mode="one-sided" # Conservative mode
)
for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Risk bound: {m.roh_bound:.3f}")

from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend
# Use Claude instead of GPT
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
planner = OpenAIPlanner(backend, temperature=0.3)
# Rest of the code remains identical
items = [OpenAIItem(prompt="What is quantum entanglement?", n_samples=7, m=6)]
metrics = planner.run(items, h_star=0.05)

from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import HuggingFaceBackend
# Run Llama locally
backend = HuggingFaceBackend(
    mode="transformers",
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto"  # or "cuda" or "cpu"
)
planner = OpenAIPlanner(backend, temperature=0.3)
# Same evaluation flow
metrics = planner.run([...], h_star=0.05)

from hallucination_toolkit import OpenAIPlanner  # needed for the planner below
from htk_backends import HuggingFaceBackend
# Connect to a Text Generation Inference server
backend = HuggingFaceBackend(
    mode="tgi",
    tgi_url="http://localhost:8080"
)
planner = OpenAIPlanner(backend, temperature=0.3)

from hallucination_toolkit import OpenAIPlanner  # needed for the planner below
from htk_backends import HuggingFaceBackend
import os
# Use Hugging Face's hosted models
backend = HuggingFaceBackend(
    mode="inference_api",
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    hf_token=os.environ["HF_TOKEN"]
)
planner = OpenAIPlanner(backend, temperature=0.3)

OpenRouter provides access to 100+ models through a single API, making it ideal for comparing hallucination bounds across providers:

from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import OpenRouterBackend
# Access any model through OpenRouter's unified API
backend = OpenRouterBackend(
    model="openrouter/auto",                  # Auto-selects best available model
    # model="anthropic/claude-3.5-sonnet",    # Or specify exact model
    # api_key="...",                          # Uses OPENROUTER_API_KEY env var if not provided
    http_referer="https://your.app",          # Optional but recommended
    x_title="EDFL Decision Head (prod)",      # Optional app identifier
    providers={"allow": ["anthropic", "google", "openai"]},  # Optional: limit providers
)
planner = OpenAIPlanner(
    backend=backend,
    temperature=0.5,
    max_tokens_decision=8,     # Tiny JSON decision head
    q_floor=None,              # Or set your prior floor
)
items = [OpenAIItem(
    prompt="What is quantum entanglement?", 
    n_samples=3, 
    m=6, 
    skeleton_policy="auto"
)]
metrics = planner.run(
    items, 
    h_star=0.05, 
    isr_threshold=1.0, 
    B_clip=12.0, 
    clip_mode="one-sided"
)
for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"ISR: {m.isr:.3f}, RoH bound: {m.roh_bound:.3f}")

Why OpenRouter for this toolkit?
- Test calibration across many models without managing multiple API keys
- Automatic fallbacks ensure high availability for production deployments
- Cost optimization through intelligent routing
- Perfect for A/B testing different models' hallucination characteristics

from hallucination_toolkit import OpenAIPlanner  # needed for the planner below
from htk_backends import OllamaBackend
# Use any Ollama model
backend = OllamaBackend(
    model="llama3.1:8b-instruct",
    host="http://localhost:11434"  # Default Ollama port
)
planner = OpenAIPlanner(backend, temperature=0.3)

AnthropicBackend(
    model="claude-3-5-sonnet-latest",  # or any Claude model
    api_key=None,                       # Uses ANTHROPIC_API_KEY env var if None
    request_timeout=60.0
)

Requirements: pip install anthropic

OpenRouterBackend(
    model="openrouter/auto",           # Auto-routing or specific model
    api_key=None,                      # Uses OPENROUTER_API_KEY env var
    http_referer="https://your.app",   # Recommended for tracking
    x_title="Your App Name",           # Optional identifier
    providers={"allow": ["anthropic", "google"]},  # Optional filtering
)

Requirements: pip install openai (OpenRouter uses an OpenAI-compatible API)

Available models include:
- anthropic/claude-3.5-sonnet
- openai/gpt-4-turbo
- google/gemini-pro
- meta-llama/llama-3-70b-instruct
- mistralai/mixtral-8x7b
- See OpenRouter models for full list

The Hugging Face backend supports three operational modes:
HuggingFaceBackend(
    mode="transformers",
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",           # GPU allocation strategy
    torch_dtype="float16",       # Optional: precision setting
    trust_remote_code=True,      # For custom model code
    model_kwargs={}              # Additional model parameters
)

Requirements: pip install transformers torch

HuggingFaceBackend(
    mode="tgi",
    tgi_url="http://localhost:8080",  # Your TGI server URL
    model_id=None                      # Not needed for TGI
)

Requirements: pip install requests and a running TGI server

HuggingFaceBackend(
    mode="inference_api",
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    hf_token="hf_..."  # Your Hugging Face token
)

Requirements: pip install requests and a Hugging Face account

OllamaBackend(
    model="llama3.1:8b-instruct",  # Any Ollama model
    host="http://localhost:11434",  # Ollama server URL
    request_timeout=60.0
)

Requirements: pip install ollama (optional) or pip install requests, and Ollama installed locally

Here's a complete example comparing different providers on the same prompt:
from hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner
from htk_backends import AnthropicBackend, HuggingFaceBackend, OllamaBackend
# Define test prompt
prompt = "What are the main differences between quantum and classical computing?"
item = OpenAIItem(prompt=prompt, n_samples=5, m=6, skeleton_policy="closed_book")
# Test configuration
config = dict(
    h_star=0.05,
    isr_threshold=1.0,
    margin_extra_bits=0.2,
    B_clip=12.0,
    clip_mode="one-sided"
)
# Compare providers
providers = {
    "GPT-4o-mini": OpenAIBackend(model="gpt-4o-mini"),
    "Claude-3.5": AnthropicBackend(model="claude-3-5-sonnet-latest"),
    "Llama-3.1": HuggingFaceBackend(mode="transformers", model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
    "Ollama": OllamaBackend(model="llama3.1:8b-instruct")
}
results = {}
for name, backend in providers.items():
    try:
        planner = OpenAIPlanner(backend, temperature=0.3)
        metrics = planner.run([item], **config)
        results[name] = metrics[0]
        print(f"{name}: {'ANSWER' if metrics[0].decision_answer else 'REFUSE'} (RoH={metrics[0].roh_bound:.3f})")
    except Exception as e:
        print(f"{name}: Error - {e}")

Let the binary event $\mathcal{A}$ be the event whose probability you want to guarantee (for example, that the model answers in decision mode).
Build an ensemble of content-weakened prompts (the rolling priors) $\{S_k\}_{k=1}^{m}$. For the realized answer $y$, estimate:
- Information budget: $$\bar{\Delta} = \tfrac{1}{m}\sum_k \mathrm{clip}_+(\log P(y) - \log S_k(y), B)$$ (one-sided clipping; default $B=12$ nats to prevent outliers while maintaining conservative bounds).
- Prior masses: $q_k = S_k(\mathcal{A})$, with:
  - $\bar{q}=\tfrac{1}{m}\sum_k q_k$ (average prior for EDFL bound)
  - $q_{\text{lo}}=\min_k q_k$ (worst-case prior for SLA gating)

By EDFL, the achievable reliability is bounded by:

$$\bar{\Delta} \ge \mathrm{KL}(\mathrm{Ber}(p) \,\|\, \mathrm{Ber}(\bar{q})) \;\Rightarrow\; p \le p_{\max}(\bar{\Delta}, \bar{q})$$

Thus the hallucination risk (error) is bounded by $\overline{\mathrm{RoH}} \le 1 - p_{\max}$.

For target hallucination rate $h^*$:
- Bits-to-Trust: $\mathrm{B2T} = \mathrm{KL}(\mathrm{Ber}(1-h^*) \,\|\, \mathrm{Ber}(q_{\text{lo}}))$
- Information Sufficiency Ratio: $\mathrm{ISR} = \bar{\Delta}/\mathrm{B2T}$
- ANSWER iff $\mathrm{ISR}\ge 1$ and $\bar{\Delta} \ge \mathrm{B2T} + \text{margin}$ (default margin ≈ 0.2 nats)

Why two priors? The gate uses the worst-case $q_{\text{lo}}$ for strict SLA compliance. The RoH bound uses the average $\bar{q}$ per EDFL theory. This dual approach ensures conservative safety while providing realistic risk bounds.
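
The quantities above can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the toolkit's own code: the function names are hypothetical, `clip_+` is assumed to clamp each log-ratio to `[0, B]`, `p_max` is assumed to be the largest `p` with `KL(Ber(p) || Ber(q_bar)) <= delta`, and B2T is assumed to be zero when the worst-case prior already meets the target.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats; inputs are clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def delta_bar(log_p_y: float, log_s_y: list[float], B: float = 12.0) -> float:
    """Information budget: mean of the log-ratios clamped to [0, B] (assumed clip_+)."""
    lifts = [min(max(log_p_y - log_s, 0.0), B) for log_s in log_s_y]
    return sum(lifts) / len(lifts)


def p_max(delta: float, q_bar: float, iters: int = 60) -> float:
    """Largest p in [q_bar, 1] with KL(Ber(p) || Ber(q_bar)) <= delta, by bisection."""
    if kl_bernoulli(1.0, q_bar) <= delta:
        return 1.0
    lo, hi = q_bar, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mid, q_bar) <= delta:
            lo = mid
        else:
            hi = mid
    return lo


def roh_bound(delta: float, q_bar: float) -> float:
    """Hallucination risk bound 1 - p_max, using the average prior."""
    return 1.0 - p_max(delta, q_bar)


def decide(delta: float, q_lo: float, h_star: float = 0.05,
           isr_threshold: float = 1.0, margin: float = 0.2) -> bool:
    """ANSWER (True) iff ISR >= threshold and delta >= B2T + margin."""
    # Assumption: no extra information is needed if q_lo already meets the target.
    b2t = 0.0 if q_lo >= 1.0 - h_star else kl_bernoulli(1.0 - h_star, q_lo)
    isr = delta / b2t if b2t > 0 else float("inf")
    return isr >= isr_threshold and delta >= b2t + margin
```

For example, `decide(0.0, 0.1)` refuses because no information lift is available, while `decide(5.0, 0.1)` answers because 5 nats exceeds the roughly 2 nats required plus the margin.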
The toolkit exhibits different behaviors across query types, which is mathematically consistent with the framework:
Observation (simple arithmetic queries): May abstain despite apparent simplicity
Explanation:
- Models often attempt answers even with masked numbers (pattern recognition)
- This yields low information lift $\bar{\Delta} \approx 0$ between full prompt and skeletons
- Despite potentially low EDFL risk bound, worst-case prior gate triggers abstention (ISR < 1)
Observation (named-entity factoid queries): Generally answered with confidence
Explanation:
- Masking entities/dates substantially reduces answer propensity in skeletons
- Restoring these yields large $\bar{\Delta}$ that clears B2T threshold
- System answers with tight EDFL risk bound
This is not a bug but a feature: The framework prioritizes safety through worst-case guarantees while providing realistic average-case bounds.
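
A worked instance of the gate may make the two patterns concrete. The prior and budget values below are made-up illustrations, not measurements from the toolkit. With $h^* = 0.05$ and a worst-case prior $q_{\text{lo}} = 0.1$:

$$\mathrm{B2T} = 0.95\,\ln\tfrac{0.95}{0.1} + 0.05\,\ln\tfrac{0.05}{0.9} \approx 2.139 - 0.145 \approx 1.99 \text{ nats}$$

A query whose skeletons barely change the answer, say $\bar{\Delta} \approx 0.1$ nats, gives $\mathrm{ISR} \approx 0.1 / 1.99 \approx 0.05 < 1$ and is refused. A query where restoring the masked content lifts the answer strongly, say $\bar{\Delta} = 4.0$ nats, gives $\mathrm{ISR} \approx 2.0$ and also clears $\mathrm{B2T} + 0.2 \approx 2.19$ nats, so it is answered.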
Different model providers may exhibit varying behaviors:
- OpenAI/Anthropic: Generally produce clean JSON decisions with high compliance
- Hugging Face (Local): May require instruction-tuned variants for best results
- Ollama: Performance depends on the specific model; instruction-tuned models recommended
- Base Models: May need adjusted prompting or higher sampling for stable priors
- Prompt contains a field like Evidence: (or JSON keys)
- Skeletons erase the evidence content but preserve structure and roles; then permute blocks deterministically (seeded)
- Decision head: "Answer only if the provided evidence is sufficient; otherwise refuse."
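
One way to picture the erase-then-permute step is the sketch below. It is an assumption-laden illustration rather than the toolkit's implementation: `split_blocks`, `render`, and `build_skeleton` are hypothetical helpers, blocks are assumed to start at lines of the form `Name: ...`, and the placeholder `[…]` is borrowed from the closed-book masking examples.

```python
import random
import re

# Assumed block header: a line that starts with a word-like name followed by a colon.
HEADER = re.compile(r"^([A-Za-z][\w ]*):(.*)$")


def split_blocks(prompt: str) -> list[tuple[str, list[str]]]:
    """Split a prompt into (field name, body lines) blocks."""
    blocks: list[tuple[str, list[str]]] = []
    for line in prompt.strip().splitlines():
        match = HEADER.match(line)
        if match:
            blocks.append((match.group(1), [match.group(2).strip()]))
        elif blocks:
            blocks[-1][1].append(line)   # continuation of the current block
        else:
            blocks.append(("", [line]))  # text before the first header
    return blocks


def render(name: str, body: list[str]) -> str:
    """Turn one block back into text, keeping its field name."""
    head, *rest = body
    first = f"{name}: {head}".rstrip() if name else head
    return "\n".join([first, *rest])


def build_skeleton(prompt: str, fields_to_erase: list[str], seed: int,
                   placeholder: str = "[…]") -> str:
    """Erase the listed fields, then shuffle blocks with a seeded RNG."""
    blocks = [
        (name, [placeholder] if name in fields_to_erase else body)
        for name, body in split_blocks(prompt)
    ]
    random.Random(seed).shuffle(blocks)  # same seed gives the same order
    return "\n".join(render(name, body) for name, body in blocks)


PROMPT = """Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""

skeletons = [build_skeleton(PROMPT, ["Evidence"], seed=k) for k in range(6)]
```

Each skeleton keeps the Task, Question, and Constraints blocks intact and replaces only the evidence body, so the model sees the same roles with the content removed.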
Example with Multiple Providers
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import AnthropicBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
prompt = """Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""
item = OpenAIItem(
    prompt=prompt, 
    n_samples=5, 
    m=6, 
    fields_to_erase=["Evidence"], 
    skeleton_policy="auto"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05, isr_threshold=1.0)

- Prompt has no evidence
- Skeletons apply semantic masking of:
  - Multi-word proper nouns (e.g., "James Peebles" → "[…]")
  - Years (e.g., "2019" → "[…]")
  - Numbers (e.g., "3.14" → "[…]")
  - Quoted spans (e.g., '"Nobel Prize"' → "[…]")
- Masking strengths: Progressive levels (0.25, 0.35, 0.5, 0.65, 0.8, 0.9) across skeleton ensemble
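
The masking rules above could be approximated with regular expressions, as in the sketch below. Everything here is hypothetical and simplified: the patterns are rough stand-ins for the toolkit's detectors, and a masking strength is assumed to mean the probability that each detected span is replaced.

```python
import random
import re

# Rough span detectors, tried in this order at each position.
SPAN = re.compile(
    r'"[^"]+"'                            # quoted spans
    r"|\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"  # multi-word proper nouns
    r"|\b(?:1[5-9]|20)\d{2}\b"            # years from 1500 to 2099
    r"|\b\d+(?:\.\d+)?\b"                 # other numbers
)

STRENGTHS = (0.25, 0.35, 0.5, 0.65, 0.8, 0.9)


def mask(prompt: str, strength: float, seed: int, placeholder: str = "[…]") -> str:
    """Replace each detected span with the placeholder with probability `strength`."""
    rng = random.Random(seed)  # seeded so a skeleton is reproducible
    return SPAN.sub(
        lambda m: placeholder if rng.random() < strength else m.group(0),
        prompt,
    )


def skeleton_ensemble(prompt: str, seed: int = 0) -> list[str]:
    """One skeleton per masking strength, from lightly to heavily masked."""
    return [mask(prompt, s, seed + k) for k, s in enumerate(STRENGTHS)]


# Full-strength masking removes the year and the proper noun:
print(mask("Who won the 2019 Nobel Prize in Physics?", 1.0, seed=0))
# → Who won the […] […] in Physics?
```

Single capitalized words such as "Physics" are left alone in this sketch because only multi-word proper nouns are targeted.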
Example with Multiple Providers
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import OllamaBackend
backend = OllamaBackend(model="mixtral:8x7b-instruct")
item = OpenAIItem(
    prompt="Who won the 2019 Nobel Prize in Physics?",
    n_samples=7,
    m=6,
    skeleton_policy="closed_book"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05)

- OpenAIBackend(model, api_key=None) – Original OpenAI wrapper
- AnthropicBackend(model, api_key=None) – Anthropic Claude adapter
- HuggingFaceBackend(mode, model_id, ...) – Hugging Face adapter (3 modes)
- OllamaBackend(model, host) – Ollama local model adapter
- OpenAIItem(prompt, n_samples=5, m=6, fields_to_erase=None, skeleton_policy="auto") – One evaluation item
- OpenAIPlanner(backend, temperature=0.5, q_floor=None) – Runs evaluation (works with any backend):
  - run(items, h_star, isr_threshold, margin_extra_bits, B_clip=12.0, clip_mode="one-sided") -> List[ItemMetrics]
  - aggregate(items, metrics, alpha=0.05, h_star, ...) -> AggregateReport
- make_sla_certificate(report, model_name) – Creates formal SLA certificate
- save_sla_certificate_json(cert, path) – Exports certificate for audit
- generate_answer_if_allowed(backend, item, metric) – Only emits answer if decision was ANSWER
Every ItemMetrics includes:
- delta_bar: Information budget (nats)
- q_conservative: Worst-case prior $q_{\text{lo}}$
- q_avg: Average prior $\bar{q}$
- b2t: Bits-to-Trust requirement
- isr: Information Sufficiency Ratio
- roh_bound: EDFL hallucination risk bound
- decision_answer: Boolean decision
- rationale: Human-readable explanation
- meta: Dict with q_list, S_list_y, P_y, closed_book, etc.
On a labeled validation set:
- Sweep the margin parameter from 0 to 1 nats
- For each margin, compute:
  - Empirical hallucination rate among answered items
  - Wilson upper bound at 95% confidence
- Select the smallest margin where the Wilson upper bound ≤ target $h^*$ (e.g., 5%)
- Freeze policy: $(h^*, \tau, \text{margin}, B, \text{clip_mode}, m, r, \text{skeleton_policy})$
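
The margin sweep can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolkit's calibration code: `wilson_upper` and `select_margin` are hypothetical names, the 95% level is taken as the two-sided z = 1.96, each validation record is assumed to be a `(delta_bar, b2t, hallucinated)` tuple, and only the margin condition of the gate is applied.

```python
import math


def wilson_upper(errors: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for an error rate errors/n."""
    if n == 0:
        return 1.0  # no answered items: nothing can be certified
    p = errors / n
    denom = 1.0 + z * z / n
    center = p + z * z / (2.0 * n)
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (center + half) / denom


def select_margin(records, h_star: float, margins):
    """Smallest margin whose Wilson upper bound on answered items is <= h_star.

    records: iterable of (delta_bar, b2t, hallucinated) per labeled item.
    Returns None if no candidate margin meets the target.
    """
    records = list(records)
    for margin in sorted(margins):
        answered = [bad for delta, b2t, bad in records if delta >= b2t + margin]
        if wilson_upper(sum(answered), len(answered)) <= h_star:
            return margin
    return None


# Toy validation set: 100 well-supported correct items, 20 borderline wrong ones.
toy = [(3.0, 2.0, False)] * 100 + [(2.1, 2.0, True)] * 20
chosen = select_margin(toy, h_star=0.05, margins=[0.0, 0.2, 0.5])
```

On the toy data a margin of 0 lets the borderline items through and fails the target, while 0.2 filters them out, so `chosen` is 0.2.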
The toolkit provides comprehensive metrics:
- Answer/abstention rates
- Empirical hallucination rate + Wilson bound
- Distribution of per-item EDFL RoH bounds
- Worst-case and median risk bounds
- Complete audit trail
| Provider | Best For | Considerations | 
|---|---|---|
| OpenAI | Production deployment, consistent JSON | Requires API key, costs per token | 
| Anthropic | High-quality reasoning, safety-critical | Requires API key, may have rate limits | 
| OpenRouter | Multi-model testing, cost optimization | Single API for 100+ models, automatic fallbacks | 
| HuggingFace (Local) | Full control, no API costs | Requires GPU, setup complexity | 
| HuggingFace (TGI) | Team deployments, caching | Requires server setup | 
| HuggingFace (API) | Quick prototyping | Rate limits, requires HF token | 
| Ollama | Local experimentation | Easy setup, model quality varies | 
| Provider | Latency per Item | Cost | Setup Complexity | 
|---|---|---|---|
| OpenAI | 2-5 seconds | ~$0.01-0.03 | Low | 
| Anthropic | 3-6 seconds | ~$0.02-0.05 | Low | 
| HF Local | 1-10 seconds | Free (GPU cost) | Medium-High | 
| HF TGI | 1-3 seconds | Server costs | High | 
| HF API | 3-8 seconds | Free tier/paid | Low | 
| Ollama | 2-15 seconds | Free (local) | Low | 
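- If a model does not return clean JSON decisions: use instruction-tuned model variants (e.g., -Instruct suffixes)
- If results differ across providers: this is expected. Models have different knowledge and calibration, and the framework adapts accordingly
- If requests time out: increase the request_timeout parameter or reduce the batch size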
.
├── app/                    # Application entry points
│   ├── web/web_app.py     # Streamlit UI
│   ├── cli/frontend.py    # Interactive CLI
│   ├── examples/          # Example scripts
│   └── launcher/entry.py  # Unified launcher
├── hallbayes/             # Core modules
│   ├── hallucination_toolkit.py  # Main toolkit
│   ├── htk_backends.py          # Universal backend adapters
│   └── build_offline_backend.sh
├── electron/              # Desktop wrapper
├── launch/                # Platform launchers
├── release/              # Packaged artifacts
├── bin/                  # Offline backend binary
├── requirements.txt
├── pyproject.toml
└── README.md
from hallbayes import OpenAIPlanner, OpenAIItem, make_sla_certificate, save_sla_certificate_json
from hallbayes.htk_backends import AnthropicBackend  # or any other backend
# Choose your provider
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
# Configure and run
items = [OpenAIItem(prompt="...", n_samples=7, m=6)]
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(items, h_star=0.05)
# Generate SLA certificate
report = planner.aggregate(items, metrics)
cert = make_sla_certificate(report, model_name="Claude-3.5-Sonnet")
save_sla_certificate_json(cert, "sla.json")

streamlit run app/web/web_app.py

from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, OllamaBackend
import json
# Load prompts
with open("prompts.json") as f:
    prompts = json.load(f)
# Setup providers
providers = {
    "claude": AnthropicBackend(model="claude-3-5-sonnet-latest"),
    "llama": OllamaBackend(model="llama3.1:8b-instruct")
}
# Process with each provider
results = {}
for name, backend in providers.items():
    planner = OpenAIPlanner(backend, temperature=0.3)
    items = [OpenAIItem(prompt=p, n_samples=5, m=6) for p in prompts]
    metrics = planner.run(items, h_star=0.05)
    results[name] = planner.aggregate(items, metrics)

If you're already using the toolkit with OpenAI, here's how to try other providers:

# Original (OpenAI only)
from hallucination_toolkit import OpenAIBackend, OpenAIPlanner
backend = OpenAIBackend(model="gpt-4o-mini")
# New (Any provider) - just change these two lines:
from htk_backends import AnthropicBackend  # or HuggingFaceBackend, OllamaBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
# Everything else stays exactly the same!
planner = OpenAIPlanner(backend, temperature=0.3)
# ... rest of your code unchanged

Based on the paper: Predictable Compression Failures: Why Language Models Actually Hallucinate - https://arxiv.org/abs/2509.11208
This project is licensed under the MIT License – see the LICENSE file for details.
Developed by Hassana Labs (https://hassana.io).