Fathom v19 / styxx v3.9.1: Cross-Dataset Validated Hallucination Prevention via the Trust Layer
Description
Headline: v3.9.1 is the cross-dataset validated correction of v3.9.0. We caught our own overfitting in public and shipped the fix the same day.
What v3.9.0 claimed: AUC 0.9012 on HaluEval-QA.
What cross-dataset validation revealed: v3.9.0 collapsed to AUC 0.56-0.63 on HaluEval-Dialog, HaluEval-Summarization, and TruthfulQA. The 0.90 was a single-benchmark overfit.
What v3.9.1 ships: five new response-novelty signals (content_novelty, entity_novelty, number_novelty, bigram_novelty, trigram_novelty) that ask what the response ADDED that the reference doesn't support, plus a pooled logistic regression refit on all four datasets combined (n=800 train, n=400 held-out test, seed 31, L2=0.05, 8 features).
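To make the idea of a response-novelty signal concrete, here is a minimal sketch that scores each signal as the fraction of response items (words, capitalized tokens, numbers, n-grams) that never appear in the reference. The exact definitions used inside styxx are not given in this description, so the tokenization, the capitalized-word stand-in for entities, and the helper names below are all assumptions made for illustration.

```python
import re

def _words(text):
    # Lowercased word tokens; a deliberately simple tokenizer (assumption).
    return re.findall(r"\w+", text.lower())

def _numbers(text):
    # Integer or decimal literals as strings.
    return re.findall(r"\d+(?:\.\d+)?", text)

def _entities(text):
    # Crude stand-in for named entities: capitalized words (assumption, not NER).
    return re.findall(r"\b[A-Z][a-z]+\b", text)

def _ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _novel_fraction(response_items, reference_items):
    # Share of distinct response items that are absent from the reference.
    response_items = set(response_items)
    if not response_items:
        return 0.0
    return len(response_items - set(reference_items)) / len(response_items)

def novelty_signals(response, reference):
    resp, ref = _words(response), _words(reference)
    return {
        "content_novelty": _novel_fraction(resp, ref),
        "entity_novelty": _novel_fraction(_entities(response), _entities(reference)),
        "number_novelty": _novel_fraction(_numbers(response), _numbers(reference)),
        "bigram_novelty": _novel_fraction(_ngrams(resp, 2), _ngrams(ref, 2)),
        "trigram_novelty": _novel_fraction(_ngrams(resp, 3), _ngrams(ref, 3)),
    }

reference = "Paris is the capital of France and has 2 million residents."
response = "Paris is the capital of France and has 9 million residents."
signals = novelty_signals(response, reference)
```

In this toy pair only the number changes, so number_novelty is maximal while entity_novelty stays at zero. A faithful summary that paraphrases its source would also score high on the n-gram signals, which is the limitation noted for dialog and summarization.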
Honest cross-dataset held-out AUC:
- HaluEval-QA: 1.0000 (was 0.9049)
- TruthfulQA: 0.9767 (was 0.6261)
- HaluEval-Summarization: 0.5954 (was 0.5897)
- HaluEval-Dialog: 0.6014 (was 0.5984)
- mean: 0.7934 (was 0.6548)
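For readers who want to reproduce the summary row, the sketch below shows one standard way to compute AUC (the probability that a hallucinated response is scored above a faithful one, with ties counted as half) and then takes the unweighted mean of the four v3.9.1 values listed above. The function name is illustrative and the evaluation code shipped in the bundle may differ.

```python
def roc_auc(labels, scores):
    # Pairwise (Mann-Whitney) formulation of AUC.
    # labels: 1 = hallucinated, 0 = faithful; scores: higher = more suspicious.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Held-out AUC values for v3.9.1 as reported in the list above.
reported_auc = {
    "HaluEval-QA": 1.0000,
    "TruthfulQA": 0.9767,
    "HaluEval-Summarization": 0.5954,
    "HaluEval-Dialog": 0.6014,
}
mean_auc = sum(reported_auc.values()) / len(reported_auc)
print(f"mean held-out AUC: {mean_auc:.4f}")
```

The pairwise form is quadratic in the number of examples, which is fine for a held-out set of a few hundred items.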
Honest limits: Dialog and summarization remain at AUC ~0.60. The fundamental issue is that faithful dialog/summary responses naturally add content not verbatim in the reference, so pure-novelty signals can't discriminate. True cross-dataset generalization needs NLI-style entailment. That is v4.0.
What survives: on the two largest hallucination-detection QA benchmarks (HaluEval-QA, TruthfulQA), styxx.guardrail reaches AUC 1.00 and 0.98 respectively — above every published baseline we have compared against (SelfCheckGPT 0.71-0.79, KnowHalu 0.74, HaluCheck 0.82). This is a real and defensible claim, narrower than "solves hallucination" but substantiated.
API: unchanged from v3.9.0. from styxx import trust followed by @trust on any LLM-calling function. Zero config. Shape-preserving. Sync and async. Four halt policies.
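As an illustration of what "shape-preserving, sync and async" can mean for a decorator, the sketch below defines a hypothetical stand-in called trust_like. It is not the styxx implementation: the risk_score callback, the threshold, the HaltedResponse exception, and the single raise-on-halt behavior are all assumptions chosen to keep the example self-contained.

```python
import asyncio
import functools
import inspect

# With the package installed, the documented usage is simply:
#     from styxx import trust
#     @trust
#     def ask(prompt): ...

class HaltedResponse(RuntimeError):
    """Raised by this sketch when a response is judged too risky (hypothetical)."""

def trust_like(func=None, *, risk_score=lambda response: 0.0, threshold=0.5):
    # Hypothetical stand-in: works bare (@trust_like) or with keyword options.
    def decorate(fn):
        def check(result):
            if risk_score(result) >= threshold:
                raise HaltedResponse("response halted by risk check")
            return result  # returned unchanged, so the caller sees the same shape

        if inspect.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                return check(await fn(*args, **kwargs))
            return async_wrapper

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return check(fn(*args, **kwargs))
        return wrapper

    return decorate(func) if func is not None else decorate

@trust_like
def ask(prompt):
    return {"text": "echo: " + prompt}

@trust_like
async def ask_async(prompt):
    return "echo: " + prompt

@trust_like(risk_score=lambda response: 1.0)
def always_flagged(prompt):
    return "unsupported claim"
```

Calling ask("hi") returns the original dict untouched, asyncio.run(ask_async("hi")) returns the original string, and always_flagged("hi") raises HaltedResponse in this sketch.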
Tests: 11 new tests for response-novelty signals. Full suite: 573 pass, 1 skip, 0 fail.
Installation: pip install styxx==3.9.1
Bundled files:
- styxx-v3.9.1-zenodo-osf-bundle.zip: wheel + sdist + README + CHANGELOG + LICENSE + trust_demo.py + cross-dataset result JSONs + paper PDF
- cross_dataset_benchmark.json: raw benchmark output (v3.9.0 weights on 4 datasets)
- cross_dataset_calibration.json: v3.9.1 pooled LR weights and per-dataset held-out AUC
- fathom-paper-3-guardrail.pdf: paper (will be revised for v4.0 with cross-dataset update)
The meta-move: we are the lab that catches its own overfitting in public and ships the fix the same day. Credibility over hype.
License: CC-BY-4.0 (this deposit, data, paper) / MIT (styxx code).
Repository: github.com/fathom-lab/styxx (tag v3.9.1). Package: pypi.org/project/styxx/3.9.1.
Predecessor: v18 (10.5281/zenodo.19702107) — retained in the record for historical accuracy; the v18 description is correct for v3.9.0's HaluEval-QA-only claim but does not reflect the cross-dataset validation we ran afterward.
Additional details
Related works
- Is documented by
- Other: https://osf.io/wtkzg/ (URL)
- Is new version of
- Working paper: 10.5281/zenodo.19702107 (DOI)
- Is supplemented by
- Software: https://github.com/fathom-lab/styxx (URL)
- Software: https://pypi.org/project/styxx/3.9.1/ (URL)