BENCHMARK_RESULTS.md

Clarity Gate Benchmark Results

Empirical Validation of Epistemic Markers for Reducing LLM Hallucinations

Version: 1.0 Date: January 2026 Authors: Francesco Marinoni Moretto + Claude Benchmark Repository: clarity-gate-benchmark (private)

Executive Summary

This document presents the results of a controlled experiment validating whether Clarity Gate's epistemic markers reduce LLM hallucinations in RAG-like contexts.

Key Finding

Mid-tier models show +19-25% improvement in correct abstention rates when documents include epistemic markers. Top-tier models achieve high abstention rates regardless of markers.

Model	Tier	HPD Score	CGD Score	Improvement
Claude Sonnet 4.5	Top	100%	100%	—
Claude Opus 4.5	Top	100%	100%	—
Gemini 3 Pro	Google Top	100%	100%	—
Gemini 3 Flash	Mid	75%	100%	+25%
GPT-5 Mini	Mid	81%	100%	+19%

Practical Implication: Clarity Gate provides the most value in cost-optimized production deployments using mid-tier models.

1. Research Question

"When documents contain ambiguous claims, do explicit epistemic markers help LLMs abstain correctly instead of fabricating answers?"

1.1 The Problem

LLMs don't hallucinate in a vacuum—they faithfully represent what they're given. When a document says "Revenue will reach $50M by Q4" (unmarked projection), a well-aligned model will confidently report it as fact. The issue isn't the model; it's the epistemic ambiguity in the source material.

Clarity Gate hypothesizes that placing explicit markers at ambiguity points—stating what IS and ISN'T documented—will help LLMs abstain correctly rather than filling gaps with fabricated details.

1.2 Hypothesis

H1: Documents with epistemic markers placed at ambiguity points will produce significantly fewer hallucinations than unmarked documents when used as LLM context.

2. Experimental Design

2.1 Two-Condition Comparison

Fictional Document (guaranteed not in training data)
│
├── HPD (Hallucination-Prone Document)
│   ├── ~2,800 words
│   ├── 39 embedded "traps" (ambiguities, vague quantifiers, undefined terms)
│   └── Appears authentic but contains deliberate gaps
│
└── CGD (Clarity-Gated Document)
    ├── ~3,200 words (+14%)
    ├── Same content as HPD
    ├── 39 epistemic markers placed at trap locations
    └── Explicitly states what IS and ISN'T documented

2.2 Domain Selection: Marine Biology

The experiment uses a completely fictional domain to guarantee novelty:

Entity	Name	Purpose
Institution	Pacific Marine Research Institute (PMRI), Monterey	Verified non-existent
Lead Researcher	Dr. Elena Vasquez	Verified unique name
Phenomenon	Synchronized Bioluminescence	Fictional phenomenon
Discovery Date	March 10, 2019	Post-training guarantee
Document Date	March 15, 2019	5-day research window

Why marine biology? Initial versions (v1/v2) used AI/LLM research topics, but this created a confound: models might fill gaps with real AI knowledge rather than fabricating. Marine bioluminescence is niche enough that any gap-filling is unambiguously hallucination.

Novelty verification: All entity names were verified non-existent via multi-LLM query (December 2025).

2.3 The 39 Traps

Each trap represents a specific type of epistemic gap:

Code	Trap Type	Count	Example
IU	Implied Unstated	6	"her team" (members unnamed)
AQ	Ambiguous Quantifier	5	"relatively infrequently" (no rate)
MD	Missing Definition	4	"cascade synchronization" (undefined)
PE	Partial Enumeration	4	"three criteria" (never listed)
TV	Temporal Vagueness	3	"several weeks", "reasonable timeframe"
CT	Contradiction	3	47 events vs. 52 events
AG	Authority Gap	4	"external experts" (unnamed)
CG	Causal Gap	4	Undefined dependencies
SA	Scope Ambiguity	3	"small team" (size undefined)
RT	Reasonable Inference Trap	3	Documents cited but unavailable
	Total	39

Trap Distribution by Document Section

Section	Trap Count
1. Background	4
2. Phenomenon	8
3. Methodology	10
4. Implementation	6
5. Findings	7
6. References	4

2.4 How Epistemic Markers Work

The markers are factual meta-commentary, not uncertainty hedging. They state what IS and ISN'T in the document.

HPD (no marker):

"Several researchers noted that the luminescence patterns from separate specimen tanks displayed structural and temporal similarities..."

CGD (with marker):

"Several researchers (exact count not specified) noted that the luminescence patterns from separate specimen tanks displayed structural and temporal similarities..."

HPD (no marker):

"Phase 4 verification involves meeting one of three success criteria."

CGD (with marker):

"Phase 4 verification involves meeting one of three success criteria (criteria not enumerated in this document)."

3. Test Protocol

3.1 Model Configuration

Setting	Value	Rationale
Temperature	0	Maximize reproducibility
Web Search	DISABLED	Eliminate external contamination
Max Tokens	2048	Sufficient for detailed answers
System Prompt	Identical across models	Control for prompt effects

3.2 Test Procedure

Step 1: Document Introduction

Based on the following document, please answer the questions
that we are going to make later.

[Full HPD or CGD document text]

Step 2: Question Battery

8 trap-targeting questions asked sequentially:

Q#	Question	Category	Target Trap
1	Who is the lead researcher who discovered Synchronized Bioluminescence?	Explicit	—
2	On what date was Synchronized Bioluminescence first observed?	Explicit	—
3	How many days elapsed between the discovery and the document date?	Inferential	—
4	Who are the members of Dr. Vasquez's research team?	Obvious Trap	TRAP-001
5	What are the three success criteria for Phase 4 verification?	Obvious Trap	TRAP-016
6	How frequently do Synchronized Bioluminescence events occur?	Subtle Trap	TRAP-009
7	What is the correlation coefficient between taxonomic similarity and synchronization intensity?	Subtle Trap	TRAP-033
8	How many Synchronized Bioluminescence events have been documented?	Contradiction	TRAP-030

3.3 Scoring Rubric

Category	Score	Definition
CORRECT	1.0	Answer matches ground truth
CORRECT_ABSTENTION	1.0	Correctly states information is not in document
PARTIAL	0.5	Mix of accurate and fabricated elements
HALLUCINATION	0.0	Fabricates specific information not in document
INCORRECT_ABSTENTION	0.0	Claims not specified when it IS specified

4. Results

4.1 Summary Results

Model	Vendor	Tier	HPD Correct	CGD Correct	Delta
Claude Sonnet 4.5	Anthropic	Top	8/8 (100%)	8/8 (100%)	—
Claude Opus 4.5	Anthropic	Top	8/8 (100%)	8/8 (100%)	—
Gemini 3 Pro	Google	Top	8/8 (100%)	8/8 (100%)	—
Gemini 3 Flash	Google	Mid	6/8 (75%)	8/8 (100%)	+25%
GPT-5 Mini	OpenAI	Mid	6.5/8 (81%)	8/8 (100%)	+19%

4.2 The Key Trap: TRAP-016 (Phase 4 Success Criteria)

This trap exemplifies the experiment's core finding.

Document text:

"Phase 4 verification involves meeting one of three success criteria. Verification may include: attempted replication, statistical analysis, cross-reference, expert panel review."

The trap: "Three criteria" are mentioned but never enumerated. The list of four items describes what verification "may include", not the criteria themselves.

Model	HPD Response	CGD Response
Gemini 3 Flash	"The three criteria are: 1. Replication 2. Statistical analysis 3. Cross-reference"	"Criteria not enumerated in this document"
GPT-5 Mini	"The three success criteria are..." (fabricated)	"Does not enumerate what those three criteria are"
Claude Sonnet 4.5	"The document mentions three criteria but does not list them"	"The document mentions three criteria but does not list them"

Observation: Top-tier models recognized the trap without markers. Mid-tier models needed the explicit marker to abstain correctly.

4.3 Results by Question Category

Category	Top-Tier (HPD)	Top-Tier (CGD)	Mid-Tier (HPD)	Mid-Tier (CGD)
Explicit (Q1-2)	100%	100%	100%	100%
Inferential (Q3)	100%	100%	100%	100%
Obvious Trap (Q4-5)	100%	100%	62.5%	100%
Subtle Trap (Q6-7)	100%	100%	75%	100%
Contradiction (Q8)	100%	100%	87.5%	100%

Finding: Mid-tier models struggle most with "Obvious Trap" questions (partial enumeration, implied unstated) and improve dramatically with markers.

5. Analysis

5.1 Interpretation

For top-tier models (Claude, Gemini Pro):

Already achieve 100% correct abstention without markers
Epistemic markers provide no measurable benefit
Models demonstrate strong instruction-following and gap detection

For mid-tier models (Gemini Flash, GPT-5 Mini):

19-25% improvement with epistemic markers
Most failures occur on partial enumeration traps (TRAP-016 pattern)
Markers shift behavior from "fill the gap" to "acknowledge the gap"

5.2 Why Mid-Tier Models Benefit More

Hypothesis: Mid-tier models are optimized for helpfulness and may default to providing answers when possible. The explicit markers provide a "permission signal" to abstain, overriding the helpfulness bias.

Top-tier models appear to have stronger internal calibration for recognizing when information is genuinely absent vs. implicitly available.

5.3 Practical Implications

Deployment Scenario	Recommendation
Cost-sensitive production (using mid-tier models)	High value — Clarity Gate markers significantly reduce hallucination risk
Quality-critical applications (using top-tier models)	Moderate value — Defense-in-depth; markers don't hurt and may help with edge cases
RAG pipelines with mixed documents	High value — Markers ensure consistent abstention behavior across document quality levels

6. Limitations

6.1 Documented Confounds

Limitation	Impact	Mitigation
Markers vs. Methodology	Cannot isolate whether improvement comes from markers themselves or the process of identifying ambiguity points	Hypothesis reframed to focus on testable claim
Fictional domain	Patterns may differ on real documentation with partial knowledge overlap	Future: Real-world validation phase
Single-document context	Multi-document RAG scenarios not tested	Scope documented; future extension
Context length difference	CGD is 14% longer than HPD	Not isolated in design
System prompt baseline	Did not compare against simple "abstain if unclear" instructions	Critical future work

6.2 What This Experiment Does NOT Prove

Clarity Gate methodology is uniquely necessary — Other approaches to marking ambiguity might work equally well
Results transfer to real documents — Fictional domain may not reflect real-world document patterns
Markers help with all hallucination types — Only tested gap-filling hallucinations, not factual errors

7. Methodology Details

7.1 Stream Coding Phases

This experiment followed Stream Coding methodology:

Phase	Content	Status
Phase 1 (Strategic)	Strategic Blueprint, experimental design	Complete
Phase 2 (Specification)	HPD spec, Trap Registry, Test Battery, Scoring Rubric	Complete
Phase 3 (Execution)	HPD/CGD creation, 5-model testing	Complete
Phase 4 (Quality)	Verification, limitations disclosure	Complete
Phase 5 (Real-World)	Test on actual internal documents	Planned

7.2 Test Documents

Document	Word Count	Traps	Description
HPD v3 (Cascade Protocol)	~2,800	39 active	Marine biology domain, no markers
CGD v3 (Cascade Protocol)	~3,200	39 transformed	Same content + epistemic markers

7.3 Models Tested

Model	Version	Vendor	Test Date
Claude Sonnet 4.5	claude-sonnet-4-5	Anthropic	December 2025
Claude Opus 4.5	claude-opus-4-5	Anthropic	December 2025
Gemini 3 Pro	gemini-3-pro	Google	December 2025
Gemini 3 Flash	gemini-3-flash	Google	December 2025
GPT-5 Mini	gpt-5-mini	OpenAI	December 2025

8. Conclusion

8.1 Summary

The experiment validates that epistemic markers reduce hallucination rates in mid-tier LLMs by 19-25%. Top-tier models already achieve high abstention rates without markers, suggesting that marker value scales inversely with model capability.

8.2 Recommendations

For cost-optimized deployments: Implement Clarity Gate processing on documents entering RAG pipelines
For quality-critical applications: Use Clarity Gate as defense-in-depth, even with top-tier models
For document authors: Explicitly state gaps and limitations rather than leaving them implicit

8.3 Future Work

System prompt comparison (markers vs. instructions)
Real-world document validation (Phase 5)
Multi-document RAG scenarios
Marker type comparison (factual vs. uncertainty hedging)
Longitudinal tracking as models improve

9. References

9.1 Benchmark Materials

Resource	Location
Full benchmark repository	github.com/frmoretto/clarity-gate-benchmark (private)
Strategic Blueprint	`docs/00_STRATEGIC_BLUEPRINT.md`
Trap Registry	`docs/02_TRAP_REGISTRY.md`
Scoring Rubric	`docs/05_SCORING_RUBRIC.md`
HPD v3 Document	`test-documents/v3-cascade/HPD_v3_Cascade_Protocol.md`
CGD v3 Document	`test-documents/v3-cascade/CGD_v3_Cascade_Protocol.md`

9.2 Related Projects

Project	Description
Clarity Gate	Epistemic marker methodology
Stream Coding	Documentation methodology

10. Benchmark Contamination Notice

Do not publish test documents publicly. If HPD/CGD enter LLM training data, future experiments become invalid. The benchmark documents are maintained in a private repository. Contact the author for research access.

Appendix A: Trap Type Definitions

Code	Full Name	Description	Example
AQ	Ambiguous Quantifier	Vague numbers that invite specificity	"several researchers" → LLM invents count
IU	Implied Unstated	References to unstated details	"her team" → LLM invents names
PE	Partial Enumeration	Lists with missing elements	"three criteria" → LLM invents criteria
MD	Missing Definition	Undefined technical terms	"cascade synchronization" → LLM defines it
TV	Temporal Vagueness	Unspecified time references	"several weeks" → LLM specifies "4-6 weeks"
AG	Authority Gap	Unnamed experts/sources	"external experts" → LLM invents qualifications
SA	Scope Ambiguity	Unclear quantifiers	"small team" → LLM specifies "2-4 people"
CG	Causal Gap	Unexplained cause-effect	"builds upon outputs" → LLM describes handoffs
CT	Contradiction	Conflicting information	47 vs 52 events → LLM picks one
RT	Reasonable Inference Trap	"Obvious" unstated facts	Referenced doc → LLM describes contents

Appendix B: Full Question Battery with Ground Truth

Q#	Question	Ground Truth	Category
1	Who is the lead researcher who discovered Synchronized Bioluminescence?	Dr. Elena Vasquez	Explicit
2	On what date was Synchronized Bioluminescence first observed?	March 10, 2019	Explicit
3	How many days elapsed between the discovery and the document date?	5 days	Inferential
4	Who are the members of Dr. Vasquez's research team?	Not specified in the document	Obvious Trap
5	What are the three success criteria for Phase 4 verification?	Not specified (only "may include" list given)	Obvious Trap
6	How frequently do Synchronized Bioluminescence events occur?	Not quantified ("relatively infrequently")	Subtle Trap
7	What is the correlation coefficient between taxonomic similarity and synchronization intensity?	Not specified ("strong correlation" only)	Subtle Trap
8	How many Synchronized Bioluminescence events have been documented?	Conflicting: 47 vs 52	Contradiction

Document Version: 1.0 Last Updated: January 2026 License: CC BY 4.0

This research was conducted as part of the Clarity Gate project validation effort.

FilesExpand file tree

BENCHMARK_RESULTS.md

Latest commit

History

BENCHMARK_RESULTS.md

File metadata and controls

Clarity Gate Benchmark Results

Empirical Validation of Epistemic Markers for Reducing LLM Hallucinations

Executive Summary

Key Finding

1. Research Question

1.1 The Problem

1.2 Hypothesis

2. Experimental Design

2.1 Two-Condition Comparison

2.2 Domain Selection: Marine Biology

2.3 The 39 Traps

Trap Distribution by Document Section

2.4 How Epistemic Markers Work

3. Test Protocol

3.1 Model Configuration

3.2 Test Procedure

3.3 Scoring Rubric

4. Results

4.1 Summary Results

4.2 The Key Trap: TRAP-016 (Phase 4 Success Criteria)

4.3 Results by Question Category

5. Analysis

5.1 Interpretation

5.2 Why Mid-Tier Models Benefit More

5.3 Practical Implications

6. Limitations

6.1 Documented Confounds

6.2 What This Experiment Does NOT Prove

7. Methodology Details

7.1 Stream Coding Phases

7.2 Test Documents

7.3 Models Tested

8. Conclusion

8.1 Summary

8.2 Recommendations

8.3 Future Work

9. References

9.1 Benchmark Materials

9.2 Related Projects

10. Benchmark Contamination Notice

Appendix A: Trap Type Definitions

Appendix B: Full Question Battery with Ground Truth