Codestin Search App

661 lines (544 loc) · 26 KB
Faithfulness Verifier (FV) — Layer 6b of VORTEXRAG
Solves: HALLUCINATION — ensures the generated answer is grounded in W*.
Mathematical Foundation:
  After generation, the FV verifies that the LLM's answer is GROUNDED in the
  context window W* and has not hallucinated. It combines two complementary
  DELTA-R SCORE (Hallucination Score):
    ΔR(answer, W*) = 1 − ROUGE-L(answer, W*) × NLI_entailment(answer, W*)
      ROUGE-L(a, W*)          — lexical fidelity: does the answer use words/phrases
                                that actually appear in W*? Uses Longest Common
                                Subsequence (LCS) for robustness to paraphrasing.
      NLI_entailment(a, W*)   — logical grounding: does W* logically entail the
                                answer? Uses DeBERTa-v3 cross-encoder NLI when
                                available; falls back to ROUGE-L approximation.
      Product ROUGE × NLI     — JOINT faithfulness criterion: BOTH signals must
                                be strong. High ROUGE alone or high NLI alone is
                                insufficient.
  ACCEPTANCE CONDITION:
    Answer is ACCEPTED iff ΔR(answer, W*) ≤ δ_FV   (default: 0.15)
  REGENERATION LOOP:
    if ΔR > δ_FV:
        re-rank context by Φ̃ (different top-m selection)
        regenerate (max 3 iterations)
    Return the answer with lowest ΔR across all iterations.
WHY ROUGE-L × NLI (MULTIPLICATIVE)?
  ROUGE-L alone (no NLI):
    An answer that COPIES phrases from W* but contradicts their meaning.
    Example: W* says "X causes Y"; answer says "Y causes X" (word overlap=high,
    but meaning is inverted). ROUGE-L ≈ 0.85, NLI ≈ 0.05 → product=0.043 → REJECTED ✓
  NLI alone (no ROUGE-L):
    An answer that is logically consistent with W* but uses invented vocabulary.
    Example: W* uses "myocardial infarction"; answer says "heart crash" (correct
    meaning but fabricated term). ROUGE-L ≈ 0.12, NLI ≈ 0.90 → product=0.108 → REJECTED ✓
  Both high (ROUGE × NLI):
    Answer that accurately paraphrases W*. ROUGE-L ≈ 0.85, NLI ≈ 0.90 →
    product=0.765 → ΔR=0.235 → borderline.
    ROUGE-L=0.92, NLI=0.94 → product=0.865 → ΔR=0.135 ≤ 0.15 → ACCEPTED ✓
  This multiplicative gating requires BOTH lexical fidelity AND logical
  entailment — no "rescue" effect where one high score compensates for a low one.
WHY ROUGE-L NOT ROUGE-1/2?
  ROUGE-L uses Longest Common Subsequence (LCS). LCS is robust to:
    - Paraphrasing (different word order, same meaning)
    - Moderate insertions/deletions within phrases
  ROUGE-1/2 would penalize legitimate paraphrases as hallucinations.
  ROUGE-L correctly identifies them as faithful restatements.
WHY MAX 3 ITERATIONS?
  Empirically: ~94% of fixable hallucinations are resolved within 2 iterations.
  Beyond 3 iterations, additional generations either converge on the same answer
  (context quality issue, not generation) or degrade (temperature effects).
  The 3-iteration cap: (1) controls latency, (2) detects when the problem is
  retrieval quality rather than generation, (3) matches empirical convergence data.
SENTENCE-LEVEL VERIFICATION:
  The FV can operate at sentence granularity, computing per-sentence ΔR scores.
  This enables: (1) identifying which specific claims are hallucinated, (2)
  citation tracing — which context chunk supports which answer sentence, and
  (3) fine-grained regeneration (re-generate only hallucinated sentences).
IMPLEMENTATION NOTES:
  - ROUGE-L implemented from scratch (pure Python + NumPy): no external deps.
  - ROUGE-1 and ROUGE-2 implemented for completeness (n-gram overlap).
  - NLI requires optional `sentence-transformers` with a CrossEncoder model.
  - All ROUGE metrics return F1 scores (harmonic mean of precision and recall).
from __future__ import annotations
import numpy as np
from dataclasses import dataclass
from typing import Callable, NamedTuple
    from sentence_transformers import CrossEncoder  # type: ignore[import-untyped]
    CROSS_ENCODER_AVAILABLE = True
except ImportError:
    CrossEncoder = None  # type: ignore[assignment,misc]
    CROSS_ENCODER_AVAILABLE = False
from .ccb import OrderedContextSlot
class FVConfig:
    Configuration for Faithfulness Verifier.
    delta_fv: ΔR acceptance threshold. Lower = stricter faithfulness requirement.
      δ=0.10: very strict (safety-critical applications)
      δ=0.15: default (research-grade faithfulness)
      δ=0.25: lenient (exploratory generation)
      δ=0.40: permissive (creative applications)
    use_nli: set True when a CrossEncoder NLI model is available. Falls back
    to ROUGE-L as a NLI proxy when False.
    nli_model: CrossEncoder model name for NLI entailment scoring.
    DeBERTa-v3 small is recommended for latency; large for accuracy.
    delta_fv: float = 0.15
    max_iterations: int = 3
    nli_model: str = "cross-encoder/nli-deberta-v3-small"
    use_nli: bool = False
class FVResult(NamedTuple):
    Result of faithfulness verification for a single answer.
    All intermediate scores are preserved for audit and analysis.
    answer:      str
    delta_r:     float    # ΔR hallucination score ∈ [0, 1]; lower = better
    rouge_l:     float    # ROUGE-L lexical fidelity ∈ [0, 1]
    rouge_1:     float    # ROUGE-1 unigram overlap ∈ [0, 1]
    rouge_2:     float    # ROUGE-2 bigram overlap ∈ [0, 1]
    nli_score:   float    # NLI entailment score ∈ [0, 1]
    grounding:   float    # 1 − ΔR = ROUGE-L × NLI ∈ [0, 1]
    accepted:    bool     # ΔR ≤ δ_FV
    iteration:   int      # which regeneration attempt produced this answer
class SentenceFVResult(NamedTuple):
    """Per-sentence faithfulness verification result."""
    sentence:    str
    delta_r:     float
    rouge_l:     float
    nli_score:   float
    accepted:    bool
    best_citation: str   # [C1], [C2], etc. — most supporting context chunk
class FaithfulnessVerifier:
    Post-generation hallucination detector and regeneration controller.
    The FV is the final quality gate in VORTEXRAG. It measures whether the
    LLM's answer is grounded in the retrieved context W* using the ΔR metric:
    a joint lexical-logical faithfulness score that requires both ROUGE-L
    (words actually appear in W*) AND NLI entailment (answer is logically
    supported by W*).
    The FV closes the VORTEXRAG feedback loop: if an answer fails ΔR ≤ δ_FV,
    it triggers regeneration (with different context selection), giving the
    system up to max_iterations attempts to produce a faithful answer.
    Key design decision: return the BEST answer across all iterations (lowest
    ΔR), not the last one. This ensures the answer quality can only improve
    across iterations, never degrade.
        fv = FaithfulnessVerifier()
        result = fv.verify(answer, context_string)
        print(f"ΔR={result.delta_r:.4f}, Accepted={result.accepted}")
        # Full verification with retry
        result = fv.verify_with_retry(context, generate_fn)
        # Per-sentence analysis
        sentence_results = fv.sentence_level_verify(answer, context)
        # Citation tracing
        citations = fv.citation_trace(answer, ordered_slots)
    def __init__(self, config: FVConfig | None = None):
        self.config = config or FVConfig()
        self._nli_model = None
        if self.config.use_nli and CROSS_ENCODER_AVAILABLE:
            self._load_nli()
    def _load_nli(self) -> None:
        """Load the CrossEncoder NLI model."""
        try:
            self._nli_model = CrossEncoder(self.config.nli_model)
        except Exception:
            self._nli_model = None
    # ──── Tokenization ────────────────────────────────────────────────────────
    def _tokenize(self, text: str) -> list[str]:
        """
        Lowercase word tokenizer using regex word boundary matching.
        Extracts only alphabetic/alphanumeric tokens, strips punctuation.
        This matches the standard ROUGE tokenization behavior.
        """
        return re.findall(r'\b\w+\b', text.lower())
    def _split_sentences(self, text: str) -> list[str]:
        """
        Split text into sentences using regex-based boundary detection.
        Handles common sentence endings (. ! ?) with uppercase continuation.
        Falls back to returning the full text as one sentence for very short inputs.
        """
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
        return [s.strip() for s in sentences if s.strip()]
    def _ngrams(self, tokens: list[str], n: int) -> list[tuple[str, ...]]:
        """Extract n-grams from a token list."""
        return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    # ──── ROUGE Metrics ───────────────────────────────────────────────────────
    def _lcs_length(self, a: list[str], b: list[str]) -> int:
        """
        Compute LCS length via space-optimized dynamic programming.
        Time: O(|a|·|b|)   Space: O(min(|a|,|b|))
        The space optimization reduces memory from O(m·n) to O(n) by
        only keeping the previous and current DP rows.
        """
        m, n = len(a), len(b)
        if m == 0 or n == 0:
            return 0
        # Ensure a is the longer sequence for space optimization
        if m < n:
            a, b = b, a
            m, n = n, m
        prev = [0] * (n + 1)
        curr = [0] * (n + 1)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i-1] == b[j-1]:
                    curr[j] = prev[j-1] + 1
                else:
                    curr[j] = max(curr[j-1], prev[j])
            prev, curr = curr, [0] * (n + 1)
        return prev[n]
    def rouge_l(self, hypothesis: str, reference: str) -> float:
        """
        Compute ROUGE-L F1 between hypothesis (answer) and reference (context).
        ROUGE-L F1:
          P_lcs = LCS(hyp, ref) / |hyp|    — precision: fraction of answer in context
          R_lcs = LCS(hyp, ref) / |ref|    — recall: fraction of context covered
          F_lcs = (2 · P · R) / (P + R + ε)
        ROUGE-L is the primary faithfulness signal in ΔR because:
          1. LCS is position-independent (handles paraphrasing)
          2. LCS penalizes invented facts that don't appear in context
          3. ROUGE-L F1 balances precision (no hallucination) and recall (coverage)
        """
        hyp = self._tokenize(hypothesis)
        ref = self._tokenize(reference)
        if not hyp or not ref:
            return 0.0
        lcs = self._lcs_length(hyp, ref)
        p = lcs / len(hyp)
        r = lcs / len(ref)
        if p + r < 1e-10:
            return 0.0
        return float(2 * p * r / (p + r))
    def rouge_n(self, hypothesis: str, reference: str, n: int = 1) -> float:
        """
        Compute ROUGE-N F1 (n-gram overlap) between hypothesis and reference.
        ROUGE-1 (n=1): unigram overlap — sensitive to vocabulary coverage.
        ROUGE-2 (n=2): bigram overlap — sensitive to phrase faithfulness.
        ROUGE-N is not used in ΔR (ROUGE-L is preferred) but is available
        for extended analysis and comparison with external benchmarks.
        Formula:
          count_match(n) = Σ min(count(ngram, hyp), count(ngram, ref))
          P = count_match / |hyp_ngrams|
          R = count_match / |ref_ngrams|
          F = (2·P·R) / (P+R)
        """
        hyp_tokens = self._tokenize(hypothesis)
        ref_tokens = self._tokenize(reference)
        if not hyp_tokens or not ref_tokens:
            return 0.0
        hyp_ngrams = self._ngrams(hyp_tokens, n)
        ref_ngrams = self._ngrams(ref_tokens, n)
        if not hyp_ngrams or not ref_ngrams:
            return 0.0
        # Count n-gram frequencies
        from collections import Counter
        hyp_count = Counter(hyp_ngrams)
        ref_count = Counter(ref_ngrams)
        # Matched n-gram count
        match_count = sum(
            min(hyp_count[ng], ref_count[ng]) for ng in hyp_count
        p = match_count / len(hyp_ngrams)
        r = match_count / len(ref_ngrams)
        if p + r < 1e-10:
            return 0.0
        return float(2 * p * r / (p + r))
    def all_rouge(self, hypothesis: str, reference: str) -> dict[str, float]:
        """
        Compute ROUGE-1, ROUGE-2, and ROUGE-L simultaneously.
        Returns {'rouge_1': float, 'rouge_2': float, 'rouge_l': float}.
        Efficient: tokenizes once and shares computation.
        """
        return {
            "rouge_1": self.rouge_n(hypothesis, reference, n=1),
            "rouge_2": self.rouge_n(hypothesis, reference, n=2),
            "rouge_l": self.rouge_l(hypothesis, reference),
    # ──── NLI Entailment ──────────────────────────────────────────────────────
    def nli_entailment(self, hypothesis: str, premise: str) -> float:
        """
        Compute NLI entailment score: P(premise entails hypothesis).
        Entailment direction: CONTEXT (premise) → ANSWER (hypothesis).
        We check whether the context logically supports the answer,
        not the other way around.
        With CrossEncoder NLI model (DeBERTa-v3):
          Labels: [contradiction, neutral, entailment]
          Returns softmax probability of the 'entailment' label.
        Without NLI model (fallback):
          Returns ROUGE-L(hypothesis, premise) as a weak entailment proxy.
          This approximates: "if the answer uses words from the context,
          it is likely entailed by it" — a coarse but functional heuristic.
        """
        if self._nli_model is not None:
            try:
                scores = self._nli_model.predict([(premise, hypothesis)])
                probs = np.exp(scores[0]) / (np.exp(scores[0]).sum() + 1e-10)
                return float(probs[2])  # entailment label index
            except Exception:
        # Fallback: ROUGE-L as entailment proxy
        return self.rouge_l(hypothesis, premise)
    # ──── Delta-R (Hallucination Score) ───────────────────────────────────────
    def delta_r(self, answer: str, context: str) -> tuple[float, float, float]:
        """
        Compute ΔR(answer, W*) = 1 − ROUGE-L(answer, W*) × NLI_entailment(answer, W*).
        Returns (delta_r, rouge_l_score, nli_score).
        The product ROUGE-L × NLI is the GROUNDING SCORE:
          Grounding → 1.0: perfectly faithful, both lexically and logically
          Grounding → 0.0: completely hallucinated
        ΔR = 1 − Grounding is the HALLUCINATION SCORE:
          ΔR → 0.0: perfectly grounded (accepted)
          ΔR → 1.0: completely hallucinated (rejected)
        """
        rouge = self.rouge_l(answer, context)
        nli   = self.nli_entailment(answer, context)
        grounding = rouge * nli
        return float(1.0 - grounding), float(rouge), float(nli)
    def grounding_score(self, answer: str, context: str) -> float:
        """
        Compute the grounding score: 1 − ΔR = ROUGE-L × NLI.
        Inverse of ΔR. Grounding ∈ [0, 1] where 1 = perfectly faithful.
        """
        dr, _, _ = self.delta_r(answer, context)
        return float(1.0 - dr)
    # ──── Core Verification API ───────────────────────────────────────────────
    def verify(
        self,
        answer: str,
        context: str,
        iteration: int = 1,
    ) -> FVResult:
        """
        Verify a single answer against its context string.
        Computes ΔR, ROUGE-1, ROUGE-2, ROUGE-L, NLI scores and determines
        whether the answer passes the faithfulness gate (ΔR ≤ δ_FV).
        """
        dr, rouge_l_score, nli_score = self.delta_r(answer, context)
        r1 = self.rouge_n(answer, context, n=1)
        r2 = self.rouge_n(answer, context, n=2)
        return FVResult(
            answer=answer,
            delta_r=round(dr, 4),
            rouge_l=round(rouge_l_score, 4),
            rouge_1=round(r1, 4),
            rouge_2=round(r2, 4),
            nli_score=round(nli_score, 4),
            grounding=round(1.0 - dr, 4),
            accepted=dr <= self.config.delta_fv,
            iteration=iteration,
    def verify_with_retry(
        self,
        context: str,
        generate_fn: Callable[[str, int], str],
    ) -> FVResult:
        """
        Verify with up to max_iterations regeneration attempts.
        generate_fn: callable(context_string, attempt_number) → answer_string
          Called with the same context each time. The attempt number allows
          generate_fn to vary temperature/sampling strategy across iterations.
        Strategy:
          1. Generate and verify answer
          2. If ΔR ≤ δ_FV: return immediately (success)
          3. If ΔR > δ_FV: store as best candidate, try again
          4. After max_iterations: return the best result (lowest ΔR seen)
        The "return best" strategy ensures answer quality is monotone —
        the final answer is never worse than any intermediate attempt.
        """
        best_result: FVResult | None = None
        for attempt in range(1, self.config.max_iterations + 1):
            answer = generate_fn(context, attempt)
            result = self.verify(answer, context, iteration=attempt)
            if best_result is None or result.delta_r < best_result.delta_r:
                best_result = result
            if result.accepted:
                return result
        return best_result  # type: ignore[return-value]
    # ──── Sentence-Level and Citation Analysis ────────────────────────────────
    def sentence_level_verify(
        self,
        answer: str,
        context: str,
    ) -> list[SentenceFVResult]:
        """
        Compute per-sentence ΔR scores for a multi-sentence answer.
        Splits the answer into sentences and verifies each against the full
        context string. Returns a SentenceFVResult for each sentence.
        Useful for:
          1. Identifying which specific claims are hallucinated
          2. Fine-grained feedback for targeted regeneration
          3. Sentence-level citation assignment
        A sentence with ΔR > δ_FV is flagged as a potential hallucination.
        """
        sentences = self._split_sentences(answer)
        results: list[SentenceFVResult] = []
        for sent in sentences:
            if len(sent.split()) < 3:  # skip very short fragments
                continue
            dr, rouge, nli = self.delta_r(sent, context)
            results.append(SentenceFVResult(
                sentence=sent,
                delta_r=round(dr, 4),
                rouge_l=round(rouge, 4),
                nli_score=round(nli, 4),
                accepted=dr <= self.config.delta_fv,
                best_citation="",  # populated by citation_trace if slots available
        return results
    def citation_trace(
        self,
        answer: str,
        context_slots: list[OrderedContextSlot],
    ) -> list[dict]:
        """
        Trace each answer sentence to its most supporting context chunk.
        For each sentence in the answer, computes ROUGE-L against each context
        slot and assigns the citation of the slot with the highest overlap.
        Returns a list of dicts:
            'sentence':         str   — the answer sentence
            'best_citation':    str   — [C1], [C2], etc.
            'best_rouge_l':     float — overlap with best-matching chunk
            'delta_r':          float — sentence-level ΔR
            'accepted':         bool  — whether sentence passes faithfulness
            'all_citations':    dict  — {[C1]: rouge_l, [C2]: rouge_l, ...}
        If no chunk achieves ROUGE-L > 0.1 for a sentence, it is flagged
        as potentially hallucinated (no clear source in W*).
        """
        if not context_slots:
            return []
        sentences = self._split_sentences(answer)
        results = []
        for sent in sentences:
            if len(sent.split()) < 3:
                continue
            # Build full context for ΔR
            full_context = " ".join(slot.chunk.chunk_text for slot in context_slots)
            dr, rouge_l_full, nli = self.delta_r(sent, full_context)
            # Per-chunk ROUGE-L for citation assignment
            citation_scores: dict[str, float] = {}
            for i, slot in enumerate(context_slots, start=1):
                rl = self.rouge_l(sent, slot.chunk.chunk_text)
                citation_scores[f"[C{i}]"] = round(rl, 4)
            best_citation = max(citation_scores, key=citation_scores.get)  # type: ignore[arg-type]
            best_rouge_l = citation_scores[best_citation]
            results.append({
                "sentence":      sent,
                "best_citation": best_citation,
                "best_rouge_l":  round(best_rouge_l, 4),
                "delta_r":       round(dr, 4),
                "accepted":      dr <= self.config.delta_fv,
                "all_citations": citation_scores,
        return results
    def grounding_report(self, answer: str, context: str) -> dict:
        """
        Comprehensive faithfulness report for an answer-context pair.
        Returns a full breakdown:
            'delta_r':              float — overall ΔR hallucination score
            'grounding_score':      float — 1 − ΔR
            'accepted':             bool  — ΔR ≤ δ_FV
            'rouge_l':              float — ROUGE-L F1
            'rouge_1':              float — ROUGE-1 F1
            'rouge_2':              float — ROUGE-2 F1
            'nli_score':            float — NLI entailment score
            'delta_fv_threshold':   float — δ_FV used
            'margin':               float — δ_FV − ΔR (positive = accepted)
            'n_answer_tokens':      int   — answer token count
            'n_context_tokens':     int   — context token count
            'n_sentences':          int   — number of answer sentences
            'n_hallucinated_sents': int   — sentences with ΔR > δ_FV
            'interpretation':       str   — human-readable verdict
        """
        dr, rouge, nli = self.delta_r(answer, context)
        r1 = self.rouge_n(answer, context, n=1)
        r2 = self.rouge_n(answer, context, n=2)
        sentence_results = self.sentence_level_verify(answer, context)
        n_hallucinated = sum(1 for s in sentence_results if not s.accepted)
        margin = self.config.delta_fv - dr
        if dr <= 0.05:
            interp = "Excellent faithfulness: both lexical and logical grounding are very high."
        elif dr <= self.config.delta_fv:
            interp = (
                f"Accepted: ΔR={dr:.3f} ≤ δ_FV={self.config.delta_fv}. "
                f"ROUGE-L={rouge:.3f}, NLI={nli:.3f}. Answer is grounded in W*."
        elif dr <= 0.35:
            interp = (
                f"Borderline: ΔR={dr:.3f} slightly exceeds δ_FV={self.config.delta_fv}. "
                f"ROUGE-L={rouge:.3f}, NLI={nli:.3f}. "
                f"{'NLI is the weak link.' if nli < rouge else 'ROUGE-L is the weak link.'} "
                f"Regeneration recommended."
        else:
            weak = "NLI entailment" if nli < rouge else "ROUGE-L overlap"
            interp = (
                f"Hallucination detected: ΔR={dr:.3f} >> δ_FV={self.config.delta_fv}. "
                f"Critical failure in {weak}. Answer should be rejected and regenerated. "
                f"{n_hallucinated}/{len(sentence_results)} sentences flagged."
        return {
            "delta_r":              round(dr, 4),
            "grounding_score":      round(1.0 - dr, 4),
            "accepted":             dr <= self.config.delta_fv,
            "rouge_l":              round(rouge, 4),
            "rouge_1":              round(r1, 4),
            "rouge_2":              round(r2, 4),
            "nli_score":            round(nli, 4),
            "delta_fv_threshold":   self.config.delta_fv,
            "margin":               round(margin, 4),
            "n_answer_tokens":      len(self._tokenize(answer)),
            "n_context_tokens":     len(self._tokenize(context)),
            "n_sentences":          len(sentence_results),
            "n_hallucinated_sents": n_hallucinated,
            "interpretation":       interp,
    def compare_answers(
        self,
        answers: list[str],
        context: str,
    ) -> list[FVResult]:
        """
        Verify multiple candidate answers against the same context.
        Returns FVResult list sorted by delta_r ascending (best first).
        Useful for selecting the most faithful answer from multiple LLM samples.
        """
        results = [
            self.verify(ans, context, iteration=i + 1)
            for i, ans in enumerate(answers)
        results.sort(key=lambda r: r.delta_r)
        return results
    def threshold_analysis(
        self,
        answer: str,
        context: str,
        thresholds: list[float] | None = None,
    ) -> list[dict]:
        """
        Show whether an answer would be accepted at various δ_FV thresholds.
        Returns [{threshold, accepted, margin}, ...] for each threshold.
        Useful for understanding how strict the faithfulness requirement
        would need to be to reject/accept a borderline answer.
        """
        if thresholds is None:
            thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50]
        dr, _, _ = self.delta_r(answer, context)
        return [
                "threshold": thresh,
                "accepted":  dr <= thresh,
                "margin":    round(thresh - dr, 4),
            for thresh in thresholds
Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

fv.py

Latest commit

History

fv.py

File metadata and controls