
// weekly quality runs · versioned in git

Benchmarks that commit themselves.

Every Monday at 08:00 UTC, GitHub Actions runs the full suite, commits results, and updates this page. No editorial filter, no spin. If a number drops, you see the drop. If a number improves, it's timestamped.

// last run · 2026-04-15

The weekly headline.

HALLUCINATION RATE
0.00%
Across the 70-case adversarial bench. Zero confabulated claims this run.
ADVERSARIAL PASS
97.14%
Best run. Median across 4 runs: 91%. Range: 86–97%.
P50 LATENCY
1,566ms
Agent run end-to-end, including routing, LLM, and verification. Fast path /v1/fact-check: 5ms.
// weekly trend

The bot commits. The chart updates.

Adversarial pass rate
Date        Rate
2026-04-10  81.43%
2026-04-12  97.14%
2026-04-15  91.00%
Median line. Best run landed Apr 12.
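
For readers who want to sanity-check the percentages: a minimal sketch, assuming each single-run rate is simply passed cases divided by the 70 cases in the adversarial bench. The case counts 68 and 57 are inferred from the published rates; they are not stated on this page, and `pass_rate` is a hypothetical helper, not the project's scorer.

```python
def pass_rate(passed: int, total: int = 70) -> float:
    """Return the pass rate as a percentage, rounded to two decimals.

    Assumption: rate = passed / total, with total defaulting to the
    70-case adversarial bench described on this page.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100.0 * passed / total, 2)


print(pass_rate(68))  # 97.14
print(pass_rate(57))  # 81.43
```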
Factual retrieval
Date        Rate
2026-04-10  100.00%
2026-04-12  100.00%
2026-04-15  100.00%
Factual floor holds: zero regressions across the 3 runs shown.
// four dimensions

What the suite actually runs.

01 · ADVERSARIAL
70 hand-crafted cases probing prompt injection, out-of-scope, contradiction, and multi-hop reasoning.
02 · FACTUAL
Closed-domain retrieval on a held-out corpus. Zero hallucination is the pass bar.
03 · LATENCY
p50 and p95 across 1000 requests. Fast path isolated from full LLM path.
04 · COST
Token count multiplied by provider price. Published in micro-USD for reproducibility.
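
The latency and cost dimensions above reduce to small calculations. Here is a rough sketch, not the suite's actual scorer: it assumes a nearest-rank percentile and a provider price quoted in USD per million tokens. The function names and the price unit are assumptions made for illustration.

```python
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def cost_micro_usd(tokens: int, usd_per_million_tokens: float) -> int:
    """Token count times price, in whole micro-USD.

    tokens * (price / 1e6) USD equals tokens * price micro-USD,
    so the two factors of one million cancel.
    """
    if tokens < 0 or usd_per_million_tokens < 0:
        raise ValueError("tokens and price must be non-negative")
    return round(tokens * usd_per_million_tokens)


latencies = [float(ms) for ms in range(1, 1001)]  # 1000 synthetic samples
print(percentile(latencies, 50))   # 500.0
print(percentile(latencies, 95))   # 950.0
print(cost_micro_usd(1200, 0.50))  # 600
```

Reporting cost as an integer number of micro-USD avoids tiny floating-point dollar amounts, which is one plausible reason to publish in that unit.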
// how it runs

Two GitHub Actions. One commit.

name: bench-weekly
on:
  schedule:
    - cron: "0 8 * * 1"   # Mondays 08:00 UTC
  workflow_dispatch:

permissions:
  contents: write        # the auto-commit step needs push access to the repo

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo run --release --bin quality_bench -- --suite eval
      - run: cargo run --release --bin quality_bench -- --suite hard
      - run: python -m benchmarks.publish --out landing/benchmarks-data.json
      - uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "chore(bench): weekly results ${{ github.run_number }}"
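
To see when the schedule in the workflow above fires next, the cron expression `0 8 * * 1` can be worked out by hand. A small sketch using only the standard library; `next_weekly_run` is a hypothetical helper, and GitHub may start scheduled workflows later than the nominal time under load.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime) -> datetime:
    """Next Monday 08:00 UTC strictly after `now` (cron: 0 8 * * 1).

    `now` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=8, minute=0, second=0, microsecond=0)
    days_ahead = (0 - candidate.weekday()) % 7  # Monday is weekday 0
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        # Already at or past this Monday's slot: roll to next week.
        candidate += timedelta(days=7)
    return candidate


last_run_day = datetime(2026, 4, 15, 12, 0, tzinfo=timezone.utc)
print(next_weekly_run(last_run_day).isoformat())  # 2026-04-20T08:00:00+00:00
```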
ALL RUNS PUBLIC
Every run's raw output is in benchmarks/history/ in the public repo. If you want to verify, clone it and run git log --oneline benchmarks/history/. You'll see the bot's commits. No edits.
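
The same check can be scripted. A minimal sketch, assuming the bot's commits keep the `chore(bench): weekly results` subject prefix from the workflow; the helper names are hypothetical, and matching on the subject line alone says nothing about who authored a commit.

```python
import subprocess

BOT_PREFIX = "chore(bench): weekly results"  # from the workflow's commit_message


def history_log(repo_dir: str = ".") -> str:
    """Return `git log --oneline` output limited to benchmarks/history/."""
    result = subprocess.run(
        ["git", "log", "--oneline", "--", "benchmarks/history/"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_oneline(log_text: str) -> list[tuple[str, str]]:
    """Split `git log --oneline` output into (short_hash, subject) pairs."""
    commits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        short_hash, _, subject = line.partition(" ")
        commits.append((short_hash, subject))
    return commits


def non_bot_commits(commits: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Commits whose subject does not start with the bot's prefix."""
    return [c for c in commits if not c[1].startswith(BOT_PREFIX)]
```

Inside a clone, `non_bot_commits(parse_oneline(history_log()))` returns an empty list when every commit touching the directory carries the bot's subject line.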
View history on GitHub →
// methodology first

The numbers shouldn't be a mystery.

See every run. Read the scorer. Run it yourself.

$ cargo run --release --bin quality_bench -- --suite eval