HALLUCINATION RATE
0.00%
Across a 70-case adversarial bench. Zero confabulated claims this run.
Every Monday at 08:00 UTC, GitHub Actions runs the full suite, commits results, and updates this page. No editorial filter, no spin. If a number drops, you see the drop. If a number improves, it's timestamped.
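The headline figure above is a simple ratio: cases flagged as containing a confabulated claim, divided by the total number of cases. The sketch below shows that arithmetic under stated assumptions. `CaseResult`, `hallucination_rate`, and `format_rate` are hypothetical names used for illustration; the actual scorer lives in the repo and may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    confabulated: bool  # True if the scorer flagged a made-up claim (assumed field)


def hallucination_rate(results: list[CaseResult]) -> float:
    """Return the percentage of cases flagged as confabulated."""
    if not results:
        raise ValueError("no results to score")
    flagged = sum(1 for r in results if r.confabulated)
    return 100.0 * flagged / len(results)


def format_rate(rate: float) -> str:
    """Format a percentage with two decimals, as shown on the page."""
    return f"{rate:.2f}%"


# A 70-case run with no flagged cases, matching the headline scenario.
clean_run = [CaseResult(f"case-{i:02d}", False) for i in range(70)]
print(format_rate(hallucination_rate(clean_run)))  # 0.00%
```

With 70 cases, a single flagged case would move the rate to 1.43%, so each case carries visible weight in the headline number.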
| Date | Rate |
|---|---|
| 2026-04-10 | 81.43% |
| 2026-04-12 | 97.14% |
| 2026-04-15 | 91.00% |

| Date | Rate |
|---|---|
| 2026-04-10 | 100.00% |
| 2026-04-12 | 100.00% |
| 2026-04-15 | 100.00% |
    name: bench-weekly
    on:
      schedule:
        - cron: "0 8 * * 1"  # Mondays 08:00 UTC
      workflow_dispatch:
    jobs:
      bench:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: cargo run --release --bin quality_bench -- --suite eval
          - run: cargo run --release --bin quality_bench -- --suite hard
          - run: python -m benchmarks.publish --out landing/benchmarks-data.json
          - uses: stefanzweifel/git-auto-commit-action@v5
            with:
              commit_message: "chore(bench): weekly results ${{ github.run_number }}"
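The workflow's publish step writes a JSON file that the page reads. The source does not show the `benchmarks.publish` module, so the following is only a sketch of what an append-only publish step could look like. The function names, the record layout, and the example values are all assumptions for illustration, not the project's real format.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def build_record(run_date: date, suite_rates: dict[str, float]) -> dict:
    """Build one dated entry (hypothetical layout) from per-suite rates."""
    return {
        "date": run_date.isoformat(),
        "rates": {name: round(rate, 2) for name, rate in suite_rates.items()},
    }


def append_record(out_path: Path, record: dict) -> list[dict]:
    """Append a record to the JSON history, never rewriting earlier entries."""
    history = json.loads(out_path.read_text()) if out_path.exists() else []
    history.append(record)
    out_path.write_text(json.dumps(history, indent=2) + "\n")
    return history


# Demo with placeholder values in a temporary directory (not real results).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "benchmarks-data.json"
    append_record(out, build_record(date(2026, 4, 13), {"eval": 100.0, "hard": 100.0}))
    history = append_record(out, build_record(date(2026, 4, 20), {"eval": 100.0, "hard": 100.0}))
    print(len(history))  # 2
```

Appending instead of overwriting is the design choice that matters here: a drop stays visible next to the runs before and after it.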
Run history is kept in `benchmarks/history/` in the public repo. If you want to verify, clone it and run `git log --oneline benchmarks/history/`. You'll see the bot's commits. No edits.

See every run. Read the scorer. Run it yourself.
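The history check described above can also be scripted. This sketch counts commit authors for `benchmarks/history/` and lists any that are not the bot. The helper names are hypothetical, and the bot's author name depends on how the commit action is configured, so it is passed in as a parameter and not assumed.

```python
import subprocess
from collections import Counter


def count_authors(git_log_output: str) -> Counter:
    """Count author names, one per line, as printed by `git log --format=%an`."""
    return Counter(
        line.strip() for line in git_log_output.splitlines() if line.strip()
    )


def history_authors(repo_dir: str, path: str = "benchmarks/history/") -> Counter:
    """Run git in a cloned repo and count the authors of commits touching `path`."""
    completed = subprocess.run(
        ["git", "log", "--format=%an", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_authors(completed.stdout)


def non_bot_authors(authors: Counter, bot_name: str) -> list[str]:
    """Return every author other than the expected bot, sorted by name."""
    return sorted(name for name in authors if name != bot_name)
```

An empty list from `non_bot_authors(history_authors("path/to/clone"), bot_name)` means every commit under that directory carries the bot's author name. It does not rule out rewritten history, which needs a comparison against an earlier clone.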