feat(ce-review): improve signal-to-noise with confidence rubric, FP suppression, and intent verification #434
Merged
Persona agents do focused, scoped work and can use cheaper/faster models (Haiku for Claude Code, GPT-4o mini for Codex) to reduce cost and latency while keeping the orchestrator on the most capable model. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
When diffs exceed ~8k lines, instruct the orchestrator to chunk by directory/module and spawn reviewers per chunk rather than passing an oversized diff that degrades review quality. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… rules Add named confidence bands (0.0-1.0 scale with clear thresholds) to the subagent template so persona agents have calibrated scoring guidance. Add explicit false-positive categories to suppress: pedantic nitpicks, linter-catchable issues, intentional code, handled-elsewhere patterns, and generic "consider" advice without failure modes. Inspired by Anthropic's code-review skill which uses a 0-100 scale with named bands and strict false-positive filtering at 80+ threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…nt tracking When 2+ reviewers independently flag the same issue, boost merged confidence by 0.10. When reviewers disagree on severity or routing, record the disagreement in evidence for transparency. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ation Pass PR title/body/URL to every persona agent in a <pr-context> block so they can verify code changes match stated intent. Add explicit intent-verification rule: mismatches between what the PR says and what the code does are high-value findings. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
New conditional persona that checks whether prior review comments on a PR have been addressed. Fetches review threads via gh, compares against the current diff, and flags unaddressed feedback. Selected when reviewing a PR that has existing review comments. Inspired by Anthropic's code-review skill Agent #4 (previous PR comments check). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Skip or warn on PRs authored by dependency bots (dependabot, renovate, snyk-bot, github-actions). In headless/autofix mode, skip silently. In interactive mode, warn and ask for confirmation. In report-only mode, proceed with minimal review. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Users won't split their PR mid-review. Just note the chunking in Coverage instead. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… strategy Bot-PR check: since ce:review is on-demand (not CI), if a user requests a review they should get one. Keep the skip only for headless/autofix where bot PRs waste tokens with no human to benefit. Large-diff chunking: remove the invented per-module chunking strategy. It wasn't from the Anthropic reference and hasn't been battle-tested. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2191dde209
…isting rule, and author metadata
- Fix contradictory pre_existing parenthetical in subagent template to align with diff-scope.md (newly relevant = secondary, not pre-existing)
- Add P0 exception (0.50+ confidence) to Stage 5 confidence gate so critical-but-uncertain findings aren't silently dropped
- Add author field to PR metadata fetch for bot-PR eligibility check
- Update Coverage template line to reflect P0 retention
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The hardcoded bot author list (dependabot, renovate, github-actions, snyk-bot) is fragile and increasingly wrong as AI-authored PRs become common. An aider[bot] or claude-code[bot] PR is not a mechanical dependency bump. If someone requests a review, do the review. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
Reduces false-positive noise from ce:review's persona sub-agents by ~49% on clean code without any loss in sensitivity on buggy code. Adds a new reviewer persona for tracking prior review feedback across PR iterations.
What changed
Confidence rubric and false-positive suppression — The subagent template now includes an explicit 6-tier confidence rubric (0.00–1.00) with a 0.60 suppress threshold, replacing the previous per-persona "suppress below your confidence floor" instruction. Six false-positive categories are listed for active suppression: pre-existing issues, style nitpicks, intentional code patterns, issues handled elsewhere, restating existing code, and generic "consider adding" advice without a failure mode.
Intent verification via PR metadata: Sub-agents now receive a <pr-context> block with the PR title, body, and URL. They're instructed to compare code changes against stated intent and flag mismatches, catching cases where code does something the PR description doesn't mention, or fails to do something it promises.
Cross-reviewer agreement boosting: When 2+ independent reviewers flag the same issue, the merged confidence is boosted by 0.10 (capped at 1.0). Disagreements between reviewers on severity/routing are recorded in evidence for transparency.
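The agreement boost can be illustrated with a small TypeScript sketch. This is not code from the PR. The `Finding` shape, the `key` field used to decide that two findings are "the same issue", the persona names in the usage note, and the choice to start from the highest individual confidence are all assumptions made for illustration. Only the 0.10 boost and the 1.0 cap come from the description.

```typescript
// Hypothetical finding shape; field names are assumptions, not the PR's schema.
interface Finding {
  key: string;        // assumed dedup key identifying "the same issue"
  reviewer: string;   // persona that reported the finding
  confidence: number; // 0.00 to 1.00
}

interface MergedFinding {
  key: string;
  reviewers: string[];
  confidence: number;
}

const AGREEMENT_BOOST = 0.10; // from the PR description
const MAX_CONFIDENCE = 1.0;   // cap from the PR description

function mergeFindings(findings: Finding[]): MergedFinding[] {
  // Group findings by their dedup key, preserving first-seen order.
  const byKey = new Map<string, Finding[]>();
  for (const f of findings) {
    const group = byKey.get(f.key) ?? [];
    group.push(f);
    byKey.set(f.key, group);
  }

  const merged: MergedFinding[] = [];
  byKey.forEach((group, key) => {
    // Count distinct reviewers so one persona repeating itself earns no boost.
    const reviewers = Array.from(new Set(group.map((f) => f.reviewer)));
    // Assumption: the merged finding starts from the highest individual confidence.
    const base = Math.max(...group.map((f) => f.confidence));
    const boosted = reviewers.length >= 2 ? base + AGREEMENT_BOOST : base;
    // Round to two decimals to avoid floating-point residue, then apply the cap.
    const confidence = Math.min(MAX_CONFIDENCE, Math.round(boosted * 100) / 100);
    merged.push({ key, reviewers, confidence });
  });
  return merged;
}
```

Under these assumptions, two personas (say, hypothetical `correctness` and `security` reviewers) reporting the same key at 0.70 and 0.65 would merge into one finding at 0.80, while a pair at 0.95 would be capped at 1.0.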
Previous-comments reviewer: New conditional persona that checks whether prior review feedback has been addressed. Fetches PR review threads via the gh API and hunts for unaddressed comments, partially addressed feedback, and regressions of prior fixes.
Model tiering: Persona sub-agents now use cheaper/faster models (Haiku in Claude Code, GPT-4o mini in Codex) while the orchestrator stays on the default model for synthesis work.
Bot-PR filtering — Headless/autofix modes skip bot-authored PRs (dependabot, renovate, github-actions, snyk-bot) to avoid wasting tokens on mechanical dependency bumps.
Findings schema refinement — Confidence thresholds expanded from 3 tiers to 4, with a P0 exception at 0.50+.
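The confidence gate with its P0 exception can be pictured as a simple predicate. The TypeScript below is a sketch under stated assumptions, not code from the PR: the `GatedFinding` shape and the severity labels other than P0 are hypothetical, and treating a score exactly at the threshold as "kept" is a guess, since the description only gives the 0.60 and 0.50 values.

```typescript
// Only "P0" is named in the PR description; the other labels are assumed.
type Severity = "P0" | "P1" | "P2" | "P3";

// Hypothetical finding shape used only for this sketch.
interface GatedFinding {
  title: string;
  severity: Severity;
  confidence: number; // 0.00 to 1.00
}

const SUPPRESS_THRESHOLD = 0.60; // findings below this are suppressed
const P0_FLOOR = 0.50;           // P0 findings are retained at 0.50+

function passesConfidenceGate(f: GatedFinding): boolean {
  // Assumption: a score exactly at the threshold is kept.
  if (f.confidence >= SUPPRESS_THRESHOLD) return true;
  // P0 exception: critical-but-uncertain findings are not silently dropped.
  return f.severity === "P0" && f.confidence >= P0_FLOOR;
}

function applyGate(findings: GatedFinding[]): GatedFinding[] {
  return findings.filter(passesConfidenceGate);
}
```

With this predicate, a P1 finding at 0.55 is suppressed, while a P0 finding at the same confidence survives.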
Benchmark results
Tested with 3 runs per configuration across two eval types: a planted-bug diff (real TypeScript code with known correctness issues) and a large-refactor diff (intentional cross-cutting changes with no bugs).
Sensitivity (does it still catch real bugs?)
- deduplicateAgents no-op filter (logic error)
- sanitizeDescription off-by-one (contract violation)
Specificity (does it reduce noise?)
Efficiency
Test plan
bun test)
🤖 Generated with Claude Opus 4.6 (1M context) via Claude Code