feat(ce-review): improve signal-to-noise with confidence rubric, FP suppression, and intent verification #434

Merged: tmchow merged 11 commits into main from feat/review-skill-improvements on Mar 29, 2026

@tmchow tmchow commented Mar 29, 2026

Summary

Reduces false-positive noise from ce:review's persona sub-agents by ~49% on clean code without any loss in sensitivity on buggy code. Adds a new reviewer persona for tracking prior review feedback across PR iterations.

What changed

Confidence rubric and false-positive suppression — The subagent template now includes an explicit 6-tier confidence rubric (0.00–1.00) with a 0.60 suppress threshold, replacing the previous per-persona "suppress below your confidence floor" instruction. Six false-positive categories are listed for active suppression: pre-existing issues, style nitpicks, intentional code patterns, issues handled elsewhere, restating existing code, and generic "consider adding" advice without a failure mode.
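The rubric and suppression list are prompt instructions for the sub-agents, not code, but the rule they describe can be sketched in TypeScript. This is a minimal illustration under stated assumptions: the `CandidateFinding` shape, the category identifiers, and the function names are hypothetical, and treating exactly 0.60 as "kept" is a guess about the boundary.

```typescript
// Hypothetical identifiers paraphrasing the six false-positive categories
// listed in the PR description.
type FalsePositiveCategory =
  | "pre-existing"
  | "style-nitpick"
  | "intentional-pattern"
  | "handled-elsewhere"
  | "restates-code"
  | "generic-advice";

// Hypothetical finding shape; the real findings schema is not shown here.
interface CandidateFinding {
  title: string;
  confidence: number; // 0.00 to 1.00, per the rubric
  falsePositiveCategory?: FalsePositiveCategory;
}

// Suppress threshold stated in the PR description.
const SUPPRESS_THRESHOLD = 0.6;

// A finding is dropped if it scores below the threshold or has been
// tagged with one of the false-positive categories.
function shouldSuppress(f: CandidateFinding): boolean {
  return f.confidence < SUPPRESS_THRESHOLD || f.falsePositiveCategory !== undefined;
}

function applySuppression(findings: CandidateFinding[]): CandidateFinding[] {
  return findings.filter((f) => !shouldSuppress(f));
}
```

In practice the persona agent applies this judgment itself while writing findings; the sketch only makes the two conditions explicit.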

Intent verification via PR metadata — Sub-agents now receive a <pr-context> block with the PR title, body, and URL. They're instructed to compare code changes against stated intent and flag mismatches — catching cases where code does something the PR description doesn't mention, or fails to do something it promises.
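As a rough sketch of how such a block might be assembled before it is handed to a sub-agent: only the `<pr-context>` tag name and the three fields (title, body, URL) come from the PR description. The `PrMetadata` type, the `buildPrContext` helper, and the line layout inside the block are assumptions for illustration.

```typescript
// Hypothetical container for the three fields named in the PR description.
interface PrMetadata {
  title: string;
  body: string;
  url: string;
}

// Wrap the PR metadata in a <pr-context> block. The inner layout
// ("Title:", "URL:", "Body:") is an assumption, not the skill's format.
function buildPrContext(pr: PrMetadata): string {
  return [
    "<pr-context>",
    `Title: ${pr.title}`,
    `URL: ${pr.url}`,
    "Body:",
    pr.body.trim(),
    "</pr-context>",
  ].join("\n");
}
```

A reviewer prompt could then append the diff after this block, so the agent reads the stated intent before the changes it is asked to check.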

Cross-reviewer agreement boosting — When 2+ independent reviewers flag the same issue, the merged confidence is boosted by 0.10 (capped at 1.0). Disagreements between reviewers on severity/routing are recorded in evidence for transparency.
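The boosting arithmetic can be illustrated with a short sketch. The 0.10 boost, the 2+ reviewer condition, and the 1.0 cap come from the description above. Everything else is assumed: the `fingerprint` key used to decide that two findings are "the same issue", the type and function names, and the choice of the highest individual confidence as the base before boosting.

```typescript
// Hypothetical per-reviewer finding; `fingerprint` stands in for whatever
// key the orchestrator uses to recognise the same issue.
interface ReviewerFinding {
  reviewer: string;
  fingerprint: string;
  confidence: number;
}

interface MergedFinding {
  fingerprint: string;
  confidence: number;
  reviewers: string[];
}

const AGREEMENT_BOOST = 0.1; // from the PR description

function mergeFindings(findings: ReviewerFinding[]): MergedFinding[] {
  // Group findings by fingerprint.
  const groups = new Map<string, ReviewerFinding[]>();
  for (const f of findings) {
    const group = groups.get(f.fingerprint) ?? [];
    group.push(f);
    groups.set(f.fingerprint, group);
  }

  return Array.from(groups.entries()).map(([fingerprint, group]) => {
    // Count distinct reviewers so one reviewer repeating itself gets no boost.
    const reviewers = Array.from(new Set(group.map((f) => f.reviewer)));
    // Assumption: the merged base is the highest individual confidence.
    const base = Math.max(...group.map((f) => f.confidence));
    const boosted = reviewers.length >= 2 ? base + AGREEMENT_BOOST : base;
    // Cap at 1.0 and round to two decimals to avoid floating-point residue.
    const confidence = Math.round(Math.min(1.0, boosted) * 100) / 100;
    return { fingerprint, confidence, reviewers };
  });
}
```

Under these assumptions, two reviewers reporting the same issue at 0.70 and 0.65 merge to 0.80, while a pair at 0.95 and 0.90 is capped at 1.0.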

Previous-comments reviewer — New conditional persona that checks whether prior review feedback has been addressed. Fetches PR review threads via gh API and hunts for unaddressed comments, partially addressed feedback, and regressions of prior fixes.

Model tiering — Persona sub-agents now use cheaper/faster models (Haiku in Claude Code, GPT-4o mini in Codex) while the orchestrator stays on the default model for synthesis work.

Bot-PR filtering — Headless/autofix modes skip bot-authored PRs (dependabot, renovate, github-actions, snyk-bot) to avoid wasting tokens on mechanical dependency bumps.

Findings schema refinement — Confidence thresholds expanded from 3 tiers to 4, with a P0 exception at 0.50+.
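One way to read the P0 exception, combined with the 0.60 suppress threshold mentioned earlier, is as a severity-dependent floor. The sketch below is an interpretation, not the skill's implementation: the severity labels other than P0, the `GatedFinding` shape, and the function name are assumptions.

```typescript
// Only P0 is named in the PR description; the other labels are assumed.
type Severity = "P0" | "P1" | "P2" | "P3";

// Hypothetical finding shape for the confidence gate.
interface GatedFinding {
  severity: Severity;
  confidence: number;
}

const DEFAULT_FLOOR = 0.6; // general suppress threshold
const P0_FLOOR = 0.5; // P0 exception: keep at 0.50 and above

// Critical findings pass at a lower confidence so that
// critical-but-uncertain issues are not silently dropped.
function passesConfidenceGate(f: GatedFinding): boolean {
  const floor = f.severity === "P0" ? P0_FLOOR : DEFAULT_FLOOR;
  return f.confidence >= floor;
}
```

The asymmetry reflects the cost of a miss: a wrongly dropped P0 is more expensive than one extra finding a human has to dismiss.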

Benchmark results

Tested with 3 runs per configuration across two eval types: a planted-bug diff (real TypeScript code with known correctness issues) and a large-refactor diff (intentional cross-cutting changes with no bugs).

Sensitivity (does it still catch real bugs?)

| Planted Bug | New Skill | Old Skill |
| --- | --- | --- |
| deduplicateAgents no-op filter (logic error) | 3/3 caught | 3/3 caught |
| sanitizeDescription off-by-one (contract violation) | 3/3 caught | 3/3 caught |

Specificity (does it reduce noise?)

| Eval | New Skill (mean +/- std) | Old Skill (mean +/- std) | Delta |
| --- | --- | --- | --- |
| Planted-bug (real bugs) | 5.0 findings | 4.7 findings | comparable |
| Large-refactor (clean code) | 1.7 +/- 0.9 | 3.3 +/- 0.5 | -49% |

Efficiency

| Metric | New Skill | Old Skill | Delta |
| --- | --- | --- | --- |
| Tokens/run | 69.4K | 71.9K | -3.6% |
| Duration/run | 48.7s | 53.6s | -9.2% |
| Tool uses/run | 8.9 | 10.6 | -16.5% |

Test plan

  • Planted-bug eval: both planted bugs caught in 100% of runs (3/3 new, 3/3 old)
  • Large-refactor eval: new skill produces ~49% fewer findings on intentional changes
  • Feature-intent eval: new skill produces 0 findings vs 3 false positives from old skill
  • Existing contract tests pass (bun test)

Compound Engineering v2.58.1
🤖 Generated with Claude Opus 4.6 (1M context) via Claude Code

tmchow and others added 9 commits March 28, 2026 20:55

Persona agents do focused, scoped work and can use cheaper/faster models
(Haiku for Claude Code, GPT-4o mini for Codex) to reduce cost and
latency while keeping the orchestrator on the most capable model.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

When diffs exceed ~8k lines, instruct the orchestrator to chunk by
directory/module and spawn reviewers per chunk rather than passing an
oversized diff that degrades review quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… rules

Add named confidence bands (0.0-1.0 scale with clear thresholds) to the
subagent template so persona agents have calibrated scoring guidance.
Add explicit false-positive categories to suppress: pedantic nitpicks,
linter-catchable issues, intentional code, handled-elsewhere patterns,
and generic "consider" advice without failure modes.

Inspired by Anthropic's code-review skill which uses a 0-100 scale with
named bands and strict false-positive filtering at 80+ threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…nt tracking

When 2+ reviewers independently flag the same issue, boost merged
confidence by 0.10. When reviewers disagree on severity or routing,
record the disagreement in evidence for transparency.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…ation

Pass PR title/body/URL to every persona agent in a <pr-context> block
so they can verify code changes match stated intent. Add explicit
intent-verification rule: mismatches between what the PR says and what
the code does are high-value findings.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

New conditional persona that checks whether prior review comments on a
PR have been addressed. Fetches review threads via gh, compares against
the current diff, and flags unaddressed feedback. Selected when reviewing
a PR that has existing review comments.

Inspired by Anthropic's code-review skill Agent #4 (previous PR
comments check).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Skip or warn on PRs authored by dependency bots (dependabot, renovate,
snyk-bot, github-actions). In headless/autofix mode, skip silently. In
interactive mode, warn and ask for confirmation. In report-only mode,
proceed with minimal review.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Users won't split their PR mid-review. Just note the chunking in
Coverage instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… strategy

Bot-PR check: since ce:review is on-demand (not CI), if a user requests
a review they should get one. Keep the skip only for headless/autofix
where bot PRs waste tokens with no human to benefit.

Large-diff chunking: remove the invented per-module chunking strategy.
It wasn't from the Anthropic reference and hasn't been battle-tested.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2191dde209

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/references/subagent-template.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
tmchow and others added 2 commits March 29, 2026 00:29
…isting rule, and author metadata

- Fix contradictory pre_existing parenthetical in subagent template
  to align with diff-scope.md (newly relevant = secondary, not pre-existing)
- Add P0 exception (0.50+ confidence) to Stage 5 confidence gate
  so critical-but-uncertain findings aren't silently dropped
- Add author field to PR metadata fetch for bot-PR eligibility check
- Update Coverage template line to reflect P0 retention

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

The hardcoded bot author list (dependabot, renovate, github-actions,
snyk-bot) is fragile and increasingly wrong as AI-authored PRs become
common. An aider[bot] or claude-code[bot] PR is not a mechanical
dependency bump. If someone requests a review, do the review.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@tmchow tmchow merged commit 03f5aa6 into main Mar 29, 2026
2 checks passed
@github-actions github-actions Bot mentioned this pull request Mar 29, 2026