tmchow · 2026-03-29T07:21:53Z

Summary

Reduces false-positive noise from ce:review's persona sub-agents by ~49% on clean code without any loss in sensitivity on buggy code. Adds a new reviewer persona for tracking prior review feedback across PR iterations.

What changed

Confidence rubric and false-positive suppression — The subagent template now includes an explicit 6-tier confidence rubric (0.00–1.00) with a 0.60 suppress threshold, replacing the previous per-persona "suppress below your confidence floor" instruction. Six false-positive categories are listed for active suppression: pre-existing issues, style nitpicks, intentional code patterns, issues handled elsewhere, restating existing code, and generic "consider adding" advice without a failure mode.
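As a rough illustration of how such a gate could behave, here is a minimal TypeScript sketch. The `Finding` shape, the category identifiers, and the P1 to P3 severity names are assumptions made for this example, not the actual findings schema. The 0.60 threshold comes from the paragraph above, and the 0.50 P0 exception comes from the "Findings schema refinement" item in this description. Treating a confidence of exactly 0.60 as "kept" is also an assumption.

```typescript
// Sketch only: the field names and category identifiers below are
// hypothetical and are not taken from the real findings schema.
type Severity = "P0" | "P1" | "P2" | "P3";

type FalsePositiveCategory =
  | "pre_existing"
  | "style_nitpick"
  | "intentional_pattern"
  | "handled_elsewhere"
  | "restates_code"
  | "generic_advice";

interface Finding {
  title: string;
  severity: Severity;
  confidence: number; // 0.00 to 1.00
  falsePositiveCategory?: FalsePositiveCategory;
}

const SUPPRESS_THRESHOLD = 0.6; // suppress threshold named in the PR description
const P0_THRESHOLD = 0.5; // P0 exception named in the PR description

// A finding is suppressed if it falls in a listed false-positive category
// or if its confidence is below the threshold for its severity.
function shouldSuppress(finding: Finding): boolean {
  if (finding.falsePositiveCategory !== undefined) {
    return true;
  }
  const threshold =
    finding.severity === "P0" ? P0_THRESHOLD : SUPPRESS_THRESHOLD;
  return finding.confidence < threshold;
}

function applyConfidenceGate(findings: Finding[]): Finding[] {
  return findings.filter((finding) => !shouldSuppress(finding));
}
```

In this sketch a P2 finding at 0.59 is dropped, while a P0 finding at 0.50 survives, so critical but uncertain findings are not silently lost.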

Intent verification via PR metadata — Sub-agents now receive a <pr-context> block with the PR title, body, and URL. They're instructed to compare code changes against stated intent and flag mismatches — catching cases where code does something the PR description doesn't mention, or fails to do something it promises.
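One way the orchestrator might assemble that block is sketched below. Only the `<pr-context>` tag name and the three fields (title, body, URL) come from the description above. The `PrMetadata` interface, the `buildPrContext` helper, and the line layout inside the block are assumptions for illustration.

```typescript
// Sketch only: the layout inside <pr-context> is an assumption; the PR
// description specifies just the tag name and the three fields.
interface PrMetadata {
  title: string;
  body: string;
  url: string;
}

function buildPrContext(pr: PrMetadata): string {
  const body = pr.body.trim();
  return [
    "<pr-context>",
    `Title: ${pr.title}`,
    `URL: ${pr.url}`,
    "Body:",
    // An empty description is itself a signal, so make it explicit.
    body.length > 0 ? body : "(no description provided)",
    "</pr-context>",
  ].join("\n");
}
```

The resulting string would be prepended to each persona prompt so every sub-agent compares the diff against the same statement of intent.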

Cross-reviewer agreement boosting — When 2+ independent reviewers flag the same issue, the merged confidence is boosted by 0.10 (capped at 1.0). Disagreements between reviewers on severity/routing are recorded in evidence for transparency.
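The merge step might look roughly like the following TypeScript sketch. The 0.10 boost, the 1.0 cap, and the recording of severity disagreements in evidence come from the paragraph above. Everything else is assumed for the example: the `fingerprint` key used to decide that two findings are "the same issue", taking the highest reviewer confidence as the base, and keeping the first reviewer's severity.

```typescript
// Sketch only: the interfaces, the fingerprint key, and the choice of
// "max confidence as base" are assumptions, not the skill's actual logic.
interface ReviewerFinding {
  reviewer: string;
  fingerprint: string; // hypothetical key identifying "the same issue"
  severity: string;
  confidence: number;
  evidence: string[];
}

interface MergedFinding {
  fingerprint: string;
  severity: string;
  confidence: number;
  reviewers: string[];
  evidence: string[];
}

const AGREEMENT_BOOST = 0.1;

function mergeFindings(findings: ReviewerFinding[]): MergedFinding[] {
  // Group findings that share a fingerprint.
  const groups = new Map<string, ReviewerFinding[]>();
  for (const finding of findings) {
    const group = groups.get(finding.fingerprint) || [];
    group.push(finding);
    groups.set(finding.fingerprint, group);
  }

  const merged: MergedFinding[] = [];
  groups.forEach((group, fingerprint) => {
    const reviewers = Array.from(new Set(group.map((f) => f.reviewer)));
    const severities = Array.from(new Set(group.map((f) => f.severity)));
    const base = Math.max(...group.map((f) => f.confidence));
    // Boost only when at least two distinct reviewers agree; cap at 1.0.
    const boosted =
      reviewers.length >= 2 ? Math.min(1, base + AGREEMENT_BOOST) : base;

    const evidence: string[] = [];
    for (const f of group) {
      evidence.push(...f.evidence);
    }
    if (severities.length > 1) {
      evidence.push(`Reviewers disagreed on severity: ${severities.join(", ")}`);
    }

    merged.push({
      fingerprint,
      severity: severities[0],
      // Round to two decimals to avoid floating-point noise such as 0.7999.
      confidence: Math.round(boosted * 100) / 100,
      reviewers,
      evidence,
    });
  });
  return merged;
}
```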

Previous-comments reviewer — New conditional persona that checks whether prior review feedback has been addressed. Fetches PR review threads via gh API and hunts for unaddressed comments, partially addressed feedback, and regressions of prior fixes.
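A rough sketch of the data flow follows, under several assumptions. `reviewCommentsArgs` builds arguments for `gh api` against GitHub's REST endpoint for review comments on a pull request (first page only); the PR description does not say which endpoint the persona actually calls. The `candidateUnaddressed` heuristic and the `PrRef` type are hypothetical. Note that the REST comment payload carries no "resolved" flag, so a real implementation would likely need more than this.

```typescript
// Subset of the fields returned by the REST "list review comments on a
// pull request" endpoint; only the fields used here are declared.
interface ReviewComment {
  id: number;
  path: string;
  body: string;
  in_reply_to_id?: number | null;
}

// Hypothetical helper type for this example.
interface PrRef {
  owner: string;
  repo: string;
  number: number;
}

// Arguments to pass to the `gh` CLI. Nothing is executed in this sketch.
function reviewCommentsArgs(pr: PrRef): string[] {
  return ["api", `repos/${pr.owner}/${pr.repo}/pulls/${pr.number}/comments`];
}

// Hypothetical heuristic: a top-level comment (not a reply) is a candidate
// for "unaddressed" when its file has not been touched since the review.
function candidateUnaddressed(
  comments: ReviewComment[],
  pathsChangedSinceReview: string[],
): ReviewComment[] {
  const changed = new Set(pathsChangedSinceReview);
  return comments.filter(
    (comment) => comment.in_reply_to_id == null && !changed.has(comment.path),
  );
}
```

Candidates from a filter like this would still need the persona's judgment, since a comment can be addressed by a change in a different file.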

Model tiering — Persona sub-agents now use cheaper/faster models (Haiku in Claude Code, GPT-4o mini in Codex) while the orchestrator stays on the default model for synthesis work.

Bot-PR filtering — Headless/autofix modes skip bot-authored PRs (dependabot, renovate, github-actions, snyk-bot) to avoid wasting tokens on mechanical dependency bumps.

Findings schema refinement — Confidence thresholds expanded from 3 tiers to 4, with a P0 exception at 0.50+.

Benchmark results

Tested with 3 runs per configuration across two eval types: a planted-bug diff (real TypeScript code with known correctness issues) and a large-refactor diff (intentional cross-cutting changes with no bugs).

Sensitivity (does it still catch real bugs?)

Planted Bug	New Skill	Old Skill
`deduplicateAgents` no-op filter (logic error)	3/3 caught	3/3 caught
`sanitizeDescription` off-by-one (contract violation)	3/3 caught	3/3 caught

Specificity (does it reduce noise?)

Eval	New Skill (mean +/- std)	Old Skill (mean +/- std)	Delta
Planted-bug (real bugs)	5.0 findings	4.7 findings	comparable
Large-refactor (clean code)	1.7 +/- 0.9	3.3 +/- 0.5	-49%

Efficiency

Metric	New Skill	Old Skill	Delta
Tokens/run	69.4K	71.9K	-3.6%
Duration/run	48.7s	53.6s	-9.2%
Tool uses/run	8.9	10.6	-16.5%

Test plan

Planted-bug eval: both planted bugs caught in 100% of runs (3/3 new, 3/3 old)
Large-refactor eval: new skill produces ~49% fewer findings on intentional changes
Feature-intent eval: new skill produces 0 findings vs 3 false positives from old skill
Existing contract tests pass (bun test)

🤖 Generated with Claude Opus 4.6 (1M context) via Claude Code

Persona agents do focused, scoped work and can use cheaper/faster models (Haiku for Claude Code, GPT-4o mini for Codex) to reduce cost and latency while keeping the orchestrator on the most capable model. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

When diffs exceed ~8k lines, instruct the orchestrator to chunk by directory/module and spawn reviewers per chunk rather than passing an oversized diff that degrades review quality. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… rules Add named confidence bands (0.0-1.0 scale with clear thresholds) to the subagent template so persona agents have calibrated scoring guidance. Add explicit false-positive categories to suppress: pedantic nitpicks, linter-catchable issues, intentional code, handled-elsewhere patterns, and generic "consider" advice without failure modes. Inspired by Anthropic's code-review skill which uses a 0-100 scale with named bands and strict false-positive filtering at 80+ threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…nt tracking When 2+ reviewers independently flag the same issue, boost merged confidence by 0.10. When reviewers disagree on severity or routing, record the disagreement in evidence for transparency. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…ation Pass PR title/body/URL to every persona agent in a <pr-context> block so they can verify code changes match stated intent. Add explicit intent-verification rule: mismatches between what the PR says and what the code does are high-value findings. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

New conditional persona that checks whether prior review comments on a PR have been addressed. Fetches review threads via gh, compares against the current diff, and flags unaddressed feedback. Selected when reviewing a PR that has existing review comments. Inspired by Anthropic's code-review skill Agent #4 (previous PR comments check). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Skip or warn on PRs authored by dependency bots (dependabot, renovate, snyk-bot, github-actions). In headless/autofix mode, skip silently. In interactive mode, warn and ask for confirmation. In report-only mode, proceed with minimal review. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Users won't split their PR mid-review. Just note the chunking in Coverage instead. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

… strategy Bot-PR check: since ce:review is on-demand (not CI), if a user requests a review they should get one. Keep the skip only for headless/autofix where bot PRs waste tokens with no human to benefit. Large-diff chunking: remove the invented per-module chunking strategy. It wasn't from the Anthropic reference and hasn't been battle-tested. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2191dde209

…isting rule, and author metadata

- Fix contradictory pre_existing parenthetical in subagent template to align with diff-scope.md (newly relevant = secondary, not pre-existing)
- Add P0 exception (0.50+ confidence) to Stage 5 confidence gate so critical-but-uncertain findings aren't silently dropped
- Add author field to PR metadata fetch for bot-PR eligibility check
- Update Coverage template line to reflect P0 retention

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

The hardcoded bot author list (dependabot, renovate, github-actions, snyk-bot) is fragile and increasingly wrong as AI-authored PRs become common. An aider[bot] or claude-code[bot] PR is not a mechanical dependency bump. If someone requests a review, do the review. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

tmchow and others added 9 commits March 28, 2026 20:55

fix(ce-review): remove unrealistic large-diff warning suggestion

e80072a

chatgpt-codex-connector Bot reviewed Mar 29, 2026

Comment thread plugins/compound-engineering/skills/ce-review/references/subagent-template.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/references/findings-schema.json

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

tmchow and others added 2 commits March 29, 2026 00:29

tmchow merged commit 03f5aa6 into main Mar 29, 2026
2 checks passed

github-actions Bot mentioned this pull request Mar 29, 2026

chore: release main #435

Merged

feat(ce-review): improve signal-to-noise with confidence rubric, FP suppression, and intent verification #434

tmchow merged 11 commits into main from feat/review-skill-improvements
