fix(ce-plan): inline synthesis gate output into SKILL.md #822
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a24e092abe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…plate
Address PR review feedback (#822)
The inlined solo template dropped the "(You can also redirect to /ce-brainstorm if this is bigger than you initially thought...)" parenthetical that the reference template carries. Inlining was supposed to make the gate fire reliably without the reference loading, so dropping the escape-hatch line from the inline copy weakened the guardrail it was meant to preserve. Sync the inline template back to the reference's wording.
The Phase 0.7 / 5.1.5 synthesis gate was being skipped silently when the synthesis-summary.md reference did not load: the templates and mandatory-announce rule lived only there, behind a "STOP, read this" indirection that the agent could (and did) skip.
Move the load-bearing pieces inline so the gate fires reliably even on a load-failure case, and reorder the gate block so the reference-load instruction is the first step. The reference now provides best-effort quality guidance for call-out shaping; the gate itself no longer depends on it loading.
Also fix the inlined templates:
- Replace "Phase 0.4 bootstrap" / "Phase 1 research" with user-facing language (users do not track phase numbers).
- Reduce two-bullet placeholders to a single placeholder with explicit count guidance (the multi-bullet placeholder biased toward a fixed count).
- Add purpose context to the Stated / Inferred / Out of scope bucket names so it is clear they drive plan-body routing rather than chat output.
Apply the shape and discipline changes from ce-brainstorm's scoping-synthesis fix (#829) to ce-plan's Phase 0.7 / 5.1.5:
- Tier guard on auto-proceed: Lightweight + zero call-outs is the only path that skips the confirmation gate. Standard and Deep plans always fire the confirmation gate even with zero call-outs, because substance earns the checkpoint. A 1-3 line summary on a Deep plan is exactly the rubber-stamping case the gate is supposed to prevent.
- Confirmation phrasing names what happens on confirm ("Confirm and I'll proceed to research, drawing on this scope" / "Confirm and I'll write the plan next..."), replacing the ambiguous "Confirm to proceed."
- Detail test for each surviving call-out and summary bullet: 1-2 lines max, conversational not documentary. The count cap was gameable without it: three call-outs could each be a 6-line paragraph and still "fit."
- Re-cut rule extended to fire on detail overflow, not just count overflow.
- Summary form is flexible: prose, bullets, or mix, whichever communicates best. Tier-aware budgets (Lightweight 1-3 lines; Standard 3-5 lines or 2-4 bullets; Deep 4-6 lines or 3-6 bullets).
- Rename "Scope Summary" / "Synthesis Summary" to "Scoping Synthesis" for parity with ce-brainstorm's terminology.
- Soft-cut option wording updated per the parity note in #819 (the "redirect" verb collided with the unrelated self-redirect mechanism).
Skill doc updated: the Quick Example referenced "short prose summary" and "the gate skips when there are no forks worth flagging," both of which would mislead a reader under the new behavior.
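The tier guard and detail test above can be sketched as a small decision function. This is an illustrative sketch only: the gate lives in SKILL.md prose rather than code, and the names `Tier`, `gate_action`, `SUMMARY_LINE_CEILING`, and `overflowing_items` are hypothetical, not identifiers from the repo.

```python
from enum import Enum


class Tier(Enum):
    LIGHTWEIGHT = "lightweight"
    STANDARD = "standard"
    DEEP = "deep"


# Line budgets for the summary, taken from the commit message above.
SUMMARY_LINE_CEILING = {Tier.LIGHTWEIGHT: 3, Tier.STANDARD: 5, Tier.DEEP: 6}


def gate_action(tier: Tier, callout_count: int) -> str:
    """Lightweight + zero call-outs is the only path that skips confirmation."""
    if tier is Tier.LIGHTWEIGHT and callout_count == 0:
        return "auto-proceed"
    return "confirm"


def overflowing_items(items: list[str], max_lines: int = 2) -> list[str]:
    """Detail test: return call-outs or bullets longer than 1-2 lines."""
    return [item for item in items if len(item.splitlines()) > max_lines]
```

Under this reading, a Deep plan with zero call-outs still returns `"confirm"`, and any non-empty result from `overflowing_items` would trigger the re-cut rule on detail overflow.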
322c521 to 0c0afc8
The brainstorm-sourced synthesis was producing plan-pitch outputs that read like a Table of Contents for the plan body: enumerating Implementation Units, restating brainstorm constraints, and accounting for how deferred-Qs route into plan sections. None of that gives the user something to push back on; it just rubber-stamps work the brainstorm already validated.
Restructure the brainstorm-sourced synthesis into two distinct content sections plus call-outs:
1. Brainstorm-scope restatement (1-2 sentences). The user wrote this content, but the synthesis may be read days later or in parallel with other plans. The restatement is the topic anchor that names which artifact is being planned against, in the brainstorm's own vocabulary. Not an enumeration.
2. Plan-specific scoping (prose or bullets). What this plan covers vs. defers vs. expands relative to the brainstorm: staging decisions, test scope, adjacent refactors. This is the part the user can actively push back on at plan-time.
Solo plans have no upstream and the summary is a single scope claim.
Other changes:
- Tier budgets are reframed as ceilings, not targets. Filling the budget when there is nothing more substantive to say produces noise.
- Source-document vocabulary discipline: when a brainstorm exists, use its terms; do not invent agent-coded shorthand like "skill-instruction shape" or "hooks engine selection at Step 2a entry" that forces the user to flip back and translate.
- Both templates renamed and restructured to communicate the new shape via placeholder hints.
…scipline
A test run of the new two-paragraph synthesis still surfaced plan-pitch leakage in three patterns the rules didn't yet block:
1. The agent claimed "one PR", a sequencing decision plan-write produces, not something knowable at synthesis time.
2. "Plan-specific scoping" was enumerating where the implementation reaches into the codebase (file paths, Implementation Unit inventory) instead of stating scope-claim decisions.
3. Call-outs kept the 3-5 line "name fork, explain A, explain B, my default is X" rationale-dump shape, which is exactly what belongs in Key Technical Decisions in the plan body.
Encode the underlying rule explicitly: the synthesis is composed before plan-write, so it can only surface what the agent knows from the brainstorm + research + posture commitments. Implementation Unit boundaries, PR count, commit/branch sequencing, effort estimates, and exact file paths are all plan-write outputs the synthesis cannot honestly claim. Even when the agent has formed plan-write opinions earlier in the session, those stay internal until plan-write.
Other refinements:
- Reword plan-specific scoping from "what this plan covers vs defers vs expands" to "scope-level decisions"; the "covers" framing was pulling agents toward inventory.
- Make plan-specific scoping items pass the same affirmability test as call-outs: the user can affirm or redirect without reading code.
- Strengthen the call-out template placeholder to forbid multi-sentence rationale and "my default is X" pitches.
- Generalize the bare-ID anti-pattern in source-vocabulary discipline (AE4, R6, F3 all flip the user back to the brainstorm).
A test run showed two wording patterns the prior rules didn't block:
1. Bare ID references resurfaced ("AE1-AE3", "AE4", "AE5", "R6")
even when the cases were already named in plain terms in the same
sentence. Strengthen the source-vocabulary rule into a mechanical
pre-emit scan: before emitting, look for `AE\d+`, `R\d+`, `F\d+`,
`A\d+`, `U\d+` patterns and replace with plain names.
2. Numerical attestation ("all nine requirements, all three flows,
all five acceptance examples") read as the agent showing its
work — "covers the full brainstorm scope" already conveys the
claim and the count adds nothing the user can affirm. Add as a
named anti-pattern alongside synthesis-as-plan-pitch.
Both are wording-polish refinements on top of structural rules that
are now landing. Reference-only changes; no SKILL.md inline updates
needed since these refine quality, not gate firing.
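The bare-ID pre-emit scan described above amounts to a regex pass over the draft synthesis. A minimal sketch, assuming the scan were done in code rather than by the agent following the prose rule; `BARE_ID` and `find_bare_ids` are hypothetical names, and the word-boundary anchors are an added assumption so that tokens such as `PR822` are not flagged.

```python
import re

# Prefixes come from the commit message: AE\d+, R\d+, F\d+, A\d+, U\d+.
# The \b anchors are an assumption added to avoid matching inside longer words.
BARE_ID = re.compile(r"\b(?:AE|R|F|A|U)\d+\b")


def find_bare_ids(text: str) -> list[str]:
    """Return every bare ID reference that should be replaced with a plain name."""
    return BARE_ID.findall(text)
```

For example, `find_bare_ids("Covers AE1-AE3 and R6")` returns `["AE1", "AE3", "R6"]`; each hit would be rewritten in plain terms before the synthesis is emitted.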
Fresh-session test still produced "The touch surface is X (subpaths), Y
(subpaths), Z..." enumeration in paragraph 2 plus "all 9 requirements"
numerical attestation. The rules forbidding both were in the reference
but the touch-surface prohibition was buried in a comma-separated list
of NOTs and the file-path scan didn't exist yet.
Promote both to load-reliable inline placement in SKILL.md and add the
file-path scan as a pre-emit mechanical check:
- "Do NOT enumerate the touch surface" gets its own bold inline
paragraph in both Phase 0.7 and Phase 5.1.5. Names trigger phrases
("The touch surface is...", "This plan touches...", "The
implementation reaches into...", "Files modified include...") so the
agent recognizes the pattern even when the buried rule misses.
- Pre-emit scan rule expanded from bare-IDs-only to bare-IDs + file
paths. Same mechanical shape: before emitting, scan for `path/like.md`
/ `path/like.py` patterns and cut unless the path IS the topic of an
explicit fork in the call-outs.
- Reference section reorganized: source-vocabulary covers vocab choice;
a separate "Pre-emit mechanical checks" bullet groups both scans
with examples of allowed vs forbidden path usage.
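The expanded pre-emit check (trigger phrases plus file paths) can be sketched the same way. This is a hedged illustration, not the skill's implementation: `TRIGGER_PHRASES` lists the four phrases named above, while `PATH_LIKE`, `pre_emit_violations`, and the `fork_paths` parameter are hypothetical. The path regex is a heuristic that assumes at least one directory segment and an extension starting with a letter.

```python
import re

# Trigger phrases quoted in the commit message above.
TRIGGER_PHRASES = (
    "The touch surface is",
    "This plan touches",
    "The implementation reaches into",
    "Files modified include",
)

# Heuristic for path/like.md or path/like.py style tokens (assumption:
# one or more directory segments, then a file name with a letter-led extension).
PATH_LIKE = re.compile(r"(?:[\w.-]+/)+[\w-]+\.[A-Za-z]\w*")


def pre_emit_violations(text, fork_paths=frozenset()):
    """Return (trigger phrases found, file paths to cut).

    Paths in fork_paths are allowed because the path itself is the topic
    of an explicit fork in the call-outs.
    """
    lowered = text.lower()
    phrases = [p for p in TRIGGER_PHRASES if p.lower() in lowered]
    paths = [p for p in PATH_LIKE.findall(text) if p not in fork_paths]
    return phrases, paths
```

Passing `fork_paths={"path/like.md"}` models the exception in the rule: that path survives the scan, while any other path in the draft is still reported.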
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08476de9a7
Address PR review feedback (#822)
I'd copy-pasted the same "Synthesis is pre-plan-write" rule into both Phase 0.7 (solo) and Phase 5.1.5 (brainstorm-sourced), naming "brainstorm + research + agent posture" as the inputs available at synthesis time. That's correct for Phase 5.1.5 (which fires after Phase 1 research and from an upstream brainstorm doc), but wrong for Phase 0.7: the solo variant fires when no brainstorm exists AND before Phase 1 research runs. Naming sources that aren't there can push the agent to fabricate grounding or overstate confidence.
Phase 0.7 now names what's actually available (the user's request, the Phase 0.4 bootstrap dialogue, and the agent's internal three-bucket draft) and explicitly says Phase 1 research has not happened yet and there is no upstream brainstorm. Phase 5.1.5 unchanged; its phrasing is accurate for that variant.
Adds an eval suite that tests whether ce-sessions findings preserve terminology resolution context: specifically, whether distinctive coined terms and their resolution rationale survive the session-historian synthesis step intact.
Four test cases with ground truth from recently merged PRs:
- synthesis-gate-recovery (PR #822): distinctive term recovery
- mode-headless-semantic-alignment (PR #813): multi-piece nuance
- tangential-term-recovery: indexing-gap test
- near-miss-false-positive: discriminating-power test
Two-stage grader: programmatic substring match per criticality tier, plus LLM-graded context preservation. Variance protocol: 3 runs per eval.
This suite was built during PR #838's design exploration to validate a load-bearing assumption (that ce-sessions findings could feed ce-compound Phase 2.4's vocabulary scan). That assumption was ultimately retired in favor of doc-and-conversation-only scanning, so the suite is not load-bearing for PR #838. Kept as future infrastructure for validating ce-sessions's behavior as the skill evolves, e.g., when changing the session-historian synthesis prompt or adjusting scan-window defaults.
Iteration-1 results (executed via skill-creator framework, captured to /tmp/compound-engineering/ce-sessions/evals/iteration-1/) showed ce-sessions preserved terminology strongly across all 4 evals with 100% must-tier recall and 0% stddev, but this is a capability test of the skill in isolation, not a test of any specific integration.
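The first grader stage and the variance protocol can be sketched as follows. This is an assumption-laden illustration rather than the suite's actual grader: `tier_recall` and `summarize_runs` are hypothetical names, only the "must" tier is named in the commit message (any other tier name is made up for the example), and case-insensitive matching and population standard deviation are choices made here, not confirmed details. The LLM-graded second stage is not sketched.

```python
from statistics import mean, pstdev


def tier_recall(output: str, expected: dict[str, list[str]]) -> dict[str, float]:
    """Stage 1: fraction of expected terms found by substring match, per tier."""
    lowered = output.lower()
    recall = {}
    for tier, terms in expected.items():
        if not terms:
            continue  # skip empty tiers to avoid dividing by zero
        hits = sum(1 for term in terms if term.lower() in lowered)
        recall[tier] = hits / len(terms)
    return recall


def summarize_runs(outputs: list[str], expected: dict[str, list[str]], tier: str = "must"):
    """Variance protocol: mean and standard deviation of one tier's recall across runs."""
    scores = [tier_recall(output, expected)[tier] for output in outputs]
    return mean(scores), pstdev(scores)
```

With three run outputs that each contain every must-tier term, `summarize_runs` returns a mean of `1.0` and a deviation of `0.0`, which corresponds to the "100% must-tier recall and 0% stddev" shape reported above.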
The synthesis gate at Phase 0.7 / 5.1.5 now fires reliably without depending on the synthesis-summary.md reference loading. The literal templates and the "silent proceeding is not allowed" rule are inlined in SKILL.md so the gate output appears even on a load-failure case. The gate block leads with a firm read-the-reference instruction so call-outs are still well-shaped when the load succeeds.
The inlined templates also drop phase-number jargon from user-facing text ("Phase 0.4 bootstrap" became "our brief discussion"), reduce a two-bullet placeholder that was biasing call-out count, and add purpose context to the Stated / Inferred / Out of scope bucket names.