fix(ce-plan): inline synthesis gate output into SKILL.md #822

Merged
tmchow merged 8 commits into main from tmchow/ce-plan-synthesis-gate
May 15, 2026
@tmchow tmchow commented May 12, 2026

The synthesis gate at Phase 0.7 / 5.1.5 now fires reliably without depending on the synthesis-summary.md reference loading. The literal templates and the "silent proceeding is not allowed" rule are inlined in SKILL.md so the gate output appears even on a load-failure case. The gate block leads with a firm read-the-reference instruction so call-outs are still well-shaped when the load succeeds.

The inlined templates also drop phase-number jargon from user-facing text ("Phase 0.4 bootstrap" became "our brief discussion"), collapse the two-bullet placeholder that was biasing call-out count into a single placeholder with explicit count guidance, and add purpose context to the Stated / Inferred / Out of scope bucket names.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a24e092abe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-plan/SKILL.md Outdated
tmchow added a commit that referenced this pull request May 12, 2026
…plate

Address PR review feedback (#822)

tmchow added 3 commits May 14, 2026 12:06
The Phase 0.7 / 5.1.5 synthesis gate was being skipped silently when
the synthesis-summary.md reference did not load — the templates and
mandatory-announce rule lived only there, behind a "STOP, read this"
indirection that the agent could (and did) skip. Move the load-bearing
pieces inline so the gate fires reliably even on a load-failure case,
and reorder the gate block so the reference-load instruction is the
first step. The reference now provides best-effort quality guidance
for call-out shaping; the gate itself no longer depends on it loading.

Also fix the inlined templates: replace "Phase 0.4 bootstrap" /
"Phase 1 research" with user-facing language (users do not track phase
numbers), reduce two-bullet placeholders to a single placeholder with
explicit count guidance (the multi-bullet placeholder biased toward a
fixed count), and add purpose context to the Stated / Inferred /
Out of scope bucket names so it is clear they drive plan-body routing
rather than chat output.
…plate

Address PR review feedback (#822)

The inlined solo template dropped the "(You can also redirect to
/ce-brainstorm if this is bigger than you initially thought...)"
parenthetical that the reference template carries. Inlining was
supposed to make the gate fire reliably without the reference loading,
so dropping the escape-hatch line from the inline copy weakened the
guardrail it was meant to preserve. Sync the inline template back to
the reference's wording.
Apply the shape and discipline changes from ce-brainstorm's
scoping-synthesis fix (#829) to ce-plan's Phase 0.7 / 5.1.5:

- Tier guard on auto-proceed: Lightweight + zero call-outs is the only
  path that skips the confirmation gate. Standard and Deep plans always
  fire the confirmation gate even with zero call-outs, because substance
  earns the checkpoint. A 1-3 line summary on a Deep plan is exactly
  the rubber-stamping case the gate is supposed to prevent.
- Confirmation phrasing names what happens on confirm ("Confirm and
  I'll proceed to research, drawing on this scope" / "Confirm and
  I'll write the plan next..."), replacing the ambiguous "Confirm
  to proceed."
- Detail test for each surviving call-out and summary bullet: 1-2
  lines max, conversational not documentary. The count cap was
  gameable without it -- three call-outs could each be a 6-line
  paragraph and still "fit."
- Re-cut rule extended to fire on detail overflow, not just count
  overflow.
- Summary form is flexible: prose, bullets, or mix, whichever
  communicates best. Tier-aware budgets (Lightweight 1-3 lines;
  Standard 3-5 lines or 2-4 bullets; Deep 4-6 lines or 3-6 bullets).
- Rename "Scope Summary" / "Synthesis Summary" to "Scoping Synthesis"
  for parity with ce-brainstorm's terminology.
- Soft-cut option wording updated per the parity note in #819 (the
  "redirect" verb collided with the unrelated self-redirect mechanism).

Skill doc updated -- the Quick Example referenced "short prose
summary" and "the gate skips when there are no forks worth flagging,"
both of which would mislead a reader under the new behavior.
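
As a rough illustration of the tier guard and the re-cut rule listed above, here is a minimal Python sketch. The names `Tier`, `fires_confirmation_gate`, `needs_recut`, and the `max_items` parameter are hypothetical; the skill expresses these rules as prose instructions in SKILL.md, not as code, and the commit message does not state the actual count cap.

```python
from enum import Enum


class Tier(Enum):
    LIGHTWEIGHT = "lightweight"
    STANDARD = "standard"
    DEEP = "deep"


# From the commit message: each call-out or summary bullet is 1-2 lines max.
MAX_ITEM_LINES = 2


def fires_confirmation_gate(tier: Tier, call_out_count: int) -> bool:
    """Lightweight + zero call-outs is the only path that auto-proceeds."""
    return not (tier is Tier.LIGHTWEIGHT and call_out_count == 0)


def needs_recut(item_line_counts: list[int], max_items: int) -> bool:
    """Re-cut on count overflow OR detail overflow.

    `max_items` is an assumed parameter; the real cap is not given here.
    """
    too_many = len(item_line_counts) > max_items
    too_long = any(n > MAX_ITEM_LINES for n in item_line_counts)
    return too_many or too_long


print(fires_confirmation_gate(Tier.DEEP, 0))
print(needs_recut([6, 6, 6], max_items=3))
```

The second call mirrors the gameable case named above: three call-outs fit a count cap of three, but each 6-line paragraph fails the detail test, so the re-cut still fires.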
@tmchow tmchow force-pushed the tmchow/ce-plan-synthesis-gate branch from 322c521 to 0c0afc8 May 14, 2026 19:07
tmchow added 4 commits May 14, 2026 12:41
The brainstorm-sourced synthesis was producing plan-pitch outputs that
read like a Table of Contents for the plan body: enumerating
Implementation Units, restating brainstorm constraints, and accounting
for how deferred-Qs route into plan sections. None of that gives the
user something to push back on; it just rubber-stamps work the
brainstorm already validated.

Restructure the brainstorm-sourced synthesis into two distinct
content sections plus call-outs:

1. Brainstorm-scope restatement (1-2 sentences). The user wrote
   this content, but the synthesis may be read days later or in
   parallel with other plans. The restatement is the topic anchor
   that names which artifact is being planned against, in the
   brainstorm's own vocabulary. Not an enumeration.
2. Plan-specific scoping (prose or bullets). What this plan covers
   vs. defers vs. expands relative to the brainstorm: staging
   decisions, test scope, adjacent refactors. This is the part the
   user can actively push back on at plan-time.

Solo plans have no upstream brainstorm, so the summary is a single scope claim.

Other changes:

- Tier budgets are reframed as ceilings, not targets. Filling the
  budget when there is nothing more substantive to say produces noise.
- Source-document vocabulary discipline: when a brainstorm exists,
  use its terms; do not invent agent-coded shorthand like
  "skill-instruction shape" or "hooks engine selection at Step 2a
  entry" that forces the user to flip back and translate.
- Both templates renamed and restructured to communicate the new
  shape via placeholder hints.
…scipline

A test run of the new two-paragraph synthesis still surfaced
plan-pitch leakage in three patterns the rules didn't yet block:

1. The agent claimed "one PR" — a sequencing decision plan-write
   produces, not something knowable at synthesis time.
2. "Plan-specific scoping" was enumerating where the implementation
   reaches into the codebase (file paths, Implementation Unit
   inventory) instead of stating scope-claim decisions.
3. Call-outs kept the 3-5 line "name fork, explain A, explain B,
   my default is X" rationale-dump shape, which is exactly what
   belongs in Key Technical Decisions in the plan body.

Encode the underlying rule explicitly: the synthesis is composed
before plan-write, so it can only surface what the agent knows from
the brainstorm + research + posture commitments. Implementation Unit
boundaries, PR count, commit/branch sequencing, effort estimates,
and exact file paths are all plan-write outputs the synthesis cannot
honestly claim. Even when the agent has formed plan-write opinions
earlier in the session, those stay internal until plan-write.

Other refinements:

- Reword plan-specific scoping from "what this plan covers vs defers
  vs expands" to "scope-level decisions" — the "covers" framing was
  pulling agents toward inventory.
- Make plan-specific scoping items pass the same affirmability test
  as call-outs: the user can affirm or redirect without reading code.
- Strengthen the call-out template placeholder to forbid
  multi-sentence rationale and "my default is X" pitches.
- Generalize the bare-ID anti-pattern in source-vocabulary discipline
  (AE4, R6, F3 all flip the user back to the brainstorm).
A test run showed two wording patterns the prior rules didn't block:

1. Bare ID references resurfaced ("AE1-AE3", "AE4", "AE5", "R6")
   even when the cases were already named in plain terms in the same
   sentence. Strengthen the source-vocabulary rule into a mechanical
   pre-emit scan: before emitting, look for `AE\d+`, `R\d+`, `F\d+`,
   `A\d+`, `U\d+` patterns and replace with plain names.
2. Numerical attestation ("all nine requirements, all three flows,
   all five acceptance examples") read as the agent showing its
   work — "covers the full brainstorm scope" already conveys the
   claim and the count adds nothing the user can affirm. Add as a
   named anti-pattern alongside synthesis-as-plan-pitch.

Both are wording-polish refinements on top of structural rules that
are now landing. Reference-only changes; no SKILL.md inline updates
needed since these refine quality, not gate firing.
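
The bare-ID scan in item 1 could look roughly like the sketch below. In the skill it is an instruction the agent follows before emitting, so this Python version only illustrates the pattern match. The prefixes come from the commit message; the function name `find_bare_ids` and the word-boundary anchoring are assumptions.

```python
import re

# Prefixes listed in the commit message: AE, R, F, A, U followed by digits.
# The \b anchors are an assumption, added so that text like "PR822" is ignored.
BARE_ID = re.compile(r"\b(?:AE|R|F|A|U)\d+\b")


def find_bare_ids(text: str) -> list[str]:
    """Return bare ID references in order of appearance."""
    return BARE_ID.findall(text)


draft = "Covers AE1-AE3 and R6; the export flow stays as-is."
for bare_id in find_bare_ids(draft):
    # Each hit should be replaced with the plain name before emitting.
    print(f"replace {bare_id} with its plain name")
```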
Fresh-session test still produced "The touch surface is X (subpaths), Y
(subpaths), Z..." enumeration in paragraph 2 plus "all 9 requirements"
numerical attestation. The rules forbidding both were in the reference
but the touch-surface prohibition was buried in a comma-separated list
of NOTs and the file-path scan didn't exist yet.

Promote both to load-reliable inline placement in SKILL.md and add the
file-path scan as a pre-emit mechanical check:

- "Do NOT enumerate the touch surface" gets its own bold inline
  paragraph in both Phase 0.7 and Phase 5.1.5. Names trigger phrases
  ("The touch surface is...", "This plan touches...", "The
  implementation reaches into...", "Files modified include...") so the
  agent recognizes the pattern even when the buried rule misses.
- Pre-emit scan rule expanded from bare-IDs-only to bare-IDs + file
  paths. Same mechanical shape: before emitting, scan for `path/like.md`
  / `path/like.py` patterns and cut unless the path IS the topic of an
  explicit fork in the call-outs.
- Reference section reorganized: source-vocabulary covers vocab choice;
  a separate "Pre-emit mechanical checks" bullet groups both scans
  with examples of allowed vs forbidden path usage.
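
A hedged sketch of the file-path scan follows. The regular expression is an assumption about what "`path/like.md` / `path/like.py` patterns" means (at least one slash plus a file extension), and `paths_to_cut`, `fork_topics`, and the example paths are hypothetical names used only for illustration.

```python
import re

# Rough heuristic: one or more "segment/" parts, then a name with an extension.
PATH_LIKE = re.compile(r"(?:[\w.-]+/)+[\w.-]+\.[A-Za-z0-9]+")


def paths_to_cut(text: str, fork_topics: frozenset[str] = frozenset()) -> list[str]:
    """Return path-like tokens that should be cut before emitting.

    A path survives only when it IS the topic of an explicit fork in the
    call-outs, modelled here as membership in `fork_topics`.
    """
    return [p for p in PATH_LIKE.findall(text) if p not in fork_topics]


draft = "This plan touches src/export/writer.py and docs/export.md."
print(paths_to_cut(draft))
print(paths_to_cut(draft, fork_topics=frozenset({"docs/export.md"})))
```

The heuristic will also flag slash-separated tokens that are not file paths, so a real check would likely need review by the agent rather than blind deletion.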
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08476de9a7


Comment thread plugins/compound-engineering/skills/ce-plan/SKILL.md Outdated
Address PR review feedback (#822)

I'd copy-pasted the same "Synthesis is pre-plan-write" rule into both
Phase 0.7 (solo) and Phase 5.1.5 (brainstorm-sourced), naming
"brainstorm + research + agent posture" as the inputs available at
synthesis time. That's correct for Phase 5.1.5 (which fires after
Phase 1 research and from an upstream brainstorm doc), but wrong for
Phase 0.7: the solo variant fires when no brainstorm exists AND
before Phase 1 research runs. Naming sources that aren't there can
push the agent to fabricate grounding or overstate confidence.

Phase 0.7 now names what's actually available — the user's request,
the Phase 0.4 bootstrap dialogue, and the agent's internal
three-bucket draft — and explicitly says Phase 1 research has not
happened yet and there is no upstream brainstorm. Phase 5.1.5
unchanged; its phrasing is accurate for that variant.
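
To make the mismatch concrete, the inputs available to each variant can be written as a small lookup, with a check for sources a rule names that do not exist at that point. This is only a sketch: the dictionary keys, the label strings, and `unavailable_sources` are hypothetical; the input lists themselves are taken from the commit message above.

```python
# Inputs available at synthesis time, per variant (from the commit message).
SYNTHESIS_INPUTS = {
    "0.7": {"user request", "bootstrap dialogue", "three-bucket draft"},
    "5.1.5": {"brainstorm", "research", "agent posture"},
}


def unavailable_sources(phase: str, named_sources: list[str]) -> list[str]:
    """Return the named sources that are not available in the given phase."""
    return sorted(set(named_sources) - SYNTHESIS_INPUTS[phase])


# The copy-pasted rule named the Phase 5.1.5 inputs inside Phase 0.7.
copied_rule = ["brainstorm", "research", "agent posture"]
print(unavailable_sources("0.7", copied_rule))
print(unavailable_sources("5.1.5", copied_rule))
```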
@tmchow tmchow merged commit 39cb9da into main May 15, 2026
2 checks passed
@github-actions github-actions Bot mentioned this pull request May 15, 2026
tmchow added a commit that referenced this pull request May 18, 2026
Adds an eval suite that tests whether ce-sessions findings preserve
terminology resolution context — specifically, whether distinctive
coined terms and their resolution rationale survive the
session-historian synthesis step intact.

Four test cases with ground truth from recently merged PRs:
- synthesis-gate-recovery (PR #822) — distinctive term recovery
- mode-headless-semantic-alignment (PR #813) — multi-piece nuance
- tangential-term-recovery — indexing-gap test
- near-miss-false-positive — discriminating-power test

Two-stage grader: programmatic substring match per criticality tier,
plus LLM-graded context preservation. Variance protocol: 3 runs per
eval.
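
The programmatic stage of such a grader might be sketched as below. This is not the suite's actual code: `tier_recall`, the `"should"` tier name, the case-insensitive matching, and the use of population standard deviation are all assumptions (only a "must" tier is named in this message). The LLM-graded stage is omitted.

```python
from statistics import pstdev


def tier_recall(output: str, expected: dict[str, list[str]]) -> dict[str, float]:
    """Fraction of expected terms found in the output, per criticality tier."""
    lowered = output.lower()
    return {
        tier: sum(term.lower() in lowered for term in terms) / len(terms)
        for tier, terms in expected.items()
        if terms  # skip empty tiers to avoid dividing by zero
    }


# Hypothetical ground truth and three hypothetical run outputs.
expected = {"must": ["synthesis gate"], "should": ["rubber-stamping"]}
runs = [
    "The synthesis gate fired.",
    "synthesis gate was skipped",
    "Synthesis Gate output inlined",
]

# Variance protocol: score each of the 3 runs, then look at the spread.
must_scores = [tier_recall(run, expected)["must"] for run in runs]
print(must_scores, pstdev(must_scores))
```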

This suite was built during PR #838's design exploration to validate
a load-bearing assumption (that ce-sessions findings could feed
ce-compound Phase 2.4's vocabulary scan). That assumption was
ultimately retired in favor of doc-and-conversation-only scanning,
so the suite is not load-bearing for PR #838. Kept as future
infrastructure for validating ce-sessions's behavior as the skill
evolves — e.g., when changing the session-historian synthesis prompt
or adjusting scan-window defaults.

Iteration-1 results (executed via skill-creator framework, captured
to /tmp/compound-engineering/ce-sessions/evals/iteration-1/) showed
ce-sessions preserved terminology strongly across all 4 evals with
100% must-tier recall and 0% stddev — but this is a capability test
of the skill in isolation, not a test of any specific integration.
LLMpsycho pushed a commit to LLMpsycho/compound-engineering-plugin that referenced this pull request May 21, 2026