fix(session-historian): cap deep-dives, add keyword filter primitive, tighten dispatch #699

Merged: tmchow merged 7 commits into main from fix/ce-session-historian-sparse-history on Apr 26, 2026

@tmchow tmchow commented Apr 26, 2026

Summary

ce-session-historian was running 17+ minutes with 33 tool calls and stream-idle-timing-out on dispatches with no relevant prior sessions, where the correct answer should arrive in under a minute. The agent had no efficient way to filter sessions by content, treated repo membership as a sufficient relevance signal, and received a verbose dispatch prompt that licensed it to keep widening the search.

Fixes #696

What changed

ce-session-inventory gains a --keyword filter

extract-metadata.py adds an opt-in --keyword K1[,K2,...] mode that scans only user and assistant text content (not JSONL metadata, tool calls, or thinking blocks), filters out zero-match sessions, and emits per-session match_count plus per-keyword counts. This replaces the pattern of hand-rolling per-file grep -l invocations across multi-MB JSONL files.
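The content-only scan described above can be sketched roughly as follows. This is an illustration, not the code in extract-metadata.py: the record shape (`type`, `message.content`, block `type` values) is an assumption about the JSONL layout, and `extract_text`, `count_keywords`, `summarize`, and the `keyword_counts` field name are hypothetical. Only `match_count` is a name taken from the PR.

```python
import json


def extract_text(record: dict) -> str:
    """Return only user/assistant prose from one record (assumed shape)."""
    if record.get("type") not in ("user", "assistant"):
        return ""
    message = record.get("message")
    if not isinstance(message, dict):
        return ""
    content = message.get("content", "")
    if isinstance(content, str):
        return content
    if not isinstance(content, list):
        return ""
    # Keep text blocks only; tool_use, tool_result, and thinking blocks are skipped.
    return "\n".join(
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    )


def count_keywords(jsonl_lines, keywords):
    """Case-insensitive substring counts per keyword over user/assistant text."""
    counts = {k: 0 for k in keywords}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupt line
        if not isinstance(record, dict):
            continue
        text = extract_text(record).lower()
        for k in keywords:
            counts[k] += text.count(k.lower())
    return counts


def summarize(jsonl_lines, keywords):
    """Return per-session counts, or None when no keyword matched (OR semantics)."""
    counts = count_keywords(jsonl_lines, keywords)
    total = sum(counts.values())
    if total == 0:
        return None  # zero-match sessions are filtered out
    return {"match_count": total, "keyword_counts": counts}
```

Because metadata fields such as sessionId never reach `extract_text`, a keyword like "sessionId" cannot match through field names or values in this sketch.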

The keyword scan runs only after --cwd-filter passes, so cross-repo Codex sessions are not scanned just to be discarded. The empty-input branch emits files_matched: 0 when --keyword was supplied, so callers that gate on it can short-circuit cleanly. For Codex sessions specifically, the <system_instruction>...</system_instruction> wrapper is stripped before counting, so environment terms like "Conductor" do not false-match.
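A minimal sketch of the ordering and the empty-input behavior described above, under stated assumptions: the session dict shape, the substring-style cwd check, and the injected `scan` callable are all hypothetical stand-ins, and `run_inventory` is not a function from the PR. The `_meta` and `files_matched` names come from the PR text.

```python
def run_inventory(sessions, cwd_filter=None, keywords=None, scan=None):
    """Filter sessions by cwd first, then run the (expensive) keyword scan.

    sessions: list of dicts with 'path' and 'cwd' keys (hypothetical shape).
    scan: callable(path, keywords) -> int, the per-file match count.
    """
    meta = {}
    if keywords:
        # Always present in keyword mode, so callers can gate on it
        # even when the input is empty.
        meta["files_matched"] = 0

    results = []
    for session in sessions:
        # Cheap check first: cross-repo sessions are dropped before any scan.
        if cwd_filter and cwd_filter not in session.get("cwd", ""):
            continue
        entry = {"path": session["path"]}
        if keywords:
            match_count = scan(session["path"], keywords)
            if match_count == 0:
                continue  # zero-match sessions are excluded
            entry["match_count"] = match_count
            meta["files_matched"] += 1
        results.append(entry)
    return {"sessions": results, "_meta": meta}
```

A caller following the "files_matched: 0 means stop" rule only needs to read `_meta`, never the session list.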

ce-session-historian adds a relevance gate before extraction

Step 3 is now an explicit numbered decision sequence: branch filter first; if zero candidates, run --keyword and stop on files_matched: 0 without extracting; otherwise apply the hard cap of 5 deep-dives and extract. A new top-level guardrail (Never extract a session to verify whether it is relevant) makes the gate binding, since prose-level priority lists did not prevent extract-to-verify behavior. Tail extraction is now conditional, only invoked when head:200 terminates mid-investigation. The gitBranch caveat (captured at the first user message only, so branch-miss is not conclusive) is documented inline. The time budget drops the minute target in favor of "stop when complete"; structural caps in Step 3 and Step 4 bound runtime by construction.
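The gate is written as agent prose, not code, but its control flow can be pictured as the sketch below. `step3_decide` and the shape of its return value are hypothetical; `run_keyword_inventory` stands in for a single ce-session-inventory --keyword call and is assumed to return a dict with a `_meta.files_matched` count.

```python
def step3_decide(branch_matches, run_keyword_inventory, cap=5):
    """Decide whether to extract, and which sessions, without extracting to verify."""
    # 1. Branch filter first: any branch-matched candidates go straight to the cap.
    if branch_matches:
        return {"action": "extract", "sessions": branch_matches[:cap]}

    # 2. No branch candidates: one keyword inventory call, nothing else.
    result = run_keyword_inventory()

    # 3. Zero matches is a final answer; no extraction is attempted.
    if result["_meta"]["files_matched"] == 0:
        return {"action": "stop", "sessions": [],
                "message": "no relevant prior sessions"}

    # 4. Otherwise apply the hard cap on deep-dives and extract.
    return {"action": "extract", "sessions": result["sessions"][:cap]}
```

The point of the structure is that the sparse-history path costs exactly one inventory call, which is what bounds runtime by construction.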

/ce-compound dispatch tightens

The Session Historian dispatch in Phase 1 was a long context block with topic-keyword bullets that licensed widening. The new dispatch is a 5-field schema: pre-resolved repo and branch, 7-day window, one-sentence problem topic, one-line filter rule, fixed output schema. Pre-resolution uses a case on absolute-vs-relative output from git rev-parse --git-common-dir, which correctly handles repo root, subdirectory, and linked-worktree invocations.
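The PR performs this resolution with a shell `case` that is not reproduced here. As a rough Python restatement of the same absolute-vs-relative idea, assuming the raw output of git rev-parse --git-common-dir and the invoking directory are already in hand (the function name `repo_name` is hypothetical):

```python
import posixpath


def repo_name(common_dir: str, cwd: str) -> str:
    """Derive the repo name from `git rev-parse --git-common-dir` output.

    common_dir is ".git" at the repo root, a relative path such as "../.git"
    from a subdirectory, and an absolute path from a linked worktree.
    """
    if posixpath.isabs(common_dir):
        git_dir = common_dir
    else:
        # Relative output is relative to the directory git was invoked from.
        git_dir = posixpath.normpath(posixpath.join(cwd, common_dir))
    # The repo root is the parent of the common .git directory.
    return posixpath.basename(posixpath.dirname(git_dir))


print(repo_name(".git", "/home/u/proj"))                         # repo root
print(repo_name("../.git", "/home/u/proj/src"))                  # subdirectory
print(repo_name("/home/u/proj/.git", "/home/u/worktrees/feat"))  # linked worktree
```

All three calls print `proj`. Note that `normpath` collapses `..` lexically and does not resolve symlinks, which is a simplification of this sketch.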

AGENTS.md gains a "Validating Agent and Skill Changes" section

The section documents how to test agent and skill changes correctly: use the skill-creator skill, which spawns a generic subagent and injects content from disk at dispatch time. Both plugin agents and skills cache at session start, so dispatching the typed agent or invoking via the Skill tool inside the same session tests cached pre-edit content. Editing ~/.claude/plugins/cache/ or ~/.claude/plugins/marketplaces/ to force a reload is explicitly called out as wrong.

Test plan

tests/session-history-scripts.test.ts adds 12 new cases under --keyword mode covering single-keyword filtering, zero-match exclusion, OR semantics, case insensitivity, content-only scanning (sessionId / gitBranch / tool names do not false-match), CWD-filter ordering with --keyword, empty-input emits files_matched: 0, and the Codex system_instruction strip. bun test passes 951/951; bun run release:validate clean.

Agent prose changes are validated via the skill-creator pattern documented in the new AGENTS.md section: spawn a general-purpose subagent with the agent definition injected from disk. Three scenarios were exercised during this work (sparse-mismatch, branch-match success, keyword-match without branch-match); all behaved as designed.



… tighten dispatch

Sparse-history dispatches to ce-session-historian were running 17+ minutes
with 33 tool calls before stream-idle-timing-out, when the correct answer
("no relevant prior sessions") should arrive in under a minute. Six gaps
were compounding:

- No skill primitive for "search inventory by keyword across sessions",
  forcing the agent to roll 20 by-hand `grep -l` invocations across MB-sized
  JSONL files.
- Soft "typically 2-5 sessions" guidance with no hard cap.
- Tail-after-head extraction allowed unconditionally; ran reflexively on
  half of selected sessions.
- Verbose dispatch prompt from /ce-compound with topic-keyword bullets that
  licensed the agent to keep widening.
- Repo name not pre-resolved by caller; agent burned its first turn deriving
  via git rev-parse.
- No wall-time budget anywhere in the agent.

This change:

- Adds an opt-in --keyword K1[,K2,...] mode to ce-session-inventory's
  extract-metadata.py. When set, the script does a full-file case-insensitive
  substring scan, filters out zero-match sessions, and emits per-session
  match_count plus per-keyword counts. _meta gains files_matched.
- Tightens ce-session-historian: 5-7 min wall budget, hard cap of 5
  deep-dives, conditional tail-extract, gitBranch start-of-session
  limitation note, Step 3 #4 now points to --keyword instead of by-hand grep.
- Tightens /ce-compound: pre-resolves repo + branch via backtick syntax,
  rewrites the historian dispatch block as a 5-field schema (pre-resolved
  context, time window, problem topic, filter rule, output schema).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d753c08ea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-compound/SKILL.md Outdated
tmchow and others added 2 commits April 25, 2026 22:24
…d evals scaffolding

The first iteration on session-historian (this PR's prior commit) tightened
prose around extraction caps, conditional tail extraction, and a keyword-search
hint, but did not stop the agent's strongest default behavior: when the
candidate list is small enough to fit under the cap, the agent extracts every
returned session "to verify" rather than running the keyword filter first. On
sparse-mismatch dispatches that means deep-extracting 3 unrelated sessions
instead of issuing one --keyword invocation that returns 0 matches in a single
call.

This commit replaces the soft Step 3 priority list with an explicit numbered
decision sequence:
- Branch filter first.
- If branch filter is empty, run ce-session-inventory --keyword with keywords
  derived from the dispatch problem topic.
- If files_matched is 0, return "no relevant prior sessions" and STOP --
  ce-session-extract is not invoked.

Step 4 gains an explicit "only run if Step 3 produced selected sessions"
guard, and a new guardrail at the top forbids extraction-to-verify outright.
The time-budget block drops the 5-7 minute target wording (which read as a
target rather than a max) in favor of "stop when complete; structural caps
bound runtime by construction."

Adds top-level evals/ for repo-only LLM-driven behavioral checks. The
session-historian eval covers the sparse-history scenario via synthetic
~/.claude/projects/-tmp-eval-... fixtures and a generic-subagent dispatch
pattern (inject the agent definition into a general-purpose subagent rather
than dispatching the typed agent, which reads cached definitions from
session start). Documents the cache caveat in evals/README.md.

Validated on the sparse-mismatch scenario:
- Before: 4 tool calls (inventory + 3 deep extracts), agent ignored keyword
  filter even with explicit prose guidance.
- After: 2 tool calls (inventory + --keyword filter), zero deep extractions,
  correct response. Wall time 55s, just under the 60s soft target. Full
  results in evals/session-historian/results-2026-04-25.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ll changes

When iterating on a plugin agent or skill, dispatching the typed agent in the
same session does not test repo edits — the agent definition is loaded once
at session start and cached in memory. The previous PR commit added an
evals/ scaffold but pointed to "restart the session" as the validation path,
which is correct but heavy. The right tool is the skill-creator skill:
it spawns a generic subagent and injects the agent or skill content into the
subagent's prompt at dispatch time, so each run reads from current disk and
iteration works inside a single session.

Adds a "Validating Agent and Skill Changes" section to repo-root AGENTS.md
that names skill-creator as the primary path, calls out the session-start
caching gotcha, and explicitly warns against editing ~/.claude/plugins/
to try to force a reload — that path was tried during this work, did not
bypass the cache, and is not a valid testing technique.

Updates evals/README.md and the session-historian eval docs to point to
skill-creator first, restart as a fallback only, and to reframe the
"file-sync under ~/.claude" narrative so it doesn't read as a recommended
approach.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79add10c20

…scaffolding

Two changes that emerged from validating the prior commits:

1. --keyword was scanning whole files, JSONL metadata included, rather than
   only user/assistant content. Common topic words like "session" matched
   every file via the sessionId field:
   on a 4-session keyword-match fixture, only 1 had real topical content,
   but all 4 returned non-zero match_count from metadata noise. extract-metadata.py
   now extracts user message text + assistant text blocks first and scans
   only those, skipping JSONL field names, tool_use blocks (tool names + inputs),
   tool_result blocks, and thinking blocks. Verified on the same fixture:
   files_matched drops from 4 to 1, only the real match remains.

   Adds 3 tests:
   - sessionId / gitBranch / parentUuid as keywords return zero matches
     against the Claude fixture (would have all matched under the prior impl)
   - "Edit" as a keyword does not match against tool_use names in the fixture
   - "auth" still matches against actual user/assistant text content

2. Drops the evals/ scaffolding added in earlier commits. The skill-creator
   skill is the canonical tool for evaluating agent and skill changes — it
   has its own conventions (evals/evals.json, <skill>-workspace/) and a
   purpose-built workflow. The repo-local scaffolding under evals/ was a
   one-off investigation artifact that doesn't conform to skill-creator's
   shape and would silently rot when agent definitions change.

   The durable lessons from that work are kept in repo-root AGENTS.md
   (skill-creator as the canonical path; agents AND skills both cache at
   session start; never edit ~/.claude/plugins/ to test). Removes the
   evals/ entry from the AGENTS.md Directory Layout section in the same
   commit so the directory reference doesn't outlive the directory.

The two positive-path scenarios (branch-match success, keyword-match without
branch-match) were both validated during this work via the skill-creator
dispatch pattern. Results captured in the PR description and commit history;
they don't need a committed results doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b65f188b5

tmchow and others added 2 commits April 25, 2026 23:25
- Pre-resolved repo name: distinguish absolute vs relative output from
  git rev-parse --git-common-dir. The previous case-on-".git" check failed
  from a normal repo subdirectory (where the command returns ../.git, not
  .git or an absolute path), making the prompt resolve to ".." instead of
  the repo name. Applied to ce-compound, ce-sessions, and the
  ce-session-historian agent's Step 1 example.
- extract-metadata.py: defer the full-file --keyword scan until after
  --cwd-filter passes. Previously process_file ran the keyword scan before
  the cwd_filter check, which on Codex (cross-repo discovery) wasted
  scanning on sessions immediately discarded by the filter — recreating
  the long-runtime behavior this work is trying to eliminate.
- extract-metadata.py: emit files_matched: 0 in the empty-input _meta
  branch when --keyword was supplied. Without it, no-result keyword scans
  were ambiguous to the historian's "files_matched: 0 -> stop" rule.
- Tests: added a Codex cross-repo-filtered + --keyword regression test
  and an empty-stdin + --keyword test covering the previously-uncovered
  no-input branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Strip Codex <system_instruction>...</system_instruction> wrapper from the
  keyword-scan corpus in _extract_user_assistant_text. Codex prepends the
  wrapper to event_msg.user_message payloads (e.g., "You are working inside
  Conductor."), and counting matches against that text produced false
  positives on environment-label terms. Mirrors the existing split in
  extract-skeleton.py.
- Test: searching the Codex fixture for "Conductor" now returns zero matches,
  since "Conductor" only appears inside the wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
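
The wrapper strip in the commit above can be illustrated with a short sketch. This is not the body of _extract_user_assistant_text; the helper name `strip_system_instruction` and the sample message are hypothetical, and the sketch assumes the wrapper appears as a literal, non-nested tag pair.

```python
import re

# Non-greedy match across newlines, so multi-line wrappers are removed whole.
_WRAPPER = re.compile(r"<system_instruction>.*?</system_instruction>", re.DOTALL)


def strip_system_instruction(text: str) -> str:
    """Drop the environment wrapper before counting keyword matches."""
    return _WRAPPER.sub("", text).strip()


message = (
    "<system_instruction>You are working inside Conductor.</system_instruction>\n"
    "Fix the login bug"
)
cleaned = strip_system_instruction(message)
print(cleaned)                                # Fix the login bug
print(cleaned.lower().count("conductor"))     # 0
```

With the wrapper removed first, an environment label that appears only inside it contributes nothing to the keyword counts.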

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9a8291047a

Comment thread plugins/compound-engineering/agents/ce-session-historian.agent.md Outdated
- Reorder Step 3 sub-steps in ce-session-historian.agent.md: drop
  out-of-window sessions and exclude the current session BEFORE applying
  the 5-session deep-dive cap. Previous order capped first, which could
  discard all in-window candidates when high-scoring older sessions
  occupied the cap slots — leaving the agent to falsely return "no
  relevant prior sessions" even when valid in-window matches existed
  further down the candidate list. Tie-breaker rules (branch-match →
  match_count → file size → recency) and STOP semantics unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
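
Why the order matters in the commit above can be shown with a small sketch. The agent file expresses this as prose steps; the function `select_candidates`, the candidate field names, and the choice to prefer larger files in the file-size tie-break are assumptions made for illustration. The 7-day window and the cap of 5 come from the PR.

```python
from datetime import datetime, timedelta


def select_candidates(candidates, now, current_session_id, window_days=7, cap=5):
    """Filter first, rank second, cap last."""
    cutoff = now - timedelta(days=window_days)
    # Drop out-of-window sessions and the current session BEFORE capping.
    eligible = [
        c for c in candidates
        if c["last_active"] >= cutoff and c["session_id"] != current_session_id
    ]
    # Tie-breakers in order: branch match, match_count, file size, recency.
    eligible.sort(
        key=lambda c: (c["branch_match"], c["match_count"],
                       c["size_bytes"], c["last_active"]),
        reverse=True,
    )
    return eligible[:cap]
```

If the cap were applied before the window filter, five high-scoring but stale sessions could fill every slot and a valid in-window match further down the list would never be considered.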
@tmchow tmchow merged commit a91270c into main Apr 26, 2026
2 checks passed
@github-actions github-actions Bot mentioned this pull request Apr 26, 2026
michaelvolz pushed a commit to michaelvolz/compound-engineering-plugin-windows-version that referenced this pull request Apr 28, 2026
… tighten dispatch (EveryInc#699)

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Development

Successfully merging this pull request may close these issues.

ce-session-historian inefficient on sparse-history dispatches (17min wall, 33 tool calls)
