fix(session-historian): cap deep-dives, add keyword filter primitive, tighten dispatch #699
… tighten dispatch
Sparse-history dispatches to ce-session-historian were running 17+ minutes
with 33 tool calls before stream-idle-timing-out, when the correct answer
("no relevant prior sessions") should arrive in under a minute. Six gaps
were compounding:
- No skill primitive for "search inventory by keyword across sessions",
forcing the agent to roll 20 by-hand `grep -l` invocations across MB-sized
JSONL files.
- Soft "typically 2-5 sessions" guidance with no hard cap.
- Tail-after-head extraction allowed unconditionally; ran reflexively on
half of selected sessions.
- Verbose dispatch prompt from /ce-compound with topic-keyword bullets that
licensed the agent to keep widening.
- Repo name not pre-resolved by caller; agent burned its first turn deriving
via git rev-parse.
- No wall-time budget anywhere in the agent.
This change:
- Adds an opt-in --keyword K1[,K2,...] mode to ce-session-inventory's
extract-metadata.py. When set, the script does a full-file case-insensitive
substring scan, filters out zero-match sessions, and emits per-session
match_count plus per-keyword counts. _meta gains files_matched.
- Tightens ce-session-historian: 5-7 min wall budget, hard cap of 5
deep-dives, conditional tail-extract, gitBranch start-of-session
limitation note, Step 3 #4 now points to --keyword instead of by-hand grep.
- Tightens /ce-compound: pre-resolves repo + branch via backtick syntax,
rewrites the historian dispatch block as a 5-field schema (pre-resolved
context, time window, problem topic, filter rule, output schema).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
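The keyword mode described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in extract-metadata.py: the function names, the `sessions` input shape, and the `session` / `keyword_matches` output fields are hypothetical, while `match_count` and `files_matched` are the names the commit message uses.

```python
def parse_keywords(arg: str) -> list[str]:
    """Split a 'K1,K2,...' argument into lowercase keywords, dropping blanks."""
    return [k.strip().lower() for k in arg.split(",") if k.strip()]


def keyword_counts(text: str, keywords: list[str]) -> dict[str, int]:
    """Count case-insensitive, non-overlapping substring hits per keyword."""
    haystack = text.lower()
    return {k: haystack.count(k) for k in keywords}


def filter_sessions(sessions: dict[str, str], keyword_arg: str) -> dict:
    """Keep sessions that match at least one keyword (OR semantics).

    `sessions` maps a session id to the text being scanned; that shape is an
    assumption made for this sketch.
    """
    keywords = parse_keywords(keyword_arg)
    results = []
    for session_id, text in sessions.items():
        counts = keyword_counts(text, keywords)
        total = sum(counts.values())
        if total == 0:
            continue  # zero-match sessions are filtered out
        results.append(
            {"session": session_id, "match_count": total, "keyword_matches": counts}
        )
    return {"sessions": results, "_meta": {"files_matched": len(results)}}
```

A caller can then gate on `_meta["files_matched"]` without inspecting individual sessions.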
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d753c08ea
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…d evals scaffolding
The first iteration on session-historian (this PR's prior commit) tightened
prose around extraction caps, conditional tail extraction, and a
keyword-search hint, but did not stop the agent's strongest default behavior:
when the candidate list is small enough to fit under the cap, the agent
extracts every returned session "to verify" rather than running the keyword
filter first. On sparse-mismatch dispatches that means deep-extracting 3
unrelated sessions instead of issuing one --keyword invocation that returns 0
matches in a single call.
This commit replaces the soft Step 3 priority list with an explicit numbered
decision sequence:
- Branch filter first.
- If branch filter is empty, run ce-session-inventory --keyword with keywords
derived from the dispatch problem topic.
- If files_matched is 0, return "no relevant prior sessions" and STOP --
ce-session-extract is not invoked.
Step 4 gains an explicit "only run if Step 3 produced selected sessions"
guard, and a new guardrail at the top forbids extraction-to-verify outright.
The time-budget block drops the 5-7 minute wording (which read as a target
rather than a max) in favor of "stop when complete; structural caps bound
runtime by construction."
Adds top-level evals/ for repo-only LLM-driven behavioral checks. The
session-historian eval covers the sparse-history scenario via synthetic
~/.claude/projects/-tmp-eval-... fixtures and a generic-subagent dispatch
pattern (inject the agent definition into a general-purpose subagent rather
than dispatching the typed agent, which reads cached definitions from session
start). Documents the cache caveat in evals/README.md.
Validated on the sparse-mismatch scenario:
- Before: 4 tool calls (inventory + 3 deep extracts), agent ignored keyword
filter even with explicit prose guidance.
- After: 2 tool calls (inventory + --keyword filter), zero deep extractions,
correct response. Wall time 55s, under the 60s soft target.
Full results in evals/session-historian/results-2026-04-25.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
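The numbered decision sequence above is agent prose rather than code, but its control flow can be sketched in Python to make the early exit explicit. Everything here is illustrative: the function name, the inventory record shape, and the injected `run_keyword_filter` callable are assumptions, and only `gitBranch`, `files_matched`, and the "no relevant prior sessions" response come from the commit message.

```python
from typing import Callable

NO_RELEVANT = "no relevant prior sessions"


def select_sessions(
    inventory: list[dict],
    branch: str,
    run_keyword_filter: Callable[[], dict],
) -> dict:
    """Return selected sessions, or a final response when nothing is relevant."""
    # 1. Branch filter first.
    branch_hits = [s for s in inventory if s.get("gitBranch") == branch]
    if branch_hits:
        return {"selected": branch_hits, "response": None}

    # 2. Branch filter is empty: one keyword-filter call, no extraction yet.
    result = run_keyword_filter()

    # 3. Zero matches: answer and stop; nothing is extracted "to verify".
    if result["_meta"]["files_matched"] == 0:
        return {"selected": [], "response": NO_RELEVANT}

    return {"selected": result["sessions"], "response": None}
```

In the sparse-mismatch case this path costs exactly one keyword-filter call after the inventory call, which matches the "After: 2 tool calls" figure reported above.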
…ll changes
When iterating on a plugin agent or skill, dispatching the typed agent in the
same session does not test repo edits: the agent definition is loaded once at
session start and cached in memory. The previous PR commit added an evals/
scaffold but pointed to "restart the session" as the validation path, which
is correct but heavy. The right tool is the skill-creator skill: it spawns a
generic subagent and injects the agent or skill content into the subagent's
prompt at dispatch time, so each run reads from current disk and iteration
works inside a single session.
Adds a "Validating Agent and Skill Changes" section to repo-root AGENTS.md
that names skill-creator as the primary path, calls out the session-start
caching gotcha, and explicitly warns against editing ~/.claude/plugins/ to
try to force a reload; that path was tried during this work, did not bypass
the cache, and is not a valid testing technique.
Updates evals/README.md and the session-historian eval docs to point to
skill-creator first, restart as a fallback only, and to reframe the
"file-sync under ~/.claude" narrative so it doesn't read as a recommended
approach.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79add10c20
…scaffolding
Two changes that emerged from validating the prior commits:
1. --keyword was scanning JSONL metadata, not user/assistant content. Common
topic words like "session" matched every file via the sessionId field —
on a 4-session keyword-match fixture, only 1 had real topical content,
but all 4 returned non-zero match_count from metadata noise. extract-metadata.py
now extracts user message text + assistant text blocks first and scans
only those, skipping JSONL field names, tool_use blocks (tool names + inputs),
tool_result blocks, and thinking blocks. Verified on the same fixture:
files_matched drops from 4 to 1, only the real match remains.
Adds 3 tests:
- sessionId / gitBranch / parentUuid as keywords return zero matches
against the Claude fixture (would have all matched under the prior impl)
- "Edit" as a keyword does not match against tool_use names in the fixture
- "auth" still matches against actual user/assistant text content
2. Drops the evals/ scaffolding added in earlier commits. The skill-creator
skill is the canonical tool for evaluating agent and skill changes — it
has its own conventions (evals/evals.json, <skill>-workspace/) and a
purpose-built workflow. The repo-local scaffolding under evals/ was a
one-off investigation artifact that doesn't conform to skill-creator's
shape and would silently rot when agent definitions change.
The durable lessons from that work are kept in repo-root AGENTS.md
(skill-creator as the canonical path; agents AND skills both cache at
session start; never edit ~/.claude/plugins/ to test). Removes the
evals/ entry from the AGENTS.md Directory Layout section in the same
commit so the directory reference doesn't outlive the directory.
The two positive-path scenarios (branch-match success, keyword-match without
branch-match) were both validated during this work via the skill-creator
dispatch pattern. Results captured in the PR description and commit history;
they don't need a committed results doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
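The content-only scan from item 1 can be illustrated with a short sketch. The record shape used here (a top-level `type` of `user` or `assistant`, with `message.content` either a string or a list of typed blocks) is an assumption about the Claude session JSONL format, and the function is a simplified stand-in rather than the code in extract-metadata.py.

```python
import json


def extract_user_assistant_text(jsonl: str) -> str:
    """Collect only user message text and assistant text blocks.

    Field names such as sessionId, and tool_use / tool_result / thinking
    blocks, never reach the returned corpus, so they cannot produce matches.
    """
    parts: list[str] = []
    for line in jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the scan
        if not isinstance(record, dict):
            continue
        if record.get("type") not in ("user", "assistant"):
            continue
        message = record.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if isinstance(content, str):
            parts.append(content)
        elif isinstance(content, list):
            for block in content:
                # Only plain text blocks are scanned.
                if isinstance(block, dict) and block.get("type") == "text":
                    text = block.get("text")
                    if isinstance(text, str):
                        parts.append(text)
    return "\n".join(parts)
```

Because the keyword counter only ever sees the returned string, a keyword such as "sessionId" or "Edit" matches only if a user or the assistant actually wrote it.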
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b65f188b5
- Pre-resolved repo name: distinguish absolute vs relative output from
git rev-parse --git-common-dir. The previous case-on-".git" check failed from
a normal repo subdirectory (where the command returns ../.git, not .git or an
absolute path), making the prompt resolve to ".." instead of the repo name.
Applied to ce-compound, ce-sessions, and the ce-session-historian agent's
Step 1 example.
- extract-metadata.py: defer the full-file --keyword scan until after
--cwd-filter passes. Previously process_file ran the keyword scan before the
cwd_filter check, which on Codex (cross-repo discovery) wasted scanning on
sessions immediately discarded by the filter, recreating the long-runtime
behavior this work is trying to eliminate.
- extract-metadata.py: emit files_matched: 0 in the empty-input _meta branch
when --keyword was supplied. Without it, no-result keyword scans were
ambiguous to the historian's "files_matched: 0 -> stop" rule.
- Tests: added a Codex cross-repo-filtered + --keyword regression test and an
empty-stdin + --keyword test covering the previously-uncovered no-input
branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
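The absolute-vs-relative distinction in the first bullet can be shown with a small Python sketch. The PR itself uses a shell `case`; this is only a rendering of the same idea, with a hypothetical function name, and it assumes a non-bare repository whose common directory is the `.git` folder directly under the repo root.

```python
import posixpath


def repo_name_from_common_dir(common_dir: str, cwd: str) -> str:
    """Derive the repo name from `git rev-parse --git-common-dir` output.

    `common_dir` is the command's output and `cwd` is the directory it ran in.
    """
    if not posixpath.isabs(common_dir):
        # Relative output (".git" at the root, "../.git" one level down) is
        # resolved against the invoking directory before taking the parent.
        common_dir = posixpath.join(cwd, common_dir)
    common_dir = posixpath.normpath(common_dir)
    return posixpath.basename(posixpath.dirname(common_dir))
```

Resolving the relative path before taking the parent directory is what avoids the ".." result the bullet describes for subdirectory invocations.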
- Strip Codex <system_instruction>...</system_instruction> wrapper from the
keyword-scan corpus in _extract_user_assistant_text. Codex prepends the
wrapper to event_msg.user_message payloads (e.g., "You are working inside
Conductor."), and counting matches against that text produced false positives
on environment-label terms. Mirrors the existing split in
extract-skeleton.py.
- Test: searching the Codex fixture for "Conductor" now returns zero matches,
since "Conductor" only appears inside the wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
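One way to strip the wrapper before counting is a non-greedy regular expression, sketched below. The helper name is hypothetical and the actual split in `_extract_user_assistant_text` may be implemented differently; the sample message is modeled on the example quoted in the commit.

```python
import re

# Non-greedy match so several wrappers in one payload are removed separately;
# DOTALL lets the wrapper body span multiple lines.
_SYSTEM_INSTRUCTION = re.compile(
    r"<system_instruction>.*?</system_instruction>", re.DOTALL
)


def strip_system_instruction(text: str) -> str:
    """Remove wrapper blocks so only the user's own words are scanned."""
    return _SYSTEM_INSTRUCTION.sub("", text).strip()
```

With the wrapper removed, a keyword like "Conductor" only matches when the user's own message mentions it.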
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a8291047a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Reorder Step 3 sub-steps in ce-session-historian.agent.md: drop
out-of-window sessions and exclude the current session BEFORE applying the
5-session deep-dive cap. Previous order capped first, which could discard all
in-window candidates when high-scoring older sessions occupied the cap slots,
leaving the agent to falsely return "no relevant prior sessions" even when
valid in-window matches existed further down the candidate list. Tie-breaker
rules (branch-match → match_count → file size → recency) and STOP semantics
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
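The effect of the reordering is easiest to see in code. The sketch below is illustrative only, since the real logic is agent prose: the candidate field names (`id`, `size`, `last_activity`) are hypothetical, and the direction of the file-size tie-breaker (larger first) is an assumption the commit message does not specify.

```python
from datetime import datetime, timedelta

MAX_DEEP_DIVES = 5  # hard cap named in this PR


def pick_deep_dives(
    candidates: list[dict],
    current_session_id: str,
    window_start: datetime,
    branch: str,
) -> list[dict]:
    """Filter first, then rank, then cap."""
    # Drop out-of-window sessions and the current session BEFORE capping.
    eligible = [
        c
        for c in candidates
        if c["id"] != current_session_id and c["last_activity"] >= window_start
    ]
    # Tie-breakers: branch-match, match_count, file size, recency.
    # Larger-file-first is an assumption made for this sketch.
    eligible.sort(
        key=lambda c: (
            c.get("gitBranch") == branch,
            c.get("match_count", 0),
            c.get("size", 0),
            c["last_activity"],
        ),
        reverse=True,
    )
    return eligible[:MAX_DEEP_DIVES]
```

If the cap were applied before the window filter, five high-scoring sessions from a month ago would fill every slot and a single in-window match would never be considered.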
… tighten dispatch (EveryInc#699)
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Summary
`ce-session-historian` was running 17+ minutes with 33 tool calls and stream-idle-timing-out on dispatches with no relevant prior sessions, where the correct answer should arrive in under a minute. The agent had no way to filter sessions by content efficiently, treated repo membership as sufficient relevance signal, and received a verbose dispatch prompt that licensed it to keep widening the search.
Fixes #696
What changed
`ce-session-inventory` gains a `--keyword` filter
`extract-metadata.py` adds an opt-in `--keyword K1[,K2,...]` mode that scans only user and assistant text content (not JSONL metadata, tool calls, or thinking blocks), filters out zero-match sessions, and emits per-session `match_count` plus per-keyword counts. This replaces the pattern of hand-rolling per-file `grep -l` invocations across multi-MB JSONL files.
The keyword scan runs only after `--cwd-filter` passes, so cross-repo Codex sessions are not scanned just to be discarded. The empty-input branch emits `files_matched: 0` when `--keyword` was supplied, so callers that gate on it can short-circuit cleanly. For Codex sessions specifically, the `<system_instruction>...</system_instruction>` wrapper is stripped before counting, so environment terms like "Conductor" do not false-match.
`ce-session-historian` adds a relevance gate before extraction
Step 3 is now an explicit numbered decision sequence: branch filter first; if zero candidates, run `--keyword` and stop on `files_matched: 0` without extracting; otherwise apply the hard cap of 5 deep-dives and extract. A new top-level guardrail (Never extract a session to verify whether it is relevant) makes the gate binding, since prose-level priority lists did not prevent extract-to-verify behavior. Tail extraction is now conditional, only invoked when `head:200` terminates mid-investigation. The `gitBranch` caveat (captured at the first user message only, so branch-miss is not conclusive) is documented inline. The time budget drops the minute target in favor of "stop when complete"; structural caps in Step 3 and Step 4 bound runtime by construction.
`/ce-compound` dispatch tightens
The Session Historian dispatch in Phase 1 was a long context block with topic-keyword bullets that licensed widening. The new dispatch is a 5-field schema: pre-resolved repo and branch, 7-day window, one-sentence problem topic, one-line filter rule, fixed output schema. Pre-resolution uses a `case` on absolute-vs-relative output from `git rev-parse --git-common-dir`, which correctly handles repo root, subdirectory, and linked-worktree invocations.
AGENTS.md gains a "Validating Agent and Skill Changes" section
The section documents how to test agent and skill changes correctly: use the `skill-creator` skill, which spawns a generic subagent and injects content from disk at dispatch time. Both plugin agents and skills cache at session start, so dispatching the typed agent or invoking via the Skill tool inside the same session tests cached pre-edit content. Editing `~/.claude/plugins/cache/` or `~/.claude/plugins/marketplaces/` to force a reload is explicitly called out as wrong.
Test plan
`tests/session-history-scripts.test.ts` adds 12 new cases under `--keyword mode` covering single-keyword filtering, zero-match exclusion, OR semantics, case insensitivity, content-only scanning (sessionId / gitBranch / tool names do not false-match), CWD-filter ordering with `--keyword`, empty-input emits `files_matched: 0`, and the Codex `system_instruction` strip. `bun test` passes 951/951; `bun run release:validate` clean.
Agent prose changes are validated via the `skill-creator` pattern documented in the new AGENTS.md section: spawn a `general-purpose` subagent with the agent definition injected from disk. Three scenarios were exercised during this work (sparse-mismatch, branch-match success, keyword-match without branch-match), all behaved as designed.
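Two of the script behaviors covered by the test plan, the CWD-filter ordering and the empty-input `files_matched: 0` emission, can be sketched as below. This is a simplified illustration under stated assumptions: the signatures, the exact-equality cwd comparison, and the `files_processed` field are hypothetical, and only `process_file`, `match_count`, `files_matched`, and `_meta` are names that appear in the PR.

```python
from typing import Callable, Optional


def process_file(
    meta: dict,
    cwd_filter: Optional[str],
    keywords: Optional[list[str]],
    read_text: Callable[[], str],
) -> Optional[dict]:
    """Apply the cheap cwd check first; scan content only if it passes."""
    if cwd_filter is not None and meta.get("cwd") != cwd_filter:
        return None  # discarded before any content is read
    if keywords is None:
        return meta  # keyword mode is opt-in
    haystack = read_text().lower()
    count = sum(haystack.count(k.lower()) for k in keywords)
    if count == 0:
        return None  # zero-match sessions are excluded
    return {**meta, "match_count": count}


def build_meta(results: list[dict], keywords: Optional[list[str]]) -> dict:
    """Emit files_matched whenever keyword mode is on, even with no results."""
    meta = {"files_processed": len(results)}  # hypothetical field
    if keywords is not None:
        meta["files_matched"] = len(results)
    return meta
```

Passing the file reader as a callable keeps the expensive read lazy, so a session rejected by the cwd check costs no I/O, and an empty run in keyword mode still yields an unambiguous `files_matched: 0`.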