feat(optimize): flag low-worth expensive sessions #247

Merged: iamtoruk merged 1 commit into main from feat/worth-it-score on May 6, 2026

@iamtoruk (Member) commented May 6, 2026

Supersedes #241, a cross-fork PR by @ozymandiashh. The original intent is preserved; this branch was built on current main, with the review fixes integrated into the #246 dedup pattern rather than layered on top of an older base.

Summary

Adds detectLowWorthSessions to codeburn optimize. Flags expensive sessions (≥$2 spend; ≥$3 if no edit turns) with weak delivery signals — no edits, repeated retries, or edit work that never landed in one shot — when no git/gh delivery command is observed in the bash history. Framed as review candidates, not proof of waste.

Detection model

  • $2 floor; $3 floor when "no edit turns" is the only signal
  • 3 retries to trip the retry reason; 2 retries with edits and zero one-shot edits to trip the "no one-shot edit turns" reason
  • categoryBreakdown aggregates preferred when present, falls back to raw turns
  • Delivery commands: git commit, git push, gh pr create, gh pr merge (excluding --dry-run in the same pipeline segment)
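The thresholds above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `SessionStats` shape, the reason strings as a union type, and the reading of "3 retries" and "2 retries" as "at least" are all assumptions.

```typescript
// Hypothetical, simplified session shape; the real codeburn types may differ.
interface SessionStats {
  costUSD: number;
  editTurns: number;
  oneShotEditTurns: number;
  retries: number;
  hasDeliveryCommand: boolean; // git commit / git push / gh pr create / gh pr merge seen
}

type LowWorthReason =
  | "no edit turns"
  | "repeated retries"
  | "no one-shot edit turns";

const MIN_COST_USD = 2; // general spend floor
const MIN_COST_NO_EDIT_ONLY_USD = 3; // floor when "no edit turns" is the only signal
const RETRY_THRESHOLD = 3; // assumed to mean "3 or more"
const NO_ONE_SHOT_RETRY_THRESHOLD = 2; // assumed to mean "2 or more"

// Returns the reasons a session is a review candidate; an empty list means "not flagged".
function lowWorthReasons(s: SessionStats): LowWorthReason[] {
  if (s.hasDeliveryCommand || s.costUSD < MIN_COST_USD) return [];

  const reasons: LowWorthReason[] = [];
  if (s.editTurns === 0) reasons.push("no edit turns");
  if (s.retries >= RETRY_THRESHOLD) reasons.push("repeated retries");
  if (
    s.editTurns > 0 &&
    s.oneShotEditTurns === 0 &&
    s.retries >= NO_ONE_SHOT_RETRY_THRESHOLD
  ) {
    reasons.push("no one-shot edit turns");
  }

  // Apply the higher floor when "no edit turns" stands alone.
  const onlyNoEdits = reasons.length === 1 && reasons[0] === "no edit turns";
  if (onlyNoEdits && s.costUSD < MIN_COST_NO_EDIT_ONLY_USD) return [];

  return reasons;
}
```

Under these assumptions, a $2.50 session with no edits and no retries is not flagged, while the same session at $3.50 is.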

Review fixes integrated on top of #241's commit

  1. Triple-detector dedup (extends #246, "feat(optimize): detect context-heavy sessions"). Priority order: low-worth → context-bloat → outliers. findLowWorthCandidates and findContextBloatCandidates build ID sets ahead of detection; detectContextBloat and detectSessionOutliers accept an excludedSessionIds param and filter accordingly. Real-data top-5 lists are now disjoint across all three findings.
  2. commit-tree regex false positive fixed. Used (?:\s|$|--) after commit|push instead of \b, so git commit-tree HEAD^{tree} and git commit-graph write are no longer treated as deliveries while git commit --amend still is. New tests cover both cases.
  3. Three impact tiers consistent with #246 ("feat(optimize): detect context-heavy sessions"): high (≥10 candidates OR ≥$50 total) · low (≤2 candidates AND <$10 total) · medium otherwise. Replaces the original binary tiering.
  4. Replaced the magic 0.5 token-savings ratio with a two-regime model:
    • No-edit sessions: full session token total (the session produced no apparent output to weigh against the spend).
    • Sessions with edits but with retries / no one-shot: the retry fraction of the session token total, i.e. (retries / totalTurns, clamped to [0,1]) × tokens. Edits may still have been useful; we credit the model with that and only flag the retry overhead.
  5. Fix-text differentiated from the outlier detector. Outlier still says "tighter constraint, smaller plan." Low-worth now says "name the deliverable in one sentence; stop after 10 minutes without an edit or 2 failures; no retries past 2 attempts on any single fix."

Validation

  • npx vitest run — 38 files, 529 tests pass (was 498 before, +31 new)
  • Real-data run on a 22.6K-session / $4.8K archive:
    • Top-5 lists across low-worth, context-bloat, outliers are completely disjoint
    • Headline savings: ~17% of spend (was 10% pre-low-worth; would've been 54% with the naive full-session ceiling; was triple-counted in the original PR)
    • Per-finding totals: low-worth $629, context-heavy $54, outliers $123
  • tsc --noEmit: pre-existing copilot.ts errors on this base, same as #246 had. CI doesn't run tsc (only semgrep). Resolved on the user's local feat branch but not yet on origin/main; not introduced by this PR.

Security

No new attack surface. Bash command strings come from the user's session telemetry; the regex is anchored, and the detector uses no shell, no eval, and no I/O.

Adds detectLowWorthSessions to the optimize pipeline. Flags expensive
sessions (>=$2 spend; >=$3 if no edit turns at all) with weak delivery
signals -- no edits, repeated retries, or edit work that never landed
in one shot -- when no git/gh delivery command is observed in the
session's bash history.

Built on top of the existing #246 dedup pattern. Priority order is
low-worth -> context-bloat -> outliers; each later detector excludes
sessions already named by an earlier one, so a single session is never
listed in more than one finding.
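The priority-ordered exclusion described above might look roughly like this. It is a simplified sketch: `Session`, `Detector`, `runWithDedup`, and `byMinCost` are hypothetical names, and the real code reportedly builds candidate ID sets ahead of detection and passes them as `excludedSessionIds` rather than threading one growing set through a loop.

```typescript
// Hypothetical minimal session shape for illustration only.
interface Session {
  id: string;
  costUSD: number;
}

// A detector receives all sessions plus the IDs already claimed by
// higher-priority detectors, and must skip those IDs.
type Detector = (
  sessions: readonly Session[],
  excludedSessionIds: ReadonlySet<string>,
) => Session[];

// Runs detectors in priority order; each later detector sees the IDs
// claimed by earlier ones, so the resulting findings are disjoint.
function runWithDedup(
  sessions: readonly Session[],
  detectorsInPriorityOrder: readonly Detector[],
): Session[][] {
  const claimed = new Set<string>();
  const findings: Session[][] = [];
  for (const detect of detectorsInPriorityOrder) {
    const found = detect(sessions, claimed);
    for (const s of found) claimed.add(s.id);
    findings.push(found);
  }
  return findings;
}

// Toy detector used only to demonstrate the pattern: flags sessions at
// or above a cost threshold that no earlier detector has claimed.
const byMinCost =
  (minCostUSD: number): Detector =>
  (sessions, excludedSessionIds) =>
    sessions.filter(
      (s) => s.costUSD >= minCostUSD && !excludedSessionIds.has(s.id),
    );
```

With three toy detectors at thresholds 8, 4, and 0, a $10 session is claimed by the first detector only, even though it would also satisfy the other two.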

Detection model:
- $2 floor; $3 floor when the only signal is "no edit turns".
- 3 retries to trip the retry reason; 2 retries with edit turns and
  zero one-shot edits to trip the "no one-shot edit turns" reason.
- categoryBreakdown aggregates are preferred when present; falls back
  to raw turns for older parsed sessions.

Delivery-command regex uses (?:\s|$|--) instead of \b after commit/push
to avoid false positives like `git commit-tree HEAD^{tree}` and
`git commit-graph write` while still matching `git commit --amend`.
A `--dry-run` lookahead in the same pipeline segment excludes preview
commands.
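One way to write such a pattern is sketched below. The exact regex in the PR is not shown here, so `DELIVERY_RE` and `isDeliveryCommand` are hypothetical; the sketch also assumes that `|`, `;`, and `&` delimit a pipeline segment.

```typescript
// Hypothetical reconstruction of a delivery-command matcher.
// - (?:\s|$|--) after commit/push rejects `git commit-tree` and
//   `git commit-graph`, which a plain \b would accept.
// - (?![^|;&]*--dry-run) rejects a match when --dry-run appears later
//   in the same pipeline segment (assumed to end at |, ;, or &).
const DELIVERY_RE =
  /\bgit\s+(?:commit|push)(?![^|;&]*--dry-run)(?:\s|$|--)|\bgh\s+pr\s+(?:create|merge)(?![^|;&]*--dry-run)(?:\s|$)/;

// True when the bash command string contains at least one delivery command.
function isDeliveryCommand(command: string): boolean {
  return DELIVERY_RE.test(command);
}

const samples = [
  "git commit --amend",
  "git commit-tree HEAD^{tree}",
  "git commit-graph write",
  "git push --dry-run origin main",
  "git push --dry-run && git push",
];
for (const cmd of samples) {
  console.log(`${isDeliveryCommand(cmd) ? "delivery" : "ignored "}  ${cmd}`);
}
```

The last sample counts as a delivery because the second `git push` sits in its own segment, past the `&&`, where no `--dry-run` follows.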

Three impact tiers consistent with detectContextBloat: high at
>=10 candidates or >=$50 total candidate spend; low at <=2 candidates
AND <$10 total; medium otherwise.
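The tiering rule can be written as a small pure function. The name `impactTier` and its parameters are illustrative rather than taken from the codebase; the thresholds are the ones stated above.

```typescript
type Impact = "high" | "medium" | "low";

// Maps a finding's candidate count and total candidate spend to a tier.
// The "high" check runs first, so it wins whenever either bound is met.
function impactTier(candidateCount: number, totalSpendUSD: number): Impact {
  if (candidateCount >= 10 || totalSpendUSD >= 50) return "high";
  if (candidateCount <= 2 && totalSpendUSD < 10) return "low";
  return "medium";
}
```

Note the asymmetry: either condition alone is enough for "high", but "low" requires both, so two candidates totalling exactly $10 land in "medium".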

Token-savings estimate replaces the original 0.5 magic ratio with two
defensible regimes:
- No-edit sessions: full session token total (the session produced
  no apparent output to weigh against the spend).
- Sessions with edits but with retries / no one-shot: retry fraction
  (retries / total turns) of the session token total. Edits may still
  have been useful; we credit the model with that and only flag the
  retry overhead.
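A rough sketch of the two regimes follows, assuming a hypothetical `SessionTokens` shape and function name. Rounding to whole tokens and the zero-turn guard are choices made for this illustration, not details confirmed by the PR.

```typescript
// Hypothetical per-session totals; field names are illustrative.
interface SessionTokens {
  totalTokens: number;
  totalTurns: number;
  editTurns: number;
  retries: number;
}

// Estimates the tokens that could plausibly have been saved.
function estimateSavableTokens(s: SessionTokens): number {
  // Regime 1: no edits at all, so the full session total is counted.
  if (s.editTurns === 0) return s.totalTokens;

  // Guard against division by zero (an assumption of this sketch).
  if (s.totalTurns <= 0) return 0;

  // Regime 2: only the retry share of the session is counted,
  // with the fraction clamped to [0, 1].
  const retryFraction = Math.min(1, Math.max(0, s.retries / s.totalTurns));
  return Math.round(retryFraction * s.totalTokens);
}
```

For example, a 100,000-token session with edits, 12 turns, and 3 retries yields a retry fraction of 0.25, so 25,000 tokens are counted rather than the full total.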

Fix-text differentiated from outlier detector's: low-worth focuses on
naming a deliverable up front and capping retry attempts; outliers
keeps the existing "tighter constraint" framing.

Tests:
- New describe block for detectLowWorthSessions (17 tests covering
  thresholds, reasons, delivery-command detection including
  commit-tree false-positive guard, dry-run handling, three impact
  tiers, retry-fraction savings model, full-session no-edit savings).
- One new test for detectContextBloat asserting it honors the
  excludedSessionIds parameter (not just outliers).
- Real-data run on 22.6K sessions: top-5 lists are now disjoint
  across all three per-session detectors, headline savings is 17% of
  spend (vs 10% pre-PR with just context-bloat dedup, vs 54% if we
  had used the full-session ceiling for low-worth).

@iamtoruk merged commit 75d4701 into main on May 6, 2026 (3 checks passed)
@iamtoruk deleted the feat/worth-it-score branch on May 6, 2026 07:35