All notable changes to tracker will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
-
build_productworkflow: closed Gap 7 from the #233 audit (interface-method reachability) (#233). The audit caught three Go interface methods defined and unit-tested but never called from production code:AuthStatus(ctx) error(Appendix A I9),IsRebaseInProgress() bool(I10),DiffStat(similar shape). Tests passed because the same agent wrote impl and tests; the workflow had no check that defined interface methods have a non-test caller. New mechanism:Setupnow writes.ai/build/iface-reachability-rubric.md— a shared discipline file mirroring PR #246'sci-probe.shpattern. Contains language detection (10 static-interface languages with skip-vacuously rule for Ruby / plain JS / Elixir / Zig / C / shell), per-language enumeration grep patterns, caller-discipline rules (call-syntax targeting, common-name receiver context, broad test-file exclusion glob, generated-code-counts-as-production), stdlib/framework satisfaction principle, single-sentence waiver discipline, library-API carve-out via.ai/decisions/library_api.md, and known-limitation skips (Rustdyn Trait, Haskell typeclasses, TS bracket-notation, Swift extension conformances, etc.) with named-reason discipline.FinalSpecCheckowns the check — repo-wide sweep, single owner, fires once per run at the goal-gate immediately before Cleanup/FinalCommit/Done. The prompt opens with an inverted STATUS contract: agent emitsSTATUS:failas the FIRST line of its response, then enumerates, then emitsSTATUS:successas the LAST line only if every check passes. This defends against theparseAutoStatusdefault-to-success-on-empty fail-open shape that is exactly the original Gap 7 bug — under last-line-wins parsing, a truncated response preserves the early fail. The pre-existingSTATUS:fail with specific list of gapsline was also fixed (parser requires the STATUS value to be exactlysuccess/fail/retry— trailing prose on the same line is silently discarded). 
Output is prose enumeration (no markdown table, since that introduced parser-fragility risk because STATUS tokens inside table cells aren't parsed).
- Reviewer rubric point 2 strengthened in `ReviewClaude`, `ReviewCodex`, `ReviewGemini`: from 3-line "name a caller" to 9-line "show the grep command, paste the output, cite file:line; same show-your-work standard as SPEC LITERALS at point 1." The heavy discipline lives in the shared rubric file the reviewers reference, keeping rubric balance with the other four points (per PR #249's design). `VerifyMilestone` is unchanged. Per-milestone iface checks were considered and rejected (squad review): the bug is a property of the terminal state, not the build process; per-milestone scoping creates a cross-milestone leak shape that would have required LLM-managed `.ai/pending_wiring.md` bookkeeping (~17% reliability after 5 milestones per parser-pragmatist analysis); and the workflow is fully automated, so "catch early" has no operational value when no human is debugging mid-run.
- New regression test: `TestParseAutoStatus_V3FailFirstContract` in `pipeline/handlers/codergen_test.go` pins the parser's last-line-wins / default-success-on-empty semantics that the new FinalSpecCheck STATUS contract relies on. Three subtests: terminal-success-wins, mid-check-fail-remains, no-STATUS-default-success.
- Workflow score on `dippin doctor examples/build_product.dip` stays A / 100/100, no new lint warnings. Gap 8 (`TestQuality` step) is closed in the next bullet below; Gap 5 (engine-level `auto_status` audit) remains the last Chunk C item.
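The inverted STATUS contract is easier to see against a toy parser. The sketch below is not tracker's `parseAutoStatus`; `parseStatus` is a hypothetical stand-in that assumes only the semantics stated in this entry (last well-formed STATUS line wins, no STATUS line defaults to success, values other than `success` / `fail` / `retry` are ignored). Treating a line with trailing prose as "not a STATUS line" is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStatus is a hypothetical stand-in for the parser semantics described
// in this entry: the last well-formed STATUS line wins, and a response with
// no STATUS line at all defaults to "success" (the fail-open case).
func parseStatus(response string) string {
	status := "success"
	for _, line := range strings.Split(response, "\n") {
		value, ok := strings.CutPrefix(strings.TrimSpace(line), "STATUS:")
		if !ok {
			continue
		}
		value = strings.TrimSpace(value)
		// Assumption: anything other than an exact value is ignored,
		// e.g. "STATUS:fail with specific list of gaps".
		if value == "success" || value == "fail" || value == "retry" {
			status = value
		}
	}
	return status
}

func main() {
	truncated := "STATUS:fail\nchecking interface methods"
	complete := truncated + "\nall checks passed\nSTATUS:success"

	fmt.Println(parseStatus(truncated)) // fail: the early line survives truncation
	fmt.Println(parseStatus(complete))  // success: the terminal line wins
	fmt.Println(parseStatus(""))        // success: the fail-open default
}
```

Opening with `STATUS:fail` means a response cut off mid-enumeration still parses as a failure, which is the property the regression test pins.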
-
build_productworkflow: closed Gap 8 from the #233 audit (test-quality smells) (#233). The audit caught five test-quality regressions that shipped green when the same agent wrote impl and tests: zero-assertion (goblin.starttest logs an empty SHA without asserting on it — W4), wrong-target (TestRun_SignalHandlingtests stdlibsignal.NotifyContextnot the daemon's handler — W5), DI bypass (tests calltime.Now()directly when aClockseam exists — W13), sleep-as-fence (TestLoop_BusyDropsuses threetime.Sleepcalls between phases — W17), subname collision (bytes_trimSpaceshadowsbytes.TrimSpaceunder Go's subtest case-fold — W21). The audit's failure mode was reviewers handwaving past prose-only checks; this PR mitigates by adding shown-work demands at the layer that catches each smell most reliably:FinalSpecCheckgrows a TEST QUALITY section for sleep-as-fence only — one new ~30-line block between INTERFACE REACHABILITY (Gap 7) and SPEC.md compliance. Per-language sleep-class greps (Gotime.Sleep/<-time.After, Python(time|asyncio|trio|anyio|gevent).sleep, JS/TSawait sleep/setTimeout/waitForTimeout/cy.wait, Rustthread::sleep/tokio::time::sleep/async_std::task::sleep, Rubysleep/Kernel.sleep, Java/KotlinThread.sleep/delay/Mono.delayvia portablefind ... -exec grep ... {} +(POSIX-portable; avoids the GNU-onlyxargs -rflag flagged by Codex review)). Each hit needs disposition: (a) the sleep IS the SUT — cite the SPEC.md timing-contract section file:line, OR (b) replaced by a deterministic primitive — cite the primitive's introduction, OR (c) waiver per.ai/decisions/*.mdnaming the test with smell-specific rationale citing a SPEC.md section. "Intentional timing test" without a spec citation is blanket — FAIL. Covers W17.- Reviewer rubric point 3 strengthened across all three reviewers (
ReviewClaude,ReviewCodex,ReviewGemini). Two changes: (1) The Gemini-only sentence "Tests that only validate standard-library or third-party-library behavior instead of the project's own logic are FAIL" is promoted to ReviewClaude and ReviewCodex — covers W5 across all three lanes. (2) Each reviewer's point 3 now demands shown-work grep evidence for two semantic smells the audit specifically found: (a) zero-assertion test bodies, (b) DI bypass totime.Now/rand.Read/ stdlib-IO when production code defines a Clock/Random/IO seam. Cite the grep command and its output for each hit AND for the empty-result case (same shown-work standard as point 1 SPEC LITERALS, per PR #249). Reviewer also audits.ai/decisions/*.mdwaivers and verifies cited SPEC.md sections actually contain content supporting the waiver's rationale (defeats the "cite a real section that doesn't actually relate" forgery shape). Covers W4 and W13. - Legacy
FinalSpecCheckSTATUS tail fixed. The pre-PR-#254 framing at the bottom of FinalSpecCheck (If fully compliant: STATUS:success / If not: emit STATUS:fail) contradicted PR #254's inverted contract opening — under last-line-wins parsing, a passing SPEC.md section reached mid-survey could override the earlySTATUS:failfor INTERFACE REACHABILITY or TEST QUALITY. The tail is rewritten to require all three sections to pass before emitting terminalSTATUS:success, with single-source-of-truth reference to the existing allowlist above (not inline re-enumeration). - What's intentionally NOT in this PR. W21 (Go subtest case-fold collision) is explicit non-goal — Go-specific lexical bug, golangci-lint territory. No chmod+sha tripwire on Gap 7's rubric file. No integrity preamble. No SPEC.md SHA verification. No DI-bypass cross-check folded into Gap 7's rubric heredoc. No inline TEST QUALITY rubric enumerating Smell 1 (zero-assertion) at FinalSpecCheck. The audit found honest LLM oversight, not adversarial mutation — defense is sized to the observed threat model. Three squad-review rounds taught us what to trim; the shipped PR is ~83 net prompt lines (40 sleep-as-fence block + 33 reviewer-rubric appends + 6 Gemini-sentence promotion + 4 STATUS tail rewrite) — well below v4's ~165-200 and v1's ~330.
- Spec: `docs/superpowers/specs/2026-05-26-gap8-test-quality-design.md` (v5). Plan: `docs/superpowers/plans/2026-05-26-gap8-test-quality-plan.md`.
- With Gap 8 landed, seven of eight #233 audit gaps are closed (1, 2, 3, 4, 6, 7, 8). Gap 5 (engine-level `auto_status` audit + `OutcomeHumanOverride` + re-run `FinalSpecCheck` after `ApplyReviewFixes`) remains as the final Chunk C work before `release: v0.31.0` closes out #233.
-
build_productworkflow: closed Gaps 2 + 4 from the #233 audit (reviewer rubric overhaul +VerifyMilestonereads SPEC.md) (#233). PR #246 shipped the cheap trio (Gaps 1, 3, 6); this PR is the next chunk. Additional audit findings from #233 Appendix A are now caught upfront by these two gaps, in addition to what PR #246 already catches (see per-gap Appendix A maps below — these overlap with the #246 set on B1, B3, B5).- Gap 2 — Cross-review prompts (
ReviewClaude,ReviewCodex,ReviewGemini,SynthesizeReviews) overhauled. Pre-#233 the three reviewer prompts were 4-8 lines each of free-form focus areas. On the offending runReviewGeminireturned "faithful and high-quality realization — PASS / PASS / PASS / PASS" because it read the spec's own ✅ markers as evidence, missing the API-shape blocker (B1), the off-by-one retries (B3), thefingerprintsscope creep (B4), the redmake lint(B6), and the dead interface methods (I9, I10).SynthesizeReviewsthen weighted findings by vote count, so a 2-vote PASS could drown a 1-vote FAIL with concrete evidence — synthesis missed ~33 of the 38 real audit findings. Four changes:- All three reviewers now carry the same 5-point structured rubric (spec literals grep, interface reachability, test-verifies-contract, scope, architecture & leftovers). Each rubric question requires concrete evidence (grep output, file:line, snippet) for any FAIL — free-form lane-specific focus areas come AFTER the rubric, not instead of it.
- "Do not trust spec ✅ markers" warning at the top of every reviewer prompt. Single sentence; would alone have fixed the Gemini failure mode on the offending run.
ReviewGeminiretargeted to explicit adversarial / steel-man role. Pre-#233 Gemini's lane was "intent vs letter / advice respected / performance / UX" — the same vague focus that produced the PASS / PASS / PASS / PASS report. Post-#233 Gemini's job is to find what's wrong even when everything looks fine; "faithful and high-quality realization" is a forbidden phrase. Claude stays generalist (missing requirements, architectural violations, leftover artifacts), Codex stays quality-focused (test coverage, edge cases, regression risk). Each reviewer contributes a distinct angle but none can skip the rubric.SynthesizeReviewsweights by evidence, not vote count. A single reviewer with grep / file:line evidence of a contract-level FAIL now wins over two evidence-free PASSes — STATUS:fail flips on one evidence-backed finding, not on majority. The synthesis document now has a dedicated "Evidence-backed findings" section listing single-reviewer concrete-evidence flags so they can't be silently dropped into "Disputed".- Catches B1, B3, I3, I9, I10, W3, W6, W7, W8, W10, W14, W16, W18, W19, W20, W22 from #233 Appendix A — every contract-level finding that needed grep / file:line evidence to surface.
- Gap 4: `VerifyMilestone` reads SPEC.md, runs explicit grep checks, applies the test-asserts-contract check. Pre-#233 the verifier only read `.ai/milestones/current.md` and accepted "tests pass" as evidence of completion, inheriting the implementing agent's blind spot. If `Decompose` dropped a spec requirement during milestone planning, the verifier had no path to discover the gap. New responsibilities:
  - Read SPEC.md directly (not just the milestone notes). The verifier now cross-checks every SPEC.md bullet intersecting this milestone's file list, whether or not the milestone notes mention it.
  - Run spec-literal greps inline. For every literal value in the spec sections this milestone covers, grep the implementation and paste the command + result into the verification report. Missing literals are FAIL.
  - Apply the test-asserts-contract check. For each test touched by the milestone, the verifier asks: "If the production code were deleted and rewritten differently but spec-conformantly, would this assertion still pass for the right reason?" The off-by-one `attempts == 2` pattern is called out by name as a FAIL pattern. Tests asserting fields marked "DO NOT implement" in the milestone (Gap 6 affordances) are FAIL.
  - Confirms project CI + tests both passed by reading `${ctx.tool_stdout}` (`TestMilestone` already ran both per Gap 1; the verifier does NOT re-run them, just gates on the evidence).
  - Walks the diff for out-of-scope work: files / functions / fields the milestone didn't ask for.
  - Catches B1, B5, I1, I2, I4, I5, W11, W12 from #233 Appendix A: every contract-level finding that should have been caught BEFORE cross-review.
- Workflow score on `dippin doctor examples/build_product.dip` stays A / 100/100, 25 nodes, 49 edges, no new lint warnings. The remaining #233 gaps (5 engine-level `auto_status` audit, 7 interface reachability, 8 `TestQuality` step) are queued for a follow-up PR.
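The evidence-over-votes rule from Gap 2 can be sketched as a small function. `SynthesizeReviews` is an LLM prompt, not Go code, so everything here (`Finding`, `synthesize`) is a hypothetical model of the rule. The handling of an evidence-free FAIL is an assumption of the sketch, since this entry only specifies the evidence-backed case.

```go
package main

import "fmt"

// Finding is a hypothetical shape for one reviewer's verdict.
type Finding struct {
	Reviewer string
	Pass     bool
	Evidence string // grep output or file:line; empty means none was shown
}

// synthesize returns "fail" as soon as any FAIL carries concrete evidence,
// regardless of how many reviewers voted PASS.
// Assumption: an evidence-free FAIL does not flip the status on its own.
func synthesize(findings []Finding) string {
	for _, f := range findings {
		if !f.Pass && f.Evidence != "" {
			return "fail"
		}
	}
	return "success"
}

func main() {
	findings := []Finding{
		{Reviewer: "ReviewClaude", Pass: true},
		{Reviewer: "ReviewCodex", Pass: true},
		{Reviewer: "ReviewGemini", Pass: false, Evidence: "retry_test.go:42: attempts == 2"},
	}
	// Two evidence-free PASSes lose to one evidence-backed FAIL.
	fmt.Println(synthesize(findings)) // fail
}
```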
-
dippin-lang dependency bumped v0.28.0 → v0.29.0 (#250). Picks up the three tool-routing follow-ups deferred from #247 (closes dippin-lang#42, #43, #44):
- dippin#42 (DIP138 / lint suppression): dippin-lang now suppresses DIP101 / DIP102 coverage warnings on tool nodes that declare `marker_grep:`; the typed routing channel via `ctx.tool_marker` is now recognized as exhaustive in the same way `ctx.outcome = success/fail` pairs are. Workflows that route on `_TRACKER_ROUTE=<marker>` no longer get false-positive grade drops on `dippin doctor`. Optional new advisory DIP138 fires when a tool node uses conditional edges on `ctx.tool_stdout` but declares neither `marker_grep:` nor `outputs:`, pointing authors at the typed routing primitive. Pure dippin-internal change; no tracker code needed.
- dippin#43 (parseBoolAttr normalization): `goal_gate:`, `auto_status:`, `cache_tools:`, and `route_required:` now accept canonical truthy/falsy forms (`true`/`false`, `1`/`0`, `yes`/`no`, `on`/`off`, case-insensitive). Pre-v0.29.0 only `"true"` parsed as true; `yes` / `1` / `TRUE` silently coerced to `false`, a foot-gun on `route_required` especially, where false-negative parsing silently disabled the runtime safety check. Anything outside the accepted set emits a parse diagnostic. Pure dippin-internal change; no tracker code needed.
- dippin#44 (Outputs DOT round-trip + adapter passthrough): `ir.ToolConfig.Outputs` (the comma-separated `outputs:` field declaring the tool's possible stdout values for coverage analysis) is now emitted to DOT (`outputs="pass,fail"`) and parsed back by `dippin migrate`. Tracker's adapter (`pipeline/dippin_adapter.go::extractToolAttrs`) gains the matching passthrough so the field reaches `node.Attrs["outputs"]` for both `.dip` and DOT inputs. The runtime doesn't consume `outputs` yet; this PR is plumbing only, paving the way for future output-set validation. Wire contract matches dippin's `applyToolOutputsAttrs`: `strings.Join(cfg.Outputs, ",")`, omitted when empty. Two new tests (`TestExtractToolAttrs_OutputsForwarded` unit and `TestFromDippinIR_ToolConfigOutputs` end-to-end) follow the v0.28.0 pattern from PR #248. `PinnedDippinVersion` in `tracker_doctor.go` bumped in lockstep so `TestPinnedDippinVersionMatchesGoMod` passes and `tracker doctor`'s dippin-version check matches. All three example pipelines (`ask_and_execute`, `build_product`, `build_product_with_superspec`) remain A grade on `dippin doctor`.
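The boolean normalization in dippin#43 amounts to a small lookup. The sketch below is not dippin-lang's `parseBoolAttr`; `parseBoolSketch` is a hypothetical function that accepts only the forms listed above and returns an error where the real parser emits a diagnostic. Trimming surrounding whitespace is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBoolSketch is a hypothetical illustration of the accepted set:
// true/false, 1/0, yes/no, on/off, case-insensitive.
// Assumption: surrounding whitespace is trimmed before matching.
func parseBoolSketch(raw string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "true", "1", "yes", "on":
		return true, nil
	case "false", "0", "no", "off":
		return false, nil
	}
	// Anything else is rejected loudly instead of coercing to false.
	return false, fmt.Errorf("invalid boolean %q: want true/false, 1/0, yes/no, or on/off", raw)
}

func main() {
	for _, raw := range []string{"TRUE", "yes", "off", "maybe"} {
		v, err := parseBoolSketch(raw)
		fmt.Println(raw, v, err)
	}
}
```

Returning an error for unknown input is the part that closes the foot-gun: a typo in `route_required:` can no longer quietly disable the check.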
-
dippin-lang dependency bumped v0.27.0 → v0.28.0 (#247). Picks up the new `ir.ToolConfig` fields `MarkerGrep`, `RouteRequired`, and `OutputLimit` (closes dippin-lang#39). `pipeline/dippin_adapter.go::extractToolAttrs` now forwards all three to `node.Attrs` using the same wire-contract names dippin-lang's DOT exporter emits (`marker_grep`, `route_required="true"`, `output_limit=<int>`), so DOT ⇄ Dippin IR round-trips stay stable. The tracker runtime already consumes these attrs (`pipeline/node_config.go` for read, `pipeline/engine_run.go` for `EventToolMarkerMissing` / `EventToolRouteMissing` emission); this PR is the missing adapter passthrough. Pure pass-through: no runtime behavior change, fully backward compatible (workflows that don't declare the new fields produce identical adapter output). `PinnedDippinVersion` in `tracker_doctor.go` bumped in lockstep so `tracker doctor`'s dippin-version check matches.
-
build_productworkflow: closed three gaps from the #233 audit (Gaps 1, 3, 6 — the "cheap trio") (#233). The audit ranbuild_productend-to-end on a real Phase 1 spec and found 38 issues the workflow declared "Done" on — including red CI, a wrong-shape OpenAI request body, an off-by-one retry count, and a Phase-6 feature shipped in Phase 1 with green tests pinning the wrong behavior. This change addresses the three lowest-cost, highest-leverage gaps inexamples/build_product.dip; the remaining five gaps (reviewer rubric overhaul,VerifyMilestonereading SPEC.md, engine-levelauto_statusaudit, interface-method reachability, and aTestQualitystep) are queued for a follow-up.- Gap 1 — project CI is now part of the gate.
TestMilestoneandFinalBuildpreviously only ran the language-stack default (go build && go test,npm test, etc.) and silently passed even when the project's ownmake lint/golangci-lintwas red. Both nodes now probe for aMakefile(alsomakefile,GNUmakefile) and run the first target that's defined out ofci/check/lint, gating on its exit. Detection parses the Makefile directly viased s/#.*//(strip comments) piped to anawkthat: (a) skips tab-prefixed recipe lines, (b) skips variable assignments by checking whether the first:is part of:=/::=/:::=via^:+=, then (c) tokenizes everything before the first:and looks for TARGET as a whitespace-delimited token — soci:,ci check lint:, andci: VAR := overriddenare all correctly detected whileci := valueand substring collisions likebuild-ci:are not. The initial draft of this PR usedmake -n <target>as the existence probe, but PR #246 review (Codex P1, CodeRabbit Major, four rounds of Copilot) flagged in succession: (1) GNU Make returns 0 when the target name happens to match a file or directory on disk (so aci/orlint/directory would false-positive), (2) a missingmakebinary or a Makefile parse error would collapse silently to "target absent" — letting CI-red projects through anyway, (3) the grep replacement missed multi-target rules likeci check lint:, and (4) it false-positived onci := valuevariable assignments. The current sed+awk parser sidesteps all four. When a Makefile is present butmakeisn't installed, the probe now fails loud instead of skipping the gate. When no Makefile or no matching target exists, the node emits anINFO: no project CI target in <Makefile>(orINFO: no Makefile present) line so operators readingtracker diagnosecan see the gap rather than assuming CI passed.FinalBuild's timeout bumped from 300s → 600s to accommodate the additional lint pass on real projects. Trade-off accepted: the parser misses targets defined only in included files; CI/lint/check are by convention declared in the root Makefile. - Gap 3 —
Implementprompt now anchors on spec literals and test quality. Added three rule blocks. (a) "Spec literals are contracts": for every literal value in the spec section (exact command strings, JSON keys, header names, integer constants, log keys), grep the implementation byte-for-byte before committing — silent paraphrase (--head <branch>vs spec's--head <owner>:<branch>) is the most common way milestones ship wrong. (b) "Tests verify the contract, not your code": every assertion must still pass for the right reason if the production code is deleted and rewritten spec-conformantly. The canonical failure mode is assertingattempts == 2when the spec says "max 2 retries" = 3 attempts — the test green-lights the off-by-one because it was written from the code. (c) "Snapshot tests must be hand-verified": the FIRST version of every golden file must be anchored to the spec, not regenerated from current output; only use UPDATE_GOLDEN after that. - Gap 6 —
Decomposenow produces an explicit "DO NOT implement" list per milestone, andImplementreads it. Added a new milestone field**DO NOT implement**:that names Phase 2+ features the spec defers but whose supporting types/functions live in this milestone's file list.Implementreads.ai/milestones/current.md's DO NOT lines before writing in any file and leaves those affordances inert (no wiring into call sites, no populating fields the spec marks as empty in this phase, no non-default return values from deferred helpers). Closes the entire "wrote-future-phase-with-green-tests" failure class — e.g., thefingerprintsfield that the audit found populated in Phase 1 despite the spec deferring it to Phase 6 would now appear as a DO NOT line in any Phase 1 milestone touching the trailer file. - Workflow score on
`dippin doctor examples/build_product.dip` is still A / 100/100, no new lint warnings; 25 nodes, all reachable.
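The Gap 1 target detection is implemented in the workflow as a `sed` + `awk` pipeline. The Go sketch below re-expresses the same three rules for illustration only; `hasTarget` and `firstCITarget` are hypothetical names, and reading "first target that's defined out of ci / check / lint" as a fixed priority order is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// hasTarget reports whether the Makefile text declares target as a rule
// target, following the rules described in the Gap 1 entry.
func hasTarget(makefile, target string) bool {
	for _, line := range strings.Split(makefile, "\n") {
		// Strip comments (the sed step).
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		// (a) Skip tab-prefixed recipe lines.
		if strings.HasPrefix(line, "\t") {
			continue
		}
		colon := strings.IndexByte(line, ':')
		if colon < 0 {
			continue
		}
		// (b) Skip variable assignments: the first ':' starts :=, ::= or :::=.
		if strings.HasPrefix(strings.TrimLeft(line[colon:], ":"), "=") {
			continue
		}
		// (c) Look for target as a whitespace-delimited token before the first ':'.
		for _, tok := range strings.Fields(line[:colon]) {
			if tok == target {
				return true
			}
		}
	}
	return false
}

// firstCITarget returns the first of ci, check, lint that the Makefile defines.
func firstCITarget(makefile string) (string, bool) {
	for _, t := range []string{"ci", "check", "lint"} {
		if hasTarget(makefile, t) {
			return t, true
		}
	}
	return "", false
}

func main() {
	mk := "ci := value\nbuild-ci:\n\tgo build ./...\ncheck lint: deps # quality gates\n\tgolangci-lint run\n"
	fmt.Println(firstCITarget(mk)) // check true
}
```

Like the parser it mirrors, this sketch only sees the text it is given, so targets defined in included files are missed.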
tracker validateno longer prints every DIP1XX lint warning twice (#244). Two parallel emission paths fed the samevalidator.Lint()diagnostics through the CLI:loadDippinPipelineprinted each diagnostic in long form (with--> file:line:collocation and= help: ...suggestion) to stderr, andpipeline/validate.go'svalidateGraphfolded the pre-formatted single-lineGraph.LintWarningsstrings intoValidationError.Warnings, whichprintValidationResultthen re-emitted on stdout. The summary count was correctly de-duplicated (it counted vialintResult.Diagnostics, not the concatenation) but the printed list was not — a workflow with 5 DIP1XX warnings showed 10 warning lines. Fix is a print-time dedup at the CLI layer (issue's Option 3):printValidationResultbuilds a set fromgraph.LintWarningsand skips matching entries when iteratingresult.Warnings, leaving only tracker-side semantic warnings (e.g.validateConditionalFailEdges,validateEdgeLabelConsistency) on stdout. The long-form stderr diagnostic from the loader is the sole user-visible copy. The fix is deliberately NOT a removal ofve.Warnings = append(ve.Warnings, g.LintWarnings...)invalidateGraph: non-CLI consumers ofpipeline.ValidateAll/ValidateAllWithLint(tracker.ValidateSource,tracker_doctor.go::checkPipelineFile+checkPipelineBundle,cmd/tracker-conformance) rely onve.Warningsas the single source of pipeline warnings and would silently lose DIP1XX signal otherwise (caught in PR #245 review by Codex P2). Two new regression tests:TestValidateNoDuplicateLintWarningscaptures both stdout and stderr (via anos.Pipe-redirectedos.Stderrwith a single-pass cleanup path and a goroutine that closes its read end and surfacesio.ReadAllerrors viat.Fatalf) and asserts DIP warnings appear on stderr and are not re-emitted on stdout;TestValidateLintWarningsStillInWarningsChannelpins the API contract —pipeline.ValidateAllmust continue to expose DIP1XX warnings viaValidationError.Warningsso non-CLI consumers keep seeing them.
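The print-time dedup described above is a set-difference over warning strings. A minimal sketch follows, with a hypothetical `semanticOnly` helper standing in for the filtering inside `printValidationResult`; the real function's signature and the warning texts shown here are not taken from the source.

```go
package main

import "fmt"

// semanticOnly returns the entries of all that are not already present in
// lint, preserving order. It leaves both input slices untouched, so library
// consumers of the full warnings list are unaffected.
func semanticOnly(all, lint []string) []string {
	seen := make(map[string]struct{}, len(lint))
	for _, w := range lint {
		seen[w] = struct{}{}
	}
	var out []string
	for _, w := range all {
		if _, dup := seen[w]; dup {
			continue // already printed in long form by the loader
		}
		out = append(out, w)
	}
	return out
}

func main() {
	lint := []string{"DIP108: unknown model"}
	all := []string{"DIP108: unknown model", "edge label mismatch"}
	fmt.Println(semanticOnly(all, lint)) // [edge label mismatch]
}
```

Filtering at print time rather than removing the append is what keeps the full list intact for non-CLI callers.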
- dippin-lang dependency bumped v0.26.0 → v0.27.0 (#242). Picks up the 2026-05-18 model/pricing catalog refresh and the grok-4-1-fast-* redirect fix (callable model IDs survive their target's rename). No IR or adapter changes; drop-in. `PinnedDippinVersion` in `tracker_doctor.go` updated in lockstep so `tracker doctor`'s dippin-version check matches. Combined with v0.29.1's lint deduplication, DIP108 now covers the full current catalog: workflows using `gemini-3-flash-preview`, redirected grok IDs, and other recent models validate clean.
- Defer all DIP-coded lint to dippin-lang (#239, #240). Deleted
pipeline/lint_dippin.go+pipeline/lint_dippin_extra.go(660+ lines covering DIP101–DIP112, DIP120, DIP121) — every one of those checks was already implemented indippin-lang/validator.Lint(), which tracker has been calling at.dip/.dipxload time since v0.16. The duplicates had drifted: tracker'sknownProviderModelscatalog hadn't been updated past Gemini 2.5, so any pipeline usinggemini-3-flash-previewor other current model names produced a false-positive DIP108 warning fromtracker validateeven thoughdippin doctoraccepted the same file cleanly. Tracker's local DIP120/DIP121 also semantically collided with dippin-lang's DIP120/DIP121 (different checks under the same code numbers).validateNodeAttributes(which re-checked typed IR fields likemax_retries,cache_tool_results,context_compaction,context_compaction_threshold) is now gated behind!g.DippinValidated, matching the existing pattern for DIP001–DIP009 structural checks — so for.dipsources tracker fully defers to dippin's typed IR + DIP116 lint instead of re-validating with different (and stricter) semantics. After this change, dippin-lang is the sole authority for DIP-coded lint and adding a new model to the catalog only requires one edit. Tracker keepsLintTrackerRules(TRK1XX) — those encode tracker-runtime concerns (64KB tool-output cap, tail-window routing-marker pitfalls) that don't belong upstream. Lint warnings still surface intracker validate/simulate/doctoroutput via the newGraph.LintWarningsfield, whichLoadDippinWorkflowFromIRpopulates fromvalidator.Lint()andValidateAllappends to its warnings channel — so the user-visible "Validation Warnings" section is unchanged except that the warnings now come from the current dippin-lang catalog instead of tracker's stale copy.
-
Workflow header `requires: <list>` for environmental dependencies (#234). Workflows can now declare prerequisites at the top of the `.dip` file via a comma-separated list (e.g. `requires: git`). v0.29.0 implements `git`: when a workflow declares `requires: git`, tracker verifies `git` is installed AND the working directory is a git repository before any node executes. Unrecognized entries (`docker`, `gh`, `jq`, etc.) warn and continue, so workflow authors can forward-declare dependencies that future tracker versions will check. The mechanism lives at the library + CLI boundary, not inside the engine: `pipeline.Preflight` is invoked once at run start; subgraph and `manager_loop` children inherit the parent's check. Requires dippin-lang v0.26.0 (dippin-lang#35, #36).
-
--git=auto|off|warn|require|initCLI flag to override the policy per run. Defaultautorespects the workflow'srequires:declaration.--git=offbypasses all git checks (escape hatch).--git=warndowngrades a hard failure to a warning and continues.--git=requireforces the check even when the workflow doesn't declare it.--git=init(with mandatory--allow-initlatch in non-interactive runs, or a[Y/n]prompt in interactive runs) auto-runsgit initin the workdir followed by an empty initial commit (ephemeral-c user.name=tracker -c [email protected], so the user's git config is not mutated). The initial commit means the resulting repo is immediately worktree-ready: the built-in workflows that rungit worktree add ... HEADearly on (ask_and_execute,build_product_with_superspec) would otherwise pass preflight against an unborn-HEAD repo and crash deep in setup after burning user / LLM steps. Safety refusals fire for$HOME,/, and nested repos — including linked worktrees (where.gitis a file, not a directory), submodules (same), and bare repos (no.gitat all). The$HOMEand root refusals use a case-aware comparison (case-insensitive on Windows soC:\Users\Bobandc:\users\bobboth trip the latch). The repo-satisfaction probe (checkGit, "wouldgit commitwork here?") usesgit -C <dir> rev-parse --is-inside-work-tree, so bare repos correctly classify as not-a-repo forrequires: gitpurposes — work-tree-only operations would fail in a bare repo. The nested-repo safety latch (safetyLatches, "is this any kind of git context wheregit initwould create a confusing duplicate?") usesgit -C <dir> rev-parse --git-dir, which catches bare repos, linked worktrees, AND submodules. Two different probes for two different questions, neither a parent-directory walk for.git. All git probes run withLANG=C LC_ALL=C LANGUAGE=C(pipeline.GitProbeEnv()) so the"not a git repository"stderr classifier is stable on localized git installations. -
`tracker doctor` Git Requires check previews what would happen at run start for the current dir + workflow + flags. Status maps to the policy: `OK` (workflow satisfied, OR auto-init would succeed under `--git=init --allow-init`), `Error` (hard-fail under auto/require/init), `Warn` (downgrade under `--git=warn`), `Skip` (under `--git=off`). The check's `Hint` carries the exact remediation command (`git init`, `tracker <workflow> --git=init --allow-init`, install instructions).
-
Library API: `tracker.Config.Git *GitConfig` for embedded callers. Zero value resolves to `GitPreflightAuto`. The `GitPreflight` constants are re-exported on the `tracker` package as type aliases of `pipeline.GitPreflight` so consumers don't need to import the pipeline package. `tracker.WithGitConfig(policy, allowInit)` is the equivalent functional option for `tracker.Doctor`. `pipeline.SafetyLatches(ctx context.Context, workDir string) error` is also exported so callers can preview auto-init outcomes without depending on `runAutoInit`; ctx threads into the underlying git subprocesses so a canceled caller aborts cleanly. New `tracker.NewEngineWithContext(ctx, source, cfg)` is the context-aware constructor; `tracker.Run(ctx, ...)` and other ctx-aware callers should use it to get end-to-end cancellation coverage of the v0.29.0 git preflight, including the `--git=init` side effect.
-
`tracker workflows` now shows a REQUIRES column per built-in workflow.
-
Built-in workflows that commit / branch / merge mid-run declare `requires: git`: `ask_and_execute`, `build_product`, and `build_product_with_superspec`. Running them in a non-git directory now fails in seconds with a copy-paste remediation message (`git init`, `--git=off`, `tracker <workflow> --git=init --allow-init`), instead of burning $20–$100 of LLM spend before failing at the first git operation.
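How a `requires:` list splits into checked and forward-declared entries can be sketched as follows. `classifyRequires` is a hypothetical helper, not tracker's preflight code; it assumes only what the entries above state: the list is comma-separated, `git` is the one recognized entry in v0.29.0, and anything else warns and continues.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyRequires splits a comma-separated requires list into entries that
// are actually checked and entries that only produce a warning.
func classifyRequires(list string) (checked, unrecognized []string) {
	for _, item := range strings.Split(list, ",") {
		name := strings.TrimSpace(item)
		switch name {
		case "":
			continue // tolerate stray commas (an assumption of this sketch)
		case "git":
			checked = append(checked, name)
		default:
			unrecognized = append(unrecognized, name) // warn and continue
		}
	}
	return checked, unrecognized
}

func main() {
	checked, unrecognized := classifyRequires("git, docker, jq")
	fmt.Println(checked, unrecognized) // [git] [docker jq]
}
```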
- dippin-lang dependency bumped v0.25.0 → v0.26.0. Picks up `ir.Workflow.Requires []string` and the parser / formatter support for the `requires:` workflow header keyword.
-
manager_loopnow propagates ctx-cancellation errors instead of silently returning(OutcomeFail, nil)(#227). When the parent context was cancelled,ManagerLoopHandler.Execute's poll loop had a select race between<-ctx.Done()and<-resultCh: the child engine's handler returnedctx.Err(), the engine wrapped it viafmt.Errorf("handler error at node %q: %w", ...)(preservingerrors.Is(..., context.Canceled)), and sent{result: &EngineResult{Status: OutcomeFail}, err: <wrapped-cancellation>}toresultCh— making both select arms ready. When<-resultChwon,handleChildResult'smsg.result != nilnon-success branch silently discardedmsg.err(designed for strict-failure-edges informational errors) and returned(OutcomeFail, nil)— visually indistinguishable from a normal child failure that conditional edges could route on. Surfaced as aTestManagerLoopHandler_CtxCancellationflake (3/5 runs failed undergo test ./... -short -count=1parallelism) but the underlying bug affected production: atracker runwith amanager_loopnode + Ctrl+C during the first poll cycle could see the handler return a clean "fail" outcome and route through conditional failure edges before the parent engine's next-loopctx.Err()check fired. Fix: targeted cancellation guard at the top ofhandleChildResult— scoped onctx.Err() != nil(i.e. the manager_loop's own ctx was canceled), NOT on the shape ofmsg.err, so a child handler's owncontext.WithTimeoutfiring while the manager_loop ctx is alive still routes through normal failure edges as an ordinary child-internal timeout. Companion fix inexecuteNodecapturesoutcome.ChildUsageinto the trace entry even when the handler returns a non-nil error, so cancelled child runs contribute their accumulated spend to the parent'sAggregateUsageandBudgetGuardrollup. Non-cancellationmsg.err(strict-failure-edges) still drops through to the existing path unchanged. 
Cancellation event audit message normalized to `ctx.Err()` (matches the `<-ctx.Done()` arm; pre-fix the two paths emitted different lines for the same observable event). Four deterministic unit tests pin the contract: parent-ctx cancellation, parent-ctx deadline, child-internal `DeadlineExceeded` with parent-ctx alive (P1 regression guard), and non-cancellation `msg.err` pass-through.
-
`executeNode` now captures `ChildUsage` from handler outcome even on handler error. Previously, when a handler returned both a non-nil `ChildUsage` and a non-nil error (e.g. the `manager_loop` cancellation path), the engine's error branch set `traceEntry.Status = "error"` and added the entry without setting `traceEntry.ChildUsage`, silently dropping the child's token/cost data from `AggregateUsage` and `BudgetGuard` rollups.
-
Preflight + Doctor now catch unborn HEAD (no commits) up front (PR #235 round 7, Copilot:3260568737; probe choice refined in PR #237 round 8, Copilot:3260797018 + CodeRabbit:3260803531 + Codex:3260803910).
`git rev-parse --is-inside-work-tree` returns true for a `git init`'d repo that has no commits, so pre-fix a `requires: git` workflow could pass preflight against an unborn-HEAD repo and crash mid-run on `git worktree add ... HEAD`, `git merge`, or `git log` after burning LLM turns. The new `pipeline.HasBornHEAD` probe runs `git rev-parse --verify HEAD^{commit}` after `--is-inside-work-tree` succeeds; `^{commit}` forces commit peeling so a HEAD pointing at a non-commit OID (dangling/corrupt) doesn't masquerade as born. Stderr inspection via `isUnbornHEADStderr` matches the two upstream-stable phrases (`"Needed a single revision"`, `"unknown revision or path not in the working tree"`) to distinguish the benign unborn case from real failures (corrupt refs, permissions); corruption-class errors surface as wrapped errors rather than collapsing to "unborn." On unborn HEAD Preflight returns `ErrGitUnbornHEAD` and Doctor reports Error with copy-paste remediation (`git commit --allow-empty -m initial` for an empty baseline, or `git add . && git commit -m initial` to capture existing files). The manual not-a-repo remediation in `buildWorkdirNotRepoMessage` offers both paths explicitly so users with files already in the workdir don't end up with a born-but-empty HEAD that re-trips the worktree workflows. `--git=warn` downgrades to a warning as before.
- Auto-init refuses in non-empty workdirs (PR #235 round 7, Copilot:3260568814).
`--git=init --allow-init` creates an empty initial commit so HEAD is born, but it does NOT stage user files: in a non-empty workdir that left user content outside HEAD, and worktrees created from HEAD by workflows like `build_product_with_superspec` were silently empty (missing `SPEC.md`, `.ai/decisions/execution-plan.md`, etc.). New Latch 3 in `runAutoInit`: if the workdir contains any entry other than `.git`, refuse with `ErrGitAutoInitRefused` and tell the user to stage their own initial commit (`git init && git add . && git commit -m initial`) so they control what lands in HEAD. The refusal fires before `git init` runs, so the workdir is unchanged on refusal. We don't auto-`git add -A` because user content can include secrets (`.env`), build artifacts, or anything else they hadn't yet decided to track. `pipeline.WorkdirHasContent` is exported so the Doctor preview can model the same latch and avoid the false-OK case where Doctor reported success but the runtime would refuse.
- Auto-init `[Y/n]` prompt no longer treats EOF as consent (PR #235 round 7, Copilot:3260568794). `defaultPromptYN` returned `true` when `bufio.Scanner.Scan()` returned `false` (EOF or read error), so a piped run with no stdin could satisfy the consent latch without the user typing anything. Now `readPromptYN` (the testable inner half) returns `false` on `Scan() == false`; only an actual empty line (a successful read of the user pressing Enter) defaults to yes, matching the uppercase Y in `[Y/n]`. The test matrix pins all six outcomes (`eof_refuses`, `blank_line_accepts`, `yes_lower`, `yes_upper`, `no_lower`, `no_upper`).
- Spec test-plan corrected (PR #235 round 7, Copilot:3260568849).
`docs/superpowers/specs/2026-05-15-tracker-git-preflight-design.md` line 280 previously claimed `--git=init` without `--allow-init` is caught at flag-parse time. It isn't: the `--allow-init` requirement is a preflight-time latch (`pipeline.Preflight` → `runAutoInit`), because interactive (TTY) runs may satisfy it via the `[Y/n]` prompt. Updated to describe the actual behavior and point at the existing `TestRunAutoInit_NeedsAllowInit_NonInteractive` test.
Patch release fixing a runaway-agent bug in three of the four built-in workflows. No engine changes.
- Built-in workflows no longer run an unconstrained agent in `Start`/`Done` (#230). `workflows/ask_and_execute.dip`, `workflows/build_product.dip`, and `workflows/build_product_with_superspec.dip` (and their `examples/` mirrors) defined Start/Done as `agent` nodes with `prompt: Initialize pipeline.` / `prompt: Pipeline complete.`. Because the prompt attribute was present, `ensureStartExitNodes` skipped the passthrough handler and these nodes became real codergen sessions: system message limited to the file-path reminder, full tool access (read/write/bash/glob/edit/grep_search), no per-node turn cap. A real `build_product` run was observed spending ~10 minutes and ~39k output tokens inside `Start`, implementing an entire separate Go project from a SPEC.md found on disk, before getting `context canceled` and being classified `outcome: retry`. Dropping the prompt lines makes Start/Done passthroughs (matching `deep_review.dip`, which was already correct). `dippin doctor` scores went A → 100/100 on both `build_product` files; `ask_and_execute` stays at 95 (unrelated warning). The broader engine policy gaps surfaced by this incident (`outcome: retry` on cancellation, no default `max_turns` cap, runaway nodes invisible to `tracker diagnose`, missing tool-call args in `activity.jsonl`, suspect per-node token accounting) are tracked in #230 for separate follow-up.
Maintenance release picking up dippin-lang v0.25.0's bundle-load fixes. No tracker-side feature changes; no breaking changes.
- dippin-lang dependency bumped v0.24.0 → v0.25.0. Picks up the v1.1 `.dipx` format clarifications and three bug fixes that affect tracker's bundle-load path: cycle detection now walks every manifest-listed workflow (was: only entry-reachable, could miss cycles in unreachable workflows that `parseAllWorkflows` had already loaded); `dipx.Open` enriches `ErrManifestInvalid` / `ErrUnsupportedFormatVersion` errors with the bundle path; `dipx.Pack` correctly classifies subgraph parse failures as `ErrSubgraphParse` instead of `ErrEntryParse`. Also adds context-cancellation checks through Open/Pack hot paths. `Source.Workflow` gained a `ctx` parameter (breaking for `Source`-interface consumers); tracker uses `Bundle.Entry()` / `Bundle.Lookup()` directly so no code change needed at the call sites. `PinnedDippinVersion` constant updated to match.
This release closes the five-issue follow-up arc from the #208 design review. All additions are backward-compatible: the activity-log relocation (#213) falls back to the legacy path for archived runs, and the new lint (#211) / routing channels (#210, #212) are opt-in.
- Activity log integrity hardening (closes #213). The audit log used to live at `<workDir>/.tracker/runs/<runID>/activity.jsonl` mode `0o644`, reachable via relative path from any tool subprocess running with `cmd.Dir = workDir`, opening the door to injected `decision_edge` lines, truncated `tool_output_truncated` events, and forged `pipeline_completed` records. Two-part fix: (A) live writes now go to `$XDG_STATE_HOME/tracker/runs/<runID>/activity.jsonl` (default `$HOME/.local/state/tracker/`, mode `0o600`; `TRACKER_AUDIT_DIR` override; `%LOCALAPPDATA%` fallback on Windows), outside any tool subprocess's `cmd.Dir`, so the most common LLM-tool-mistake attack vectors (shell redirection from project root, `find . -name activity.jsonl`) no longer reach it. (B) Every line the runtime writes is prefixed with `\x1f\x1e` (`pipeline.ActivityLogSentinel`); `tracker diagnose` and `tracker.Audit` validate the sentinel and surface non-sentinel lines as `SuggestionAuditLogInjection`. On `JSONLEventHandler.Close()` a sentinel-stripped snapshot is written to the legacy run-dir path (mode `0o644`, `O_NOFOLLOW` on unix) so bundle export and git_artifacts still find a readable JSONL file in the run dir. Pre-#213 runs and archived runs without the secure file fall through to the legacy path via `tracker.ResolveActivityLogPath` and parse unchanged, so the change is backward compatible. The sentinel scheme is detection, not authentication: an attacker who reads tracker's source can emit the bytes, by design. Per-line HMAC (option C in the issue) is explicitly out of scope; the key-management cost is too high for the marginal gain. The threat model is documented in CLAUDE.md under "Activity log integrity."
- `_TRACKER_ROUTE=` reserved sentinel for convention-based routing (closes #212). Complement to `marker_grep:` (#210) for tools that can't change schema or want to opt into typed routing without a node attribute. The runtime scans every tool node's captured stdout for lines matching `^\s*_TRACKER_ROUTE=(.+?)\s*$`, takes the LAST match's captured value, and populates `ctx.tool_route`. Anchored on both ends so an arbitrary `_TRACKER_ROUTE` substring inside other text doesn't match; CRLF-tolerant per-line. Author pattern: emit `printf '_TRACKER_ROUTE=tests-pass\n'` from the tool once the routing decision is known, then route via `when ctx.tool_route = tests-pass`. New optional `route_required: true` node attribute opts in to strict mode: when set AND no sentinel was emitted, the node fails with `OutcomeFail` and emits `EventToolRouteMissing` (with the captured stdout tail for diagnosis) rather than silently falling through. `ctx.tool_route` is LLM-origin (the subprocess emitted it), so it is not in the `tool_command` safe-key allowlist and cannot be declared as a `writes:` target (the runtime owns it). `tracker diagnose` surfaces a `SuggestionToolRouteMissing` with the recommended fix copy.
- `TRK101` validate-time lint for risky tool-stdout routing (closes #211). New tracker-specific lint rule (TRK1XX namespace, distinct from dippin's DIP1XX) that surfaces the #208 foot-gun shape at `tracker validate` and `tracker doctor` time, before a pipeline ships. Fires on a tool node when ALL of: (1) routes on `ctx.tool_stdout` via exactly one conditional edge, (2) has an unconditional fallback edge, (3) has no `marker_grep:` declared, (4) has no explicit `output_limit:`, (5) command body emits volume (`tee` or `2>&1`). Suggests `marker_grep` (the #210 structural fix) as the primary remediation, then `output_limit:`, then splitting the volume-emitting body from the routing-signal printf, then enumerating every expected marker as its own conditional edge. Heuristics tuned for low false-positive rate via sweep across `examples/*.dip`: skips nodes that also route on `ctx.outcome` (exit code primary signal) and nodes with 2+ conditional edges on `tool_stdout` (exhaustive enumeration is the safely-structured pattern, as in `parallel-ralph-dev.dip`'s `ContractCheck`/`IntegrationTest` validators). Eight unit tests pin the positive case and each skip condition individually.
- `marker_grep:` node attribute on tool nodes (closes #210). Typed routing channel separate from `ctx.tool_stdout`: the runtime applies the declared regex line-by-line to captured stdout, last match wins, and `ctx.tool_marker` is populated with capture group 1 (or the full match if the regex has no groups). With `marker_grep: '^tests-(pass|fail)$'`, routing reads `when ctx.tool_marker = pass` (the captured group), not the whole line: explicit intent, not "whatever the tool happened to print last." If you want the full token, drop the group: `marker_grep: '^tests-pass$|^tests-fail$'` then `when ctx.tool_marker = tests-pass`. If the regex matches nothing, the node fails with `OutcomeFail` and emits `EventToolMarkerMissing` (with the configured pattern + the last 256 bytes of captured stdout for diagnosis) rather than silently falling through to an unconditional edge, which is the foot-gun removal that's the whole point. Bad regex on the node surfaces via `ctx.tool_marker_error` plus a node fail. `ctx.tool_marker` is LLM-origin (the subprocess emitted it), so it is not in the `tool_command` safe-key allowlist: conditions can read it, but tool_command interpolation cannot. Compatible with the existing `output_limit` tail-window: the regex runs over the captured tail, so an end-of-output routing marker survives by construction.
- Property-based tests for `tailBuffer` (closes #214). New dev dep `pgregory.net/rapid` v1.3.0 and `agent/exec/tail_buffer_property_test.go` cover the tail-window invariant across arbitrary write sequences: for any sequence of `Write` calls with total `N` bytes and `limit` `L`, `tb.String()` equals the last `min(N, L)` bytes of the concatenation. A second property pins `Truncated()` and `BytesDropped()` against the same invariant. Generalizes the ~12 hand-rolled example-based boundary tests in `tail_buffer_test.go` to the full state space; catches off-by-one boundary errors, write-boundary state corruption, and ring-buffer wrap-around bugs (the class of bugs PR #215 went through several iterations to get right). 100 random cases per property; fast (< 2ms per property).
- Tool stdout/stderr truncation now keeps the tail, not the head (closes #208). Pre-fix, the per-stream 64KB cap in `agent/exec/local.go` kept the first 64KB of output, which silently dropped routing markers (`printf 'tests-pass'`) past the boundary; pipeline routing then fell through the unconditional fallback edge and could ship broken code as if it had passed. The notebook_smoke pipeline reproduced this twice in one day. New `tailBuffer` ring-buffer keeps the trailing `limit` bytes (O(1) amortized per-byte cost, single `limit`-sized allocation, exact tail match regardless of write boundaries). `CommandResult` gains structured `StdoutTruncated` / `StdoutBytesDropped` / `StderrTruncated` / `StderrBytesDropped` fields so callers no longer have to pattern-match on an in-band sentinel string. Symmetric for stderr, which closes a pre-existing zero-stderr-truncation-tests coverage gap. The in-band `"...(output truncated at N bytes)"` suffix is gone; consumers must read the new flags (or the new `EventToolOutputTruncated` event, below) to detect truncation. Drops the unintended-defense head pattern surfaced by the security reviewer (head-keep accidentally defended against a different attack; see follow-up issue #212 for the reserved routing-sentinel hardening that closes the new threat-model delta).
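A tail-keeping ring buffer of the kind described above might look like the following. This is an illustrative sketch, not the code in `agent/exec`: the field names and the `newTailBuffer` constructor are assumptions, and only the method names `Write`, `String`, `Truncated`, and `BytesDropped` come from the changelog text.

```go
package main

import "fmt"

// tailBuffer keeps only the trailing limit bytes of everything written to it.
type tailBuffer struct {
	buf     []byte // fixed-size ring storage, allocated once
	start   int    // index of the oldest retained byte
	size    int    // number of retained bytes (<= len(buf))
	dropped int64  // bytes that fell out of the window
}

func newTailBuffer(limit int) *tailBuffer {
	return &tailBuffer{buf: make([]byte, limit)}
}

func (t *tailBuffer) Write(p []byte) (int, error) {
	n, limit := len(p), len(t.buf)
	if limit == 0 {
		t.dropped += int64(n)
		return n, nil
	}
	if n >= limit {
		// p alone fills the window; everything retained so far is dropped.
		t.dropped += int64(t.size + n - limit)
		copy(t.buf, p[n-limit:])
		t.start, t.size = 0, limit
		return n, nil
	}
	for _, b := range p {
		if t.size < limit {
			t.buf[(t.start+t.size)%limit] = b
			t.size++
		} else {
			// Window is full: overwrite the oldest byte and advance.
			t.buf[t.start] = b
			t.start = (t.start + 1) % limit
			t.dropped++
		}
	}
	return n, nil
}

// String returns the retained bytes in write order.
func (t *tailBuffer) String() string {
	out := make([]byte, 0, t.size)
	for i := 0; i < t.size; i++ {
		out = append(out, t.buf[(t.start+i)%len(t.buf)])
	}
	return string(out)
}

func (t *tailBuffer) Truncated() bool     { return t.dropped > 0 }
func (t *tailBuffer) BytesDropped() int64 { return t.dropped }

func main() {
	tb := newTailBuffer(10)
	tb.Write([]byte("lots of build noise... "))
	tb.Write([]byte("tests-pass"))
	// The trailing routing marker survives even though earlier output was dropped.
	fmt.Printf("%q truncated=%v dropped=%d\n", tb.String(), tb.Truncated(), tb.BytesDropped())
}
```

The invariant to check is that `String()` always equals the last `min(N, limit)` bytes of the concatenated writes, regardless of how the writes were split.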
- `EventToolOutputTruncated` activity event (#208 Tier 1). Emitted once per truncated stream after each tool node, with `TruncationDetail{Stream, Limit, CapturedBytes, DroppedBytes, TotalBytes}`. Written to `activity.jsonl` so `tracker diagnose`, `tracker.Audit`, and NDJSON consumers can detect truncation retrospectively. `tracker diagnose` surfaces a `SuggestionToolOutputTruncated` suggestion explaining the elision, pointing at `output_limit` as the escape hatch, and noting the tail-window preserves trailing routing markers by construction.
- `EventConditionalFallthrough` activity event (#208 Tier 2). Fires when at least one conditional outgoing edge from a node was evaluated, all evaluated false, and routing fell through to a fallback (`label`, `suggested`, or `weight`). Carries the list of `ConditionEval{EdgeTo, Condition}` entries that missed. Does NOT fire on intentional all-unconditional routing, which distinguishes "stated routing intent missed" from "fallback is the only option." `tracker diagnose` correlates this with `EventToolOutputTruncated` on the same node and surfaces a combined suggestion when both fire: the canonical diagnostic narrative for the #208 failure shape ("your routing marker may have been dropped").
- Five follow-up issues filed for the broader hardening surface (#210 marker_grep primitive · #211 validate-time lint for risky stdout-routing patterns · #212 `_TRACKER_ROUTE` reserved sentinel · #213 activity.jsonl integrity · #214 property tests via `pgregory.net/rapid`). Each came out of the 6-expert design panel that reviewed the #208 proposed fixes.
- Native
.dipxbundle support (closes thedocs/requests/native-dipx-bundle-support.mdrequest from the pipelines team). Tracker now accepts content-addressed.dipxbundles (produced bydippin pack) anywhere it accepts a pipeline file:tracker validate,tracker simulate,tracker run,tracker doctor, andtracker -r <runID>resume. Pre-fix, tracker read the bundle's ZIP bytes as.dipsource and failed with bogusDIP001/DIP002validation errors — the runtime didn't share dippin's understanding of the format, so the integrity guarantees, single-artifact distribution, and audit-trail provenance value of.dipxonly landed at lint time. Newpipeline.LoadDipxBundleopens the bundle viadipx.Open(SHA-256 verifies every file inmanifest.jsonbefore any content reaches the parser), uses the bundle's pre-parsed*ir.Workflowdirectly (no re-parse of bundled sources), and bypasses the filesystem subgraph walker entirely since dipx already verifies ref closure + acyclicity onOpen. The bundle's content-addressed identity (sha256:<hex>) is stamped onto every line ofactivity.jsonl(engine emissions, parallel/manager_loop emissions that bypass the engine's emit chokepoint, and agent/llm JSONL writes that bypass both — three composable layers so every line of audit output carries provenance), persisted intocheckpoint.jsonfor resume verification, and surfaced intracker list(newBundlecolumn) andtracker audit(newBundle:header line). Bundle identity is exposed ontracker.Result.BundleIdentityandtracker.RunSummary.BundleIdentityfor embedded library callers. Resume against a.dipxstrictly verifies the stored identity matches the one being resumed — mismatch aborts with both hashes shown so the operator can pick the right artifact;--force-bundle-mismatchis the escape hatch (loud warning to stderr). Bare-name resolution (tracker build_product) still resolves.dipfirst, then file, then built-in —.dipxis dispatched explicitly by extension on full paths. 
Because the identity is computed deterministically over manifest bytes and verified on everyOpen, atracker validatepass on a CI bundle gives the same answer as the production run.
- dippin-lang dependency bumped v0.23.0 → v0.24.0 for the new `dipx` package (`Open`, `Bundle.Workflow`, `Bundle.Identity`). `PinnedDippinVersion` in `tracker_doctor.go` updated to match so `tracker doctor`'s version-mismatch check reflects the new pin.
- `pipeline.LoadDipxBundle` now returns diagnostics instead of writing to `os.Stderr`. The library API no longer prints to the process-global stderr; the signature gains a `[]validator.Diagnostic` return so embedded callers can route them through their own logger. CLI callers (`cmd/tracker/loadDipxPipeline`, `tracker doctor`'s bundle check) print to stderr as before. Mirrors the existing `pipeline.LoadDippinWorkflow` contract for the `.dip` path.
-
Gemini SSE parser coalesces split finish + usage chunks into a single
EventFinish. Follow-up polish to the earlier trailing-usage fix: when an upstream emits the finish reason and theusageMetadatain two separate chunks (the 2389 Bedrock Gateway does this; real Google can too), the parser now buffers the finish reason ingeminiStreamState.pendingFinishinstead of emitting it immediately. When the trailing usage chunk arrives, both are emitted together as one event. AflushPendingFinishhelper on*geminiStreamStateguarantees the buffered reason is emitted before every early-return path — clean stream exit, scanner error, and JSON parse error — so partial-failure streams still produce a terminalEventFinishahead of theEventError, preserving the prior behavior for accumulator bookkeeping. The combined-chunk path also defensively clearspendingFinishto guard against a hypothetical split-then-combined upstream emitting a duplicate finish at stream end. Net effect: thellm finishtrace line now prints exactly once per turn regardless of upstream chunking shape, fixing the duplicate-line cosmetic artifact called out in the Fixed entry below. Four new regression tests pin the behavior end to end (TestAdapterStreamTrailingUsageChunkEmitsSingleFinishfor the split case;TestAdapterStreamFinishWithoutUsageChunkfor the no-trailing-usage case;TestAdapterStreamCombinedAfterSplitClearsPendingfor the defensive pending-clear;TestAdapterStreamParseErrorFlushesPendingFinishfor the parse-error flush ordering). Also extracts ausageFromMetahelper since the samegeminiUsageMeta→*llm.Usageconversion now happens at three call sites. -
Bedrock Gateway integration guide refreshed for upstream gateway fixes #4 and #5 (closed 2026-04-30). The gateway now accepts both Cloudflare AI Gateway native routing prefixes (
/anthropic,/openai,/google-ai-studio,/compat) and Gemini's/v1beta/models/...paths, so tracker's--gateway-urlflag works end-to-end againsthttps://bedrock-gateway.2389-research-inc.workers.devandprovider: geminiis no longer broken. Smoke-tested with a single-agent dip pipeline:provider: anthropicandprovider: geminiboth completed against the live gateway.docs/bedrock-gateway.mdrewritten to lead with the recommended--gateway-urlrecipe; the old "Why not--gateway-url?" section removed; the compatibility matrix flips Gemini to working; the "404 on every request" and "Gemini/v1beta404" troubleshooting entries dropped. Theprovider: openai(Responses API) row stays as broken pending gateway #3, which was reopened after we discovered it had been auto-closed by an unrelated commit's "Fix #3" wording referring to a bot-review item, not the GitHub issue.
- Gemini token usage no longer reports 0 when the upstream emits
usageMetadataas a standalone trailing SSE chunk. Tracker'sllm/google/adapter.goSSE parser bailed on any chunk with nocandidatesarray, which dropped trailing usage-only chunks on the floor — soStreamAccumulatoronly saw the candidate chunks (with no usage attached) and the finalUsage{}came out empty. Surfaced while smoke-testing tracker against the 2389 Bedrock Gateway where the gateway's:streamGenerateContent?alt=ssereply is three chunks: text →finishReason:"STOP"→usageMetadata. The accumulator contract already supportsprocessFinishbeing called twice (first setsfinishReason, second updatesusagewithout overwriting reason), so the fix is a 10-line patch inprocessSSELine: when a candidate-less chunk carriesUsageMetadata, emit a usage-onlyEventFinish. End-to-end verified against the live bedrock gateway — a single-agentprovider: geminismoke run now reports1,408 in / 4 outinstead of0 in / 0 out, and tracker's per-provider cost rollup is correct (no double-counting becauseAggregateUsagefolds per-nodeSessionStats, not perTraceEvent). Net visible artifact: thellm finishtrace line now prints twice on affected gateways — first withreason=stopand no tokens, second withtokens=N/Nand no reason — but the final accumulated state is correct. New regression testTestAdapterStreamTrailingUsageChunkpins the trailing-chunk case end-to-end throughStreamAccumulator.Response().
-
Architect-side machinery for local codegen (PR #198). New agent-tool primitive
TerminalToollets a tool flag itself as the terminal step of an agent session — the runtime breaks the loop the moment it succeeds (after the same turn's tool batch, but before the next LLM call), avoiding wasted post-dispatch turns. Newagent/tools/dispatch_sprintsreads a{path, description}JSONL plan and runs the per-sprint author+audit pipeline once per line via a deterministic in-tool loop with bounded retry+backoff for retryable provider errors (5xx / rate-limit / timeout / network); non-retryable errors bubble out so the agent can react. Newagent/tools/write_enriched_sprintcalls a mid-tier LLM (Sonnet by default) once per sprint with a 4-strategy SEARCH/REPLACE matcher (exact → indent-preserving → whitespace-insensitive → fuzzy with Levenshtein ratio ≥ 0.9), partial-apply semantics that distinguishPATCHED-PARTIALfrom cleanPATCHED, and a tolerant audit-verdict parser that handlesAUDIT-VERDICT:anywhere in the first 10 non-empty lines (markdown decoration, leading prose, fence-wrapped output all tolerated). Companionagent/tools/generate_codecalls a cheap/fast model (defaultgpt-4o-mini, override viaTRACKER_CODEGEN_MODEL) to expand a contract into one or more files. All four tools land via env-gated registration inpipeline/handlers/backend_native.gokeyed onTRACKER_SPRINT_WRITER_MODEL/TRACKER_CODEGEN_MODEL. Validated end-to-end on Notebook synthetic (41/41 pytest passing, ~$2, 28min) and NIFB architect-only (16 sprints, Pattern B autonomously, ~$5, 47min). Includes path-traversal guards (newresolveUnderRoothelper with symlink evaluation) covering both write paths and contract-file reads, and uniform reservation of theCompleterinterface across the agent and tools packages via a type alias to prevent silent divergence. -
Self-healing JSON extraction cascade for declared writes (PR #201). When an LLM responds with prose instead of valid JSON for a node with
writes:, the runner now attempts: (1) direct JSON parse; (2) extraction of any...fenced block whose content parses as a JSON object — iterating fences via a strict-shape regex so atext/bashpreamble doesn't block discovery of a laterjsonfence and stray inline backticks in prose don't kick off extraction; (3) balanced-brace scan for the first top-level{…}span that parses as an object (handles prose with stray brace pairs around real JSON without picking the wrong span;{inside JSON-string values and inside[…]arrays are correctly skipped via state tracking); (4) single-key fallback to the raw response with awrites_warningso the pipeline survives. Multi-key writes still hard-fail since prose can't be distributed. The fallback is gated on "no extractable JSON found" — a model that returned valid JSON missing the declared key gets a hard contract failure with a specific error, not a silent fallback. Fallback values are capped at 8 KiB to keep large tool stdout out ofstatus.json/activity.jsonl/ checkpoints. Driven by ananalyze_specfailure on the NIFB run where the agent wrote.ai/spec_analysis.mdbut responded "Done — …" in prose; the runner used to hard-fail on the first character of the response, now heals and surfaces a warning. -
Bedrock Gateway integration guide (PR #200). New `docs/bedrock-gateway.md` walks through pointing tracker at the 2389 Bedrock Gateway Cloudflare Worker: per-provider `*_BASE_URL` recipes, a provider compatibility matrix (anthropic and openai-compat work; openai's Responses API and gemini's `/v1beta` paths don't, with workarounds), authentication via Cloudflare AI Gateway tokens, and verification guidance pointing at the CF AI Gateway dashboard rather than `tracker doctor` (which doesn't echo the resolved base URL).
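For the self-healing JSON extraction cascade in this list (PR #201), the sketch below illustrates two of the four stages: the direct parse and the balanced-brace scan. It is a simplified illustration under stated assumptions: `extractObject` and `matchBrace` are hypothetical names, the fenced-block stage and the single-key fallback are omitted, and unlike the described runner this version does not skip `{` that appear inside `[…]` arrays.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// matchBrace returns the index of the '}' that closes the '{' at s[open],
// skipping braces inside JSON string literals. It returns -1 when unbalanced.
func matchBrace(s string, open int) int {
	depth, inStr, esc := 0, false, false
	for i := open; i < len(s); i++ {
		c := s[i]
		switch {
		case esc:
			esc = false // this byte is escaped; consume it
		case inStr && c == '\\':
			esc = true
		case c == '"':
			inStr = !inStr
		case inStr:
			// ordinary byte inside a string literal
		case c == '{':
			depth++
		case c == '}':
			depth--
			if depth == 0 {
				return i
			}
		}
	}
	return -1
}

// extractObject tries a direct parse first, then scans for the first
// balanced {...} span that parses as a JSON object.
func extractObject(resp string) (map[string]any, bool) {
	var obj map[string]any
	if json.Unmarshal([]byte(resp), &obj) == nil && obj != nil {
		return obj, true
	}
	for i := 0; i < len(resp); i++ {
		if resp[i] != '{' {
			continue
		}
		end := matchBrace(resp, i)
		if end < 0 {
			continue
		}
		var candidate map[string]any
		if json.Unmarshal([]byte(resp[i:end+1]), &candidate) == nil && candidate != nil {
			return candidate, true
		}
	}
	return nil, false
}

func main() {
	resp := `Done {see notes}. Result: {"path": ".ai/spec_analysis.md", "note": "has {braces}"}`
	obj, ok := extractObject(resp)
	fmt.Println(ok, obj["path"])
}
```

The stray `{see notes}` span is balanced but does not parse, so the scan moves on to the next opening brace instead of giving up or returning the wrong span.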
- `writes:` declarations are rejected when they collide with reserved key names (PR #201). Two reserved sets: (a) the `tool_command` safe-key allowlist (`outcome`, `preferred_label`, `human_response`, `interview_answers`), exposed via the new `pipeline.IsToolCommandSafeCtxKey` accessor; letting a workflow declare `writes: outcome` would funnel LLM-controlled content into a reserved name and bypass the sanitization that keeps LLM output out of shell input; (b) the writes-signal keys (`writes_error`, `writes_warning`), the runtime observability that `tracker diagnose` and `when ctx.writes_error != ""` edges branch on; allowing a workflow to set them via writes would let an LLM spoof failure/healed signals. Collision rejection runs before any value is written and fails the node. No existing pipelines used these collisions.
- `tracker doctor` provider probe restored to 16-token max output (PR #199, mdagost). The probe had been using `maxTok := 1`, but OpenAI's Responses API requires `max_output_tokens >= 16` and returns HTTP 400 (`Invalid 'max_output_tokens': integer below minimum`) below that, breaking `tracker doctor` for OpenAI keys entirely.
- ACP `CreateTerminal` now validates commands against the built-in denylist and constrains `cwd` to the working directory (PR #197). Previously an LLM-directed ACP agent could execute arbitrary commands via `CreateTerminal`, completely bypassing the denylist/allowlist that protects `tool_command`. Bare denylisted commands (e.g. `eval` with no args) are also blocked. Error code corrected to `-32602` (Invalid Params) matching `ReadTextFile`/`WriteTextFile`.
- Claude Code backend kills subprocess process group on pipeline cancellation (PR #197). Added `SysProcAttr.Setpgid`, `cmd.Cancel` (SIGKILL to process group), and `WaitDelay` to prevent orphaned `claude` subprocesses consuming API credits after ctrl-C or budget breach.
- `TRACKER_PASS_API_KEYS` now requires `=1` instead of any non-empty value (PR #197). Previously `TRACKER_PASS_API_KEYS=false` or `=0` silently leaked all API keys to the claude subprocess. `tracker doctor` env warning updated to match.
- Engine fails on unknown outcome status instead of treating as success (PR #197). The `default:` case in `handleOutcomeStatus` previously called `MarkCompleted`, silently promoting handler bugs to success. Now emits `EventStageFailed` and sets `OutcomeFail`.
- Pipeline goroutine panic recovery (PR #197). `runPipelineAsync` now has `defer`/`recover` so a handler panic produces a clean error instead of crashing the TUI without checkpoint save.
- `PinnedDippinVersion` updated to `v0.23.0` to match `go.mod` (PR #197). `tracker doctor` was telling users to install v0.21.0.
- `DefaultModel` updated to `claude-sonnet-4-6` (PR #197). Was still `claude-sonnet-4-5`.
- Autopilot LLM calls now respect pipeline context cancellation (PR #197). All call sites used `context.Background()`, so pipeline cancellation had no effect on in-flight autopilot requests. New `ContextSetter` interface threads the pipeline context without changing the `LabeledFreeformInterviewer` contract.
- Example `manager_loop_child.dip` updated for `steer.*` namespace (PR #197). References `${ctx.steer.hint}` instead of the broken `${ctx.hint}` after PR #196's rename.
- `escapeOsascript` now escapes newlines to prevent injection in macOS notification strings (PR #197).
- Removed stale comment in `human.go` that incorrectly claimed CLAUDE.md was wrong about `questions_key` (PR #197).
stack.manager_loopsteer_contextkeys are now namespaced understeer.*(closes #177). Previously a manager_loop'ssteer_context: { outcome: "fail" }injected a bareoutcomekey into the running child'sPipelineContext, which collided with the four safe-allowlisted bare ctx keys (outcome,preferred_label,human_response,interview_answers) thattool_commandvariable expansion permits. The threat: todaysteer_contextis static at.dipparse time so collisions are author-controlled, but if a future feature lets steer values come from LLM output an attacker-controlled value could reach a shell command via${ctx.outcome}. Fix is option B from the issue: a newnamespaceSteerKeyshelper inpipeline/handlers/manager_loop.gorewrites every parsed key with theSteerContextKeyPrefix = "steer."prefix before it lands incfg.steerKeys, so the collision is impossible by construction — bare safe-allowlist keys stay reserved for legitimate node-level outcomes, steered values flow throughsteer.*and are blocked from tool_command expansion (the namespace isn't on the allowlist). The transform is idempotent (already-namespaced keys aren't double-prefixed) and applies uniformly viaparseManagerLoopConfig. Authors who want to read steered values in prompts / conditions /--max-costlookups now reference${ctx.steer.<key>}. Behavior change: any pipeline that today reads a steer-injected value via the bare-key form (e.g.${ctx.hint}aftersteer_context: { hint: "..." }) needs updating to${ctx.steer.hint}. Mixed-form input (hint=a,steer.hint=bin the samesteer_context) is rejected at parse time withErrAmbiguousSteerKeyrather than picked nondeterministically by Go map iteration order. Five regression tests pin (a) bare keys get prefixed, (b) the transform is idempotent and nil-safe, (c) attempting to steer one of the four safe-allowlist keys (outcome,preferred_label,human_response,interview_answers) lands assteer.<safekey>so the bypass is closed end-to-end, and (d) the bare/prefixed collision case is rejected loudly.
- Claude Code backend now reports cache-token usage from the NDJSON result envelope (closes #185 Track A). The Claude CLI already emits `cache_read_input_tokens` and `cache_creation_input_tokens` in its `result` NDJSON message, but `storeResult` was silently dropping them, so `llm.EstimateCost` priced every input token at the fresh rate. For the canonical heavy-cache workload (Sonnet 4.5 + CLAUDE.md injection on every turn with stable prompt caching, typically 60–90% cache-read by input token count) that resulted in a ~3× overcount on the input side of per-node cost. Fix: `ndjsonUsage` gains `CacheReadInputTokens` + `CacheCreationInputTokens` JSON fields; `storeResult` populates the matching `*int` pointers on `llm.Usage` when non-zero so `EstimateCost` prices cache reads at 10% and cache writes at 25% of the input rate (Anthropic pricing convention). `TotalTokens` stays fresh-input + output to match the convention in `llm/anthropic/translate_response.go`: cache tokens are tracked separately, priced independently, and deliberately kept out of the token total so `BudgetGuard`'s `--max-tokens` semantics stay consistent across backends. Two new regression tests pin the populated-from-NDJSON case and the back-compat case (no cache fields → nil pointers, unchanged total).
- `TRACKER_ACP_CACHE_READ_RATIO` env var for ACP cost-estimate tuning (closes #185 Track B). The ACP protocol doesn't report cache tokens and the tracker-side heuristic can't observe them, so estimated ACP input was priced entirely as fresh: conservative (never under-reports) but up to ~3× high for workloads where the bridge keeps a stable context cached. Setting `TRACKER_ACP_CACHE_READ_RATIO` to a value in `(0, 1]` tells `estimateACPUsage` what fraction of the estimated input tokens to route to `CacheReadTokens` (priced at 10% of the input rate) instead of `InputTokens`. Typical values: `0.5–0.8` for stable-context Claude workloads. Default (unset or out-of-range) keeps the conservative behavior. Out-of-range values log a one-time warning and are ignored. Seven regression tests pin the split math across unset, sub-1, exactly-1, negative, >1, and non-numeric inputs.
- `--tool-denylist-add <glob>` CLI flag + `tool_denylist_add` graph attribute (closes #168; completes the deferred `WorkflowDefaults.ToolDenylistAdd` adapter wiring from v0.24.0 #181). Operators and workflow authors can now extend the built-in tool-command denylist (eval, pipe-to-shell, curl|sh, etc.) with additional glob patterns for defense in depth; previously the only way to block a new pattern without forking tracker was to restrict via `--tool-allowlist`, which inverts the default. `CheckToolCommand` now takes an extra-deny-patterns arg that checks alongside the built-ins. Interaction rules: user-added patterns cannot remove any built-in, `--bypass-denylist` still disables everything (built-in + user-added; it's the all-or-nothing escape hatch), and user-added patterns are evaluated before the allowlist so a command must pass both gates.
Plumbing mirrors the allowlist exactly: repeatable CLI flag with comma-separated value support, `handlers.GraphAttrToolDenylistAdd = "tool_denylist_add"` constant, `mergeToolDenylistAdd` union-with-dedup of CLI + graph patterns, adapter-side wiring from `ir.WorkflowDefaults.ToolDenylistAdd` into `graph.Attrs["tool_denylist_add"]`, `parseGraphCommaList` shared parser factored out so the allowlist and denylist-add paths can't drift on whitespace/trim semantics. Help text + preamble logging note the security posture (additive block for defense in depth; `--bypass-denylist` still disables).
- Estimated-usage flag plumbed from ACP backend through trace → CLI → TUI → NDJSON (closes #186). The
`ACPUsageMarker` introduced in v0.24.0 was written into `llm.Usage.Raw` but had no downstream readers: `llm.Usage.Add` and `buildSessionStats` both dropped `Usage.Raw`, so the CLI summary, TUI header, and NDJSON cost events saw a single dollar figure with no way to distinguish heuristic ACP spend from metered native/claude-code spend. Fix: `pipeline.SessionStats` gains `Estimated bool` + `EstimateSource string`; `pipeline.ProviderUsage` and `pipeline.UsageSummary` gain `Estimated bool`; `pipeline.CostSnapshot` gains `Estimated bool`. `buildSessionStats` calls a new `extractEstimateMarker` helper to populate `Estimated` / `EstimateSource` from `Usage.Raw` before the value is lost. `Trace.AggregateUsage` OR-propagates the flag across sessions and child-usage rollups: a single estimated session taints both its per-provider bucket and the summary-level flag, so a mixed native+ACP run is correctly labeled as "not fully metered". Surfaces: CLI "Tokens by Provider" table suffixes estimated providers with `(estimated)` and renders total cost as `~$X.XXXX (estimated — heuristic spend on at least one provider)`; `printTotalTokens` now emits `~$X.XX usage` whenever any session was heuristic (not just the pre-existing Max-subscription-only case); TUI header's cost badge prefixes with `~` for estimated runs; NDJSON `cost_updated` and `budget_exceeded` events carry `CostSnapshot.Estimated`. Three new test suites cover the propagation: `TestBuildSessionStats_PropagatesACPEstimatedMarker` in transcript_test.go, `TestTraceAggregateUsage_EstimatedPropagation` in trace_test.go (4 sub-tests), and `TestPrintTotalTokens_*` in cmd/tracker (3 tests). Not in scope (per the issue): changes to `llm.Usage.Add`'s `Raw` handling. The flag is now carried by `SessionStats` forward; `Usage.Raw` remains an implementation detail only read by `extractEstimateMarker` at the single point where `agent.SessionResult` is consumed.
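The OR-propagation rule from the entry above can be illustrated with a small aggregation sketch. The `session`, `providerUsage`, and `usageSummary` types and the `aggregate` function are hypothetical simplifications, not the `pipeline` package types.

```go
package main

import "fmt"

// session is a hypothetical per-session record carrying the estimated flag.
type session struct {
	Provider  string
	Tokens    int
	Estimated bool
}

type providerUsage struct {
	Tokens    int
	Estimated bool
}

type usageSummary struct {
	Providers   map[string]providerUsage
	TotalTokens int
	Estimated   bool
}

// aggregate sums tokens and ORs the estimated flag into both the
// per-provider bucket and the summary, so one heuristic session taints both.
func aggregate(sessions []session) usageSummary {
	sum := usageSummary{Providers: map[string]providerUsage{}}
	for _, s := range sessions {
		p := sum.Providers[s.Provider]
		p.Tokens += s.Tokens
		p.Estimated = p.Estimated || s.Estimated
		sum.Providers[s.Provider] = p

		sum.TotalTokens += s.Tokens
		sum.Estimated = sum.Estimated || s.Estimated
	}
	return sum
}

func main() {
	sum := aggregate([]session{
		{Provider: "anthropic", Tokens: 1200},
		{Provider: "acp", Tokens: 800, Estimated: true},
	})
	prefix := ""
	if sum.Estimated {
		prefix = "~" // mirrors the "~" prefix used for estimated totals
	}
	fmt.Printf("%s%d tokens (acp estimated: %v, anthropic estimated: %v)\n",
		prefix, sum.TotalTokens, sum.Providers["acp"].Estimated, sum.Providers["anthropic"].Estimated)
}
```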
- ACP estimator counts reasoning chunks and tool-call payloads (closes #184). Previously
`estimateACPUsage` only saw the collected assistant text (`handler.textParts`), so multi-turn tool-heavy sessions systematically under-reported usage, often by 10–100× for the canonical coding-agent workload (extended-thinking models, repeated tool loops). `acpClientHandler` now tracks three additional rune counters advanced at event time: `reasoningRunes` (advanced by `handleThoughtChunk`), `toolArgRunes` (advanced by `handleToolCallStart` from the JSON-formatted `RawInput`), and `toolResultRunes` (advanced by `handleToolCallUpdate` on completed or failed status from the tool's content + `RawOutput`). Counters are `int` (we store sums, not the underlying text), so memory cost is O(1) per channel regardless of output volume. `estimateACPUsage` folds them in: reasoning + tool-args contribute to `Usage.OutputTokens` (matching how providers price extended thinking today), tool-results contribute to `Usage.InputTokens` (the bridge re-sends tool output as next-turn input context), and reasoning additionally populates `Usage.ReasoningTokens` for future catalog-level per-reasoning pricing. The remaining intrinsic undercount (bridge-injected system prompt + tool-schema definitions) is documented in `docs/architecture/backends.md` and requires a bridge-specific `Meta` extension we don't have.
- ACP backend surfaces approximate per-prompt token usage (closes #167). The Agent Client Protocol spec (github.com/coder/acp-go-sdk v0.6.x) has no usage surface:
`PromptResponse` carries only `StopReason` + `Meta`, and no `SessionUpdate` subtype reports tokens, so ACP-backed nodes previously returned `SessionResult.Usage` zero-valued. `CodergenHandler.trackExternalBackendUsage` routes ACP usage to `llm.TokenTracker.AddUsage("acp", ..., model)` (the model arg is new this release; see the `claude-code`/`acp` `Provider`-wiring bullet below). `estimateACPUsage` synthesizes `llm.Usage` from rune counts (UTF-8 aware via `unicode/utf8`; `ceil(runes/4)` applied per side) and populates `EstimatedCost` via `llm.EstimateCost`. The estimator's channel coverage is described in full in the #184 entry above; the initial cut counted only the assistant text stream and the PR #189 follow-up extended it to reasoning + tool-call argument/result payloads. Remaining intrinsic undercount: the bridge's own injected system prompt + tool schemas are invisible to the heuristic (they never flow through `cfg.Prompt` / `cfg.SystemPrompt`). A one-time log line per `ACPBackend` instance announces that ACP token/cost numbers are estimates. `--max-tokens` now enforces against ACP sessions; `--max-cost` enforces when `cfg.Model` is a catalog-known ID (see `EstimateCost` warning below). `Usage.Raw` is tagged with `ACPUsageMarker{Estimated:true, Source:"acp-chars-heuristic", Ratio:4}` for consumers that inspect `SessionResult.Usage` directly, but `llm.Usage.Add` and `pipeline/handlers/transcript.go:buildSessionStats` both drop `Usage.Raw`, so the marker is currently write-only from the trace/CLI/TUI perspective; plumbing an explicit "estimated" flag through `SessionStats` / `ProviderUsage` / the TUI header is tracked as a follow-up.
- `Provider` field now set on `SessionResult` for `claude-code` and `acp` backends. Previously `backend_claudecode_ndjson.storeResult` and `buildACPResult` left `SessionResult.Provider` empty, which caused `pipeline.Trace.AggregateUsage` to bucket their usage under the `"unknown"` provider in per-provider rollups and CLI summaries. Set to `"claude-code"` / `"acp"` respectively, matching what `trackExternalBackendUsage` already uses as the `TokenTracker` provider key.
Dashboards and library consumers reading `EngineResult.Usage.ProviderTotals` will now see a populated `"claude-code"` / `"acp"` bucket instead of everything collapsing into `"unknown"`.
- `trackExternalBackendUsage` now threads `cfg.Model` into `TokenTracker.AddUsage` for the `claude-code` and `acp` backends. Previously the model arg was omitted, so `TokenTracker.CostByProvider`'s resolver fell back to `graph.Attrs["llm_model"]` (often empty for workflows that set models per-node) and priced at $0. As a result, library consumers reading `tracker.Result.Cost.ByProvider["claude-code"|"acp"]` saw `$0.00` even when the session computed a nonzero `EstimatedCost`, and `BudgetGuard`'s `--max-cost` ceiling was silently non-binding for those backends. Both paths now price correctly against the model the node actually ran under.
- `llm.EstimateCost` logs a one-time warning per unknown model when `GetModelInfo` returns nil and usage is non-zero. Previously returned `$0` silently, which violates the project's "never silently swallow errors" rule (CLAUDE.md) and hid the real consequence: `--max-cost` ceilings can't apply to usage priced under a model that isn't in the catalog. The warning names the unknown model once and spells out the budget implication.
- Built-in example pipelines for
`stack.manager_loop` (closes #175). `examples/manager_loop_demo.dip` + `examples/subgraphs/manager_loop_child.dip` exercise the full `subgraph_ref` + poll interval + steering path against a real child pipeline. Both grade A via `dippin doctor`, and the Makefile doctor target runs them so adapter-path regressions on the new v0.22.0 IR attrs trip CI instead of silently rotting.
- Diagnostic warning when both unprefixed + legacy
`manager.*` attrs are set on the same manager_loop node (closes #176). Surfaces accidental shadowing (author migrates some attrs to the v0.22.0 unprefixed contract but leaves the legacy form in place) without changing the unprefixed-wins precedence.
- `warnUnknownStackChildKeys` diagnostic on `stop_condition` and `steer_condition` expressions (closes #176). Scans for `stack.child.<word>` references and warns when the subkey isn't one of the three tracker actually publishes (`status`, `cycles`, `exit_status`). Catches typos that would silently evaluate to empty.
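One way such a scan could work, sketched with a regular expression. The function name `unknownStackChildKeys`, the exact pattern, and the de-duplication are assumptions for illustration; only the three published subkeys come from the entry above.

```go
package main

import (
	"fmt"
	"regexp"
)

// stackChildRef captures the subkey in references like stack.child.status.
var stackChildRef = regexp.MustCompile(`stack\.child\.(\w+)`)

// publishedChildKeys lists the three subkeys the entry says tracker publishes.
var publishedChildKeys = map[string]bool{
	"status":      true,
	"cycles":      true,
	"exit_status": true,
}

// unknownStackChildKeys returns each unrecognized subkey once, in order of appearance.
func unknownStackChildKeys(expr string) []string {
	var unknown []string
	seen := map[string]bool{}
	for _, m := range stackChildRef.FindAllStringSubmatch(expr, -1) {
		sub := m[1]
		if !publishedChildKeys[sub] && !seen[sub] {
			seen[sub] = true
			unknown = append(unknown, sub)
		}
	}
	return unknown
}

func main() {
	// "staus" is a typo for "status"; "cycles" is valid.
	fmt.Println(unknownStackChildKeys("stack.child.staus=failed || stack.child.cycles=3"))
}
```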
- dippin-lang dependency bumped v0.22.0 → v0.23.0. Upstream ships DIP28 tool-safety defaults:
`ir.WorkflowDefaults` now exposes `ToolCommandsAllow` and `ToolDenylistAdd` fields so `.dip` authors can declare tool-safety constraints at the workflow level instead of reaching for DOT or the library API. `extractWorkflowDefaults` in `pipeline/dippin_adapter.go` wires `WorkflowDefaults.ToolCommandsAllow` → `graph.Attrs["tool_commands_allow"]` (the consumer side has been ready since #164). Closes the adapter-side follow-up noted in v0.23.0's own #164 entry. `ToolDenylistAdd` wiring is deferred until the matching `--tool-denylist-add` CLI flag lands (#168).
- Docs relocated under
`docs/architecture/` (closes #165). `docs/pipeline-context-flow.md` → `docs/architecture/context-flow.md` and `docs/manager-loop.md` → `docs/architecture/handlers/manager-loop.md`. Every inbound link in `README.md`, `ARCHITECTURE.md`, `CLAUDE.md`, `CHANGELOG.md`, and the `docs/architecture/` tree is updated; the `handlers.md` "tracked in #165 for a later PR" placeholder is removed and `architecture/README.md`'s "may move under `architecture/` in a later PR" note is retired.
- `stack.manager_loop` nodes no longer bypass `--max-tokens` / `--max-cost` budgets (closes #188). Same shape of bug as #183 / PR #187 fixed for the subgraph handler: `ManagerLoopHandler.Execute` was constructing its child engine without `WithBudgetGuard` + `WithBaselineUsage`, and the handler's `Outcome` returned no `ChildUsage`. Operator-configured token and cost ceilings were therefore silently non-binding for any work nested in a manager_loop supervisor, the canonical place where long-running token piles form, since manager_loop is specifically designed for cycle-heavy async supervision (Attractor spec 4.11). Fix mirrors PR #187: `Execute` now reads `pipeline.ChildRunContextFromContext(ctx)` and threads the parent's `BudgetGuard` + baseline usage into the child engine, and `handleChildResult` sets `Outcome.ChildUsage = result.Usage` on every return path (success, fail, budget-exceeded). A child-side `OutcomeBudgetExceeded` is mapped to parent `OutcomeSuccess` (with ChildUsage attached), following the same strict-failure-edges avoidance reasoning as the subgraph fix. Three new regression tests mirror the subgraph suite's coverage: usage rollup into parent `ProviderTotals`, delayed parent-halt after the manager_loop overspends, and mid-loop child-guard halt via baseline + partial trace exceeding the ceiling.
- Subgraph nodes no longer bypass
`--max-tokens` / `--max-cost` budgets (closes #183). Pre-fix, a pipeline author could place cost-intensive nodes inside a subgraph and both the token and cost ceilings became silently non-binding: the child `pipeline.Engine` was constructed without `WithBudgetGuard`, so its between-node checks were no-ops; and `SubgraphHandler.Execute` returned an `Outcome` with no usage rollup, so the parent trace's `AggregateUsage` missed all child spend, preventing the parent's guard from firing either. Fix: (a) `Outcome` and `TraceEntry` gain a new `ChildUsage *UsageSummary` field; `Trace.AggregateUsage` folds it into both the running totals and per-provider buckets so parent-level rollups see child spend; (b) the engine stashes its `BudgetGuard` plus a snapshot of already-consumed `UsageSummary` on `ctx` via `ChildRunContextFromContext` (only when a guard is configured, so there is no overhead for unbudgeted runs), so handlers that launch child runs can propagate them; (c) the engine gains `WithBaselineUsage(*UsageSummary)`, which folds an external baseline into the child's `checkBudgetAfterEmit` snapshot, so child guards now evaluate `parent-consumed + child-trace` against the limits, matching the operator's intent; (d) `SubgraphHandler.Execute` wires its child engine with `WithBudgetGuard` + `WithBaselineUsage` from the ctx, and returns `Outcome.ChildUsage = result.Usage` regardless of child outcome. A child-side `OutcomeBudgetExceeded` is propagated to the parent as `OutcomeSuccess` with child usage attached so the parent's own guard fires on the next between-node check (returning `OutcomeFail` here would trip the strict-failure-edges rule before the budget check could run). Four regression tests pin the three enforcement paths (parent-level rollup, late parent-halt after subgraph overspends, mid-subgraph child halt via baseline) and a two-level-nested case. Not yet addressed: mid-stream enforcement inside a single `Prompt()` call (the guard still fires only between nodes); and `manager_loop` handler has the same shape and likely needs the same treatment (filing as a follow-up).
- CLAUDE.md
`questions_key` default matches code (closes #163). CLAUDE.md § Interview mode now accurately states `questions_key` defaults to `interview_questions` with `last_response` as a read-time fallback inside `resolveAgentOutput`. Previously claimed `last_response` as the primary default, which contradicted `resolveInterviewKeys` in `pipeline/handlers/human.go`. The drift-note block in `docs/architecture/handlers/human.md` flagging this mismatch is removed.
- "Escalation" terminology reconciled across docs (closes #166). CLAUDE.md § Claude Code backend no longer lists
`escalate` as a pipeline outcome (actual outcomes: `success`, `fail`, `retry`, plus engine-level `budget_exceeded`) and cross-links to `docs/architecture/engine.md#escalate` for the routing-convention framing. The outcome table in `context-flow.md` is updated to match. This completes the audit started in `engine.md:370` which already had the canonical "not a distinct outcome status" framing.
- `steer_context` keys with `:` rejected at adapter time (closes #171). Dippin-lang's block-form formatter writes entries as `key: value`, so a colon in a `steer_context` key breaks `.dip → IR → .dip` round-trip; the upstream parser drops such keys with a diagnostic. `flattenSteerContext` in `pipeline/dippin_adapter.go` now returns `ErrInvalidSteerContextKey` so authors fail loudly at graph-build time instead of silently losing keys downstream.
- `manager_loop` nodes with nil `ir.ManagerLoopConfig` fail at graph-build time (closes #174). `convertNode` previously let a nil Config flow through `extractNodeAttrs` as a no-op, producing a graph node without `subgraph_ref` that only surfaced at Execute-time as a vague "subgraph not found" error. Returns `ErrMissingManagerLoopCfg` instead. Scoped to `manager_loop` only; same pattern may extend to other kinds in follow-ups.
- Adapter rejects Parsed-only conditions that format to parenthesized expressions. The pipeline edge evaluator tokenizes on plain
`strings.Split("||")` / `strings.Split("&&")` and does not support parens: `a || (b && c)` silently mis-evaluates as unknown variables with empty-string results. `convertEdge` now returns `ErrParenthesizedParsedCondition` at adapter time so authors get a hard error up front; workaround is to populate `Condition.Raw` with a flat form (`a=1 || b=2 || c=3`) or simplify the Parsed tree to not emit parens.
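To see why a split-based tokenizer cannot handle parentheses, consider this small illustration. It is not tracker's evaluator: `splitClauses` and `hasParens` are hypothetical helpers that only mimic the "split on `||`, then on `&&`" strategy the entry describes.

```go
package main

import (
	"fmt"
	"strings"
)

// splitClauses mimics a flat tokenizer: OR groups first, then AND clauses.
// It has no notion of grouping, so parentheses end up inside clause text.
func splitClauses(cond string) [][]string {
	var groups [][]string
	for _, orPart := range strings.Split(cond, "||") {
		var ands []string
		for _, andPart := range strings.Split(orPart, "&&") {
			ands = append(ands, strings.TrimSpace(andPart))
		}
		groups = append(groups, ands)
	}
	return groups
}

// hasParens is the kind of cheap up-front check that lets a caller reject
// such input with a hard error instead of evaluating garbage.
func hasParens(cond string) bool {
	return strings.ContainsAny(cond, "()")
}

func main() {
	cond := "a=1 || (b=2 && c=3)"
	fmt.Printf("%q\n", splitClauses(cond)) // clauses "(b=2" and "c=3)" carry stray parens
	fmt.Println(hasParens(cond))
}
```

The second group contains the clauses `(b=2` and `c=3)`, so a lookup for a variable literally named `(b` finds nothing, which matches the "unknown variables with empty-string results" symptom.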
- `formatManagerLoopConditionExpr` now emits `&&` / `||` instead of English `and` / `or` (PR #170 round-2 review; closes part of #172). The formatter is called when an `ir.Condition` has only `Parsed` populated (Raw empty), producing the text that flows into `pipeline.EvaluateCondition`. The evaluator only recognizes Go-style boolean operators, so a Parsed-only fallback was silently mis-evaluated as a single opaque clause. Programmatically-built IR workflows that didn't populate `Raw` are now correctly evaluated. `CondNot` continues to emit `not` (the evaluator's native negation). New test `TestFormatManagerLoopCondition_EvaluatorCompatibility` pins the formatter→evaluator round-trip for `CondAnd`, `CondOr`, and `CondNot`.
- `managerAttr` uses comma-ok lookup so an explicit empty string on the unprefixed key wins over a non-empty legacy `manager.*` value (closes #173). The previous zero-value check (`if v := attrs[key]; v != ""`) silently fell through to the legacy prefix, letting authors accidentally resurrect values they thought they had cleared. New test `TestManagerAttr_EmptyStringPrecedence` pins all four combinations (explicit empty, missing, legacy-only, unprefixed-wins).
- `parseManagerLoopConfig` distinguishes "empty" from "invalid" `steer_context` (PR #170 round-2 review). When `steer_condition` is set and `steer_context` parses to zero entries, the error now reports "steer_context %q is invalid" with the raw value if it was non-empty, and "steer_context is empty — nothing to inject" only when truly unset.
- `tool_commands_allow` graph attribute is now wired into the tool handler allowlist (closes #164). CLAUDE.md documented this path ("`--tool-allowlist` CLI flag or `tool_commands_allow` graph attr"), but the graph-attr side was never plumbed. `registerToolHandler` now reads `graph.Attrs["tool_commands_allow"]` (comma-separated glob patterns, whitespace tolerant), unions it with the CLI-supplied `--tool-allowlist` patterns, and passes the combined list to `NewToolHandlerWithConfig`.
Authors can set the attr via DOT (`graph [tool_commands_allow="git *,make *"]`) or programmatically on `Graph.Attrs`; denylist-wins invariant is preserved (a graph attr of `*` does NOT unblock `eval` or `curl | sh`). Dippin-lang IR does not yet expose this field: `.dip` authors must use DOT or the library API until upstream ships `ir.WorkflowDefaults.ToolCommandsAllow`.
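A rough sketch of the "parse a comma-separated attr, then union with CLI patterns" step. `parseCommaList` and `unionDedup` are hypothetical names chosen for this example, and the trimming and empty-entry handling are assumptions about what "whitespace tolerant" means.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCommaList splits a comma-separated attribute value, trimming
// whitespace around each pattern and dropping empty entries.
func parseCommaList(raw string) []string {
	var out []string
	for _, p := range strings.Split(raw, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// unionDedup concatenates pattern lists, keeping the first occurrence of each.
func unionDedup(lists ...[]string) []string {
	seen := map[string]bool{}
	var out []string
	for _, l := range lists {
		for _, p := range l {
			if !seen[p] {
				seen[p] = true
				out = append(out, p)
			}
		}
	}
	return out
}

func main() {
	graphAttr := "git *, make * ,"            // value of a tool_commands_allow attr
	cliFlags := []string{"make *", "go test *"} // patterns from repeated CLI flags
	fmt.Printf("%q\n", unionDedup(cliFlags, parseCommaList(graphAttr)))
}
```

Sharing one parser between the allowlist and denylist-add paths, as a later entry does with `parseGraphCommaList`, keeps the trimming rules from drifting apart.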
- `ir.NodeManagerLoop` adapter support + dippin-lang v0.22.0 bump (closes #162). `.dip` authors can now declare `stack.manager_loop` supervisors directly via the new IR kind. `pipeline/dippin_adapter.go` maps `ir.NodeManagerLoop` → `shape=house` → `handler=stack.manager_loop` and flattens `ir.ManagerLoopConfig` into the six unprefixed DOT attrs the handler consumes: `subgraph_ref`, `poll_interval`, `max_cycles`, `stop_condition`, `steer_condition`, `steer_context`. `steer_context` uses canonical sorted `k=v,k=v` with percent-encoding for the three reserved chars (`,` → `%2C`, `=` → `%3D`, `%` → `%25`), mirroring dippin-lang v0.22.0 `export.flattenSteerContext` exactly so DOT round-trips (adapter ↔ dippin-lang migrator) stay lossless. When a manager_loop is the workflow's Start or Exit, `ensureStartExitNodes` overrides the shape to `Mdiamond` / `Msquare` but the handler (`stack.manager_loop`) and flat attrs are preserved so the supervisor still executes. The `ManagerLoopHandler` now accepts both the unprefixed v0.22.0 contract names and the legacy `manager.*` prefixed variants for backward compatibility; unprefixed wins when both are set. `parseSteerContext` percent-decodes reserved chars so lossless round-trips complete through the handler. Semantic note: `PollInterval == 0` and `MaxCycles == 0` in the IR degrade to tracker's handler defaults (45s / 1000) rather than the IR-documented "event-driven" / "unbounded" modes; tracker has no such modes today. Partial steering configs (`steer_condition` without `steer_context`, or vice versa) are now rejected at parse time; previously one half of the pair would silently render the supervisor inert.
- `--bypass-denylist`, `--tool-allowlist`, `--max-output-limit` CLI flags for tool command sandboxing.
The underlying denylist, allowlist, and per-stream output ceiling were already enforced by `pipeline/handlers/tool_safety.go` and `ToolHandlerConfig`, but only via node-attr and library APIs; the CLI paths were missing. `--bypass-denylist` (bool, default `false`) disables the built-in denylist and prints a loud stderr warning on startup; use only in sandboxed environments where dangerous patterns (eval, pipe-to-shell, curl|sh) are intentional. `--tool-allowlist <pattern>` is repeatable and accepts comma-separated glob patterns; every tool command statement must match at least one allowlist entry when the flag is set. Allowlist entries are additive with any `tool_commands_allow` graph attr and never override the denylist. `--max-output-limit <bytes>` sets the hard ceiling (default 10MB) applied to per-node `output_limit:` attrs. Node-attr and graph-attr paths remain unchanged; these flags are additive CLI surface.
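The `steer_context` flattening described in the `ir.NodeManagerLoop` entry above (sorted `k=v` pairs with `,`, `=`, and `%` percent-encoded) can be sketched as a round-trip. `flattenSteer` and `parseSteer` are hypothetical names; the real `flattenSteerContext` and `parseSteerContext` may differ, for example in how they report malformed input.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Replacers work in a single left-to-right pass, so "%" is never re-escaped
// on encode and "%25" never cascades into a second decode.
var (
	steerEscaper   = strings.NewReplacer("%", "%25", ",", "%2C", "=", "%3D")
	steerUnescaper = strings.NewReplacer("%2C", ",", "%3D", "=", "%25", "%")
)

// flattenSteer renders a map as canonical sorted "k=v,k=v" text.
func flattenSteer(m map[string]string) string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, steerEscaper.Replace(k)+"="+steerEscaper.Replace(m[k]))
	}
	return strings.Join(parts, ",")
}

// parseSteer reverses flattenSteer. Entries without "=" are skipped here;
// that leniency is an assumption of this sketch.
func parseSteer(s string) map[string]string {
	out := map[string]string{}
	if s == "" {
		return out
	}
	for _, pair := range strings.Split(s, ",") {
		k, v, ok := strings.Cut(pair, "=")
		if !ok {
			continue
		}
		out[steerUnescaper.Replace(k)] = steerUnescaper.Replace(v)
	}
	return out
}

func main() {
	flat := flattenSteer(map[string]string{"pct": "100%", "hint": "a,b=c"})
	fmt.Println(flat)             // hint=a%2Cb%3Dc,pct=100%25
	fmt.Println(parseSteer(flat)) // map[hint:a,b=c pct:100%]
}
```

Sorting the keys makes the flattened string independent of Go's map iteration order, which is what keeps the round-trip stable across runs.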
- `tracker-swebench analyze <results-dir>` subcommand (closes #141). Bulk-triage tool for completed SWE-bench runs: reads `predictions.jsonl`, `logs/*.log`, and the optional empty-patch diagnostic files from PR #150, then emits a structured report covering (1) overall resolved/unresolved/empty/error counts with percentages, (2) per-repo breakdown matching the #116 baseline table, (3) top-10 empty-patch instances with termination reason and final-message snippets from #139 diagnostics, (4) top-10 longest unresolved instances sorted by turns and elapsed time, and (5) error class distribution consuming the setup/patch/harness split from #140. Auto-detects a SWE-bench evaluator JSON report (`resolved_ids` field) to distinguish resolved from unresolved; gracefully degrades to "patched but unverified" classification when no evaluator report is present. Gracefully degrades on missing empty-patch diagnostics with a one-line note pointing to the PR #150 runtime. `--json` emits the structured `AnalyzeReport` for downstream tools. Pure artifact analysis; does not require access to the SWE-bench dataset.
- Typed
`NodeConfig` accessors on `*pipeline.Node` (closes #142, #143, #144; partial #19). New methods `AgentConfig(graphAttrs)`, `ToolConfig()`, `HumanConfig()`, `ParallelConfig()`, and `RetryConfig(graphAttrs)` return typed structs parsed from `Node.Attrs` with the graph-default-then-node-override merge centralized. Numeric parse failures are lenient (zero-value, no panic) to preserve existing permissive behavior. Three-state booleans (e.g. `ReflectOnError`, `VerifyAfterEdit`, `PlanBeforeExecute`, `CacheToolResults`) expose companion `*Set` flags so callers can distinguish "explicitly configured" from "absent".
- Codergen handler now consumes
`AgentNodeConfig` instead of calling 8 separate `apply*` methods that each re-parsed `Node.Attrs` directly. Graph→node override resolution happens once in the accessor; `buildConfig` just copies typed fields into `agent.SessionConfig`. Replaces `applyModelProvider`, `applySessionLimits`, `applyReasoningEffort`, `applyResponseFormat`, `applyCacheAndCompaction`, `applyReflectOnError`, `applyVerifyConfig`, and `applyPlanningConfig` with a single typed consumer. No behavior change; existing codergen tests pass unchanged.
- `Engine.maxRetries` uses the typed `RetryConfig` accessor instead of duplicating `strconv.Atoi` over `node.Attrs["max_retries"]` → `graph.Attrs["default_max_retry"]`. The fallback default (3) is unchanged.
- Human, tool, and parallel handlers now consume typed configs (closes #145; finishes #19).
`human.go` (12 → 3 `node.Attrs[...]` reads), `tool.go` (4 → 2), and `parallel.go` (5 → 0) route through `HumanConfig()`, `ToolConfig()`, and `ParallelConfig()` accessors. The remaining direct reads are semantically distinct: `tool.parseTimeout` / `parseOutputLimit` return errors on malformed values that the silent-default accessor can't express, and the three `human.go` holdouts (`default` vs `default_choice` disambiguation) each have an inline comment explaining why the typed accessor's unified `DefaultChoice` can't be used in that specific call site. `parseBranchOverrides` in `parallel.go` still receives the full `Attrs` map by design because it scans for a `branch.N.*` key prefix rather than specific fields.
- `HumanNodeConfig.DefaultChoice` now resolves `default_choice` first, then falls back to `default`; this centralizes a two-key lookup that was duplicated across the human handler.
- `ToolNodeConfig` gains `Timeout time.Duration`; `ParallelNodeConfig` gains `JoinID string`, `MaxConcurrency int`, `BranchTimeout time.Duration` so the remaining tool and parallel reads can go through the typed accessor.
- Tool node
`timeout` attribute now errors when the tool node executes if set to a zero or negative duration (closes #151). This is a behavior change. Previously such values reached `context.WithTimeout` and caused immediate cancellation with a confusing "command timed out" error; `ToolHandler.parseTimeout` now returns `node %q has non-positive timeout %q: must be > 0` instead. Validation runs inside `ToolHandler.Execute` (before the command is dispatched), not at workflow load time. Pipelines that wrote `timeout: "0"` (unlikely but possible) will now error when the run reaches that tool node; configure a positive duration or omit the attr to use the handler default.
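A minimal sketch of this kind of validation, assuming the attribute is parsed with `time.ParseDuration`. The free function `parseTimeout` and the wording of the malformed-value error are assumptions; only the non-positive error format is taken from the entry.

```go
package main

import (
	"fmt"
	"time"
)

// parseTimeout parses a node's timeout attribute and rejects values that
// would make context.WithTimeout expire immediately.
func parseTimeout(nodeID, raw string) (time.Duration, error) {
	d, err := time.ParseDuration(raw)
	if err != nil {
		// Error text for malformed input is an assumption of this sketch.
		return 0, fmt.Errorf("node %q has invalid timeout %q: %w", nodeID, raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("node %q has non-positive timeout %q: must be > 0", nodeID, raw)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"30s", "0", "-5s"} {
		d, err := parseTimeout("build", raw)
		fmt.Println(d, err)
	}
}
```

Note that `time.ParseDuration("0")` succeeds and returns a zero duration, which is why the explicit `d <= 0` check is needed rather than relying on the parse error.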
- Declarative
`writes:` / `reads:` unified structured output (closes #85). Agent, human, and tool nodes can now declare the keys they produce and consume. Declared writes are extracted from handler output into the pipeline context and validated; missing required fields fail the node. `reads:` pins fidelity for the keys a node consumes so downstream nodes see consistent data. New helpers: `pipeline/context_writes.go`, `pipeline/handlers/declared_writes.go`. Replaces node-type-specific workarounds previously needed to thread typed outputs through.
- `tracker.SimulateGraph(ctx, graph)` (closes #108): graph-in variant of `Simulate` that accepts a pre-parsed `*pipeline.Graph` and returns a `SimulateReport`. Lets callers that already parsed the pipeline (CLI flows that also run `ValidateSource`, tooling that builds a graph programmatically) avoid a second parse. `Simulate(ctx, source)` is now a thin wrapper over `parsePipelineSource` + `SimulateGraph`; signature and behavior unchanged.
- Repository localization pre-processing (agent, closes #95): optional pre-processing phase that scans the working directory for files relevant to the task prompt and injects a structured context block before the first LLM turn. Pure text analysis + filesystem scan, with zero LLM calls. Opt-in via
`SessionConfig.Localize` (default `false`). Extracts file paths, camelCase/snake_case identifiers, quoted phrases, and error-line excerpts from the prompt; capped at 10 files / ~2KB injected context with 5-line snippets. Reduces wasted turns on `glob` / `grep` for repository-level tasks.
- Agent episodic memory across retries/resumes (closes #96): native codergen sessions now record a structured per-tool episode log (
`tool, args, success/fail, summary`), publish `episode_summary` and rolling `episode_summaries` context keys at session end, and inject prior summaries into subsequent retry/resume attempts so the model can avoid repeating failed approaches.
- Plan-before-execute phase (agent, closes #97): optional single planning LLM call before the main execution loop. Opt-in via
`SessionConfig.PlanBeforeExecute` (default `false`) or codergen node attrs (`plan_before_execute: "true"` or `plan: "true"`). The generated plan is retained in conversation context for subsequent execution turns.
- Library API godoc, stability policy, and runnable examples (closes #110). Package-level
`doc.go` now documents pre-1.0 API stability expectations; README gains a stability callout; `tracker_examples_test.go` ships runnable `ExampleDiagnose` / `ExampleAudit` / `ExampleDoctor` examples that double as godoc content.
- Test coverage close-out for
`Diagnose` / `Audit` / `Doctor` (closes #107). Covers `DiagnoseMostRecent`, `MostRecentRunID`, `ResolveRunDir` no-match path, corrupted `status.json` warning, `Audit` error paths (missing / malformed / empty run dir), `Doctor` warnings sentinel, and `checkArtifactDirs` non-ENOENT stat errors.
- `tracker simulate` output is now deterministic (closes #111). Graph-level attributes in the simulate header are now sorted alphabetically; orphan/unreachable nodes in the node table are appended in sorted order. Previously both depended on Go's random map iteration order, producing different diffs on each run.
- `MostRecentRunID` no longer writes to `os.Stderr` from library paths (#107 follow-up). Parse warnings now route through `DiagnoseConfig.LogWriter` so library callers aren't surprised by stray stderr.
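The determinism fix in the `tracker simulate` entry above relies on a common Go idiom: collect map keys, sort them, then iterate. The sketch below assumes a simple `key=value` header format; `renderAttrs` is a hypothetical name, not the CLI's renderer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderAttrs prints graph attributes in alphabetical key order so the
// output is identical on every run, regardless of map iteration order.
func renderAttrs(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s\n", k, attrs[k])
	}
	return b.String()
}

func main() {
	fmt.Print(renderAttrs(map[string]string{
		"llm_model":         "example-model", // placeholder value
		"default_max_retry": "3",
	}))
}
```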
- `tracker simulate` now parses the pipeline source exactly once (closes #108). Previously `runSimulateCmd` parsed twice: once for the validation-warnings section, again inside `tracker.Simulate`. That risked a TOCTOU mismatch between the two views, duplicated dippin-lang parser side effects (lint warnings printed twice), and burned extra CPU on large `.dip` files. The CLI now reads the source once, calls `tracker.ValidateSource` for `{Graph, Errors, Warnings}`, and hands the same graph to `tracker.SimulateGraph`. CLI stdout is byte-identical to before; only the duplicated parser-logging lines are gone.
- Cost accounting and reporting are now consistent across runtime and CLI summaries (closes #128):
  - CLI run summaries now read token/cost totals from `EngineResult.Usage` (trace aggregate) instead of `TokenTracker.TotalUsage().EstimatedCost`, so cost is shown correctly.
  - Repair turns now apply the same `EstimateCost` compensation path used by normal turns when providers omit `EstimatedCost`.
  - OpenAI SSE `response.completed` now preserves `ReasoningTokens` in finish usage events.
  - Gemini adapter now falls back to the requested model when `modelVersion` is absent in API responses.
  - Trace usage aggregation now attributes missing providers to `unknown` instead of dropping those sessions from per-provider totals.
  - External backend usage tracking now records sessions with non-zero input/output tokens even when `TotalTokens` is zero.
- `stack.manager_loop` handler: async child-pipeline supervision (PR #126, Attractor spec 4.11). A supervisor node that launches a child pipeline in a goroutine, polls at a configurable interval, and optionally steers the running child by injecting context mid-execution. New attributes: `subgraph_ref`, `manager.poll_interval`, `manager.max_cycles`, `manager.stop_condition`, `manager.steer_condition`, `manager.steer_context`. Exposes `stack.child.status` / `stack.child.cycles` / `stack.child.exit_status` to parent context. Emits `EventStageStarted` on launch, `EventManagerCycleTick` per poll cycle, and `EventStageCompleted` / `EventStageFailed` on terminal outcomes (success, child fail, child crash, max_cycles exceeded, cancellation, stop/steer condition invalid). Bounded `childJoinGrace` (30 s) protects against non-context-aware child handlers hanging the manager. See `docs/architecture/handlers/manager-loop.md`.
- Engine steering channel (PR #126): new
`pipeline.WithSteeringChan(chan map[string]string)` engine option. Between node executions, the engine drains the channel and merges updates into the run's `PipelineContext`. Used by `manager_loop` to inject context into running children; available to any supervisor. Non-blocking drain; nil channel is a no-op.
- `PipelineContext.MergeWithoutDirty` (PR #126): writes updates without marking keys as dirty, so externally-injected values never leak into any node's per-node scope. Used by the engine's steering drain so injected keys stay in the global/bare namespace.
- Accurate cost estimation via catalog + cache token pricing (PRs #127, #128):
`EstimateCost` now resolves prices through the model catalog (`GetModelInfo`) instead of a duplicated hardcoded map. Adds cache token pricing: cache reads at 10% of input rate, cache writes at 25%. `TokenTracker` now records the observed model per provider (`AddUsage` takes an optional model arg, normalized through the catalog to match `WrapComplete`) so per-provider cost estimates use the right rate sheet instead of a global fallback.
- Model catalog April 2026 refresh (PR #128): adds
claude-opus-4-7,gpt-5.4-mini/gpt-5.4-nano,gpt-4.1family,o3,o4-mini, GA Gemini 2.5 models, andgemini-3.1-pro-preview(replaces the shut-downgemini-3-pro-preview). Fixesclaude-opus-4-6pricing (was incorrectly $15/$75; now $5/$25). Context windows for Sonnet/Opus 4.6 bumped to 1M.claude-sonnet/claude-opusaliases now point at the latest 4.7 entries.claude-haiku-4-5,gpt-4o, andgpt-4o-miniadded (they were in the old pricing map but not the catalog). docs/architecture/handlers/manager-loop.md: user-facing documentation for the manager-loop handler — lifecycle diagram, configuration reference, context outputs, event semantics, steering contract, and tuning guidance.tracker-swebenchnow captures the active provider base-URL override inrun_meta.json(BaseURLOverride). Derived from${PROVIDER}_BASE_URLwith hyphens normalized to underscores, so--provider openai-compatmaps toOPENAI_COMPAT_BASE_URLconsistently withResolveProviderBaseURL. Useful for reproducing SWE-bench runs that routed through a Cloudflare AI Gateway or custom endpoint.
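The "non-blocking drain; nil channel is a no-op" behavior of the steering channel can be pictured with a small sketch. This is not the engine's code: `drainSteering` and the plain `map[string]string` context are hypothetical stand-ins, and only the channel element type is taken from the entry above.

```go
package main

import "fmt"

// drainSteering (hypothetical helper) empties ch without blocking and merges
// every pending update into ctx. It returns how many updates were merged.
// A nil channel is a no-op: a receive on a nil channel is never ready, so the
// default case fires immediately.
func drainSteering(ch chan map[string]string, ctx map[string]string) int {
	merged := 0
	for {
		select {
		case update, ok := <-ch:
			if !ok {
				return merged // channel closed: nothing more to drain
			}
			for k, v := range update {
				ctx[k] = v
			}
			merged++
		default:
			return merged // nothing pending right now
		}
	}
}

func main() {
	ctx := map[string]string{"goal": "ship"}
	ch := make(chan map[string]string, 2)
	ch <- map[string]string{"hint": "focus on tests"}
	ch <- map[string]string{"goal": "ship safely"}

	fmt.Println(drainSteering(ch, ctx), ctx["goal"], ctx["hint"])
	fmt.Println(drainSteering(nil, ctx))
}
```

The `default` branch is what keeps the drain from stalling the engine between node executions when no supervisor has sent anything.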
- ACP path validation rejects `..` path segments before symlink resolution (PR #126, security hardening). Previously, a symlink pointing outside the work dir plus a `..` in the target path could escape the sandbox: symlink resolution occurred before the check, and `..` in the resolved path was not filtered. `validatePathInWorkDir` now splits on both `/` and `\` so Windows paths are also protected.
- Manager loop: poll timer vs. child-completion race (PR #126 review): when `pollTimer.C` and `resultCh` are both ready, Go's `select` is nondeterministic. The timer path could trigger `max_cycles` failure even though the child had already finished. The timer case now does a non-blocking drain of `resultCh` first and dispatches to the child-result handler if the child is done.
- Manager loop: crash path always returns a non-nil error (PR #126 review). If the child goroutine delivered neither a result nor an error, the handler synthesizes `"manager_loop: child exited with no result and no error"` so callers never see `(OutcomeFail, nil)`.
- Manager loop: config validation now hard-fails on malformed values (PR #126 review). `manager.poll_interval` and `manager.max_cycles` with invalid or non-positive values now error at parse time instead of silently falling back to defaults (previously: `time.ParseDuration` error swallowed, zero/negative values ignored).
- Manager loop: `EvaluateCondition` errors surface for both `stop_condition` and `steer_condition` (PR #126 review). A malformed expression now fails the loop with a clear error plus an `EventStageFailed` emission, instead of being treated as "never match" until `max_cycles`.
- Manager loop: emit `EventStageFailed` on context cancellation and condition-parse errors (PR #126 review). Parity with other terminal failure paths (max_cycles, child fail, child crash) so the TUI surfaces every failure mode.
- Manager loop: `handleChildResult` returns `OutcomeFail` on child failure (PR #126 review). Handler-level outcome values must be from the handler set (success/fail/retry); engine-level statuses like `OutcomeBudgetExceeded` would have fallen through the outcome switch and been silently treated as success. The real child status remains available via `pctx.Set("stack.child.exit_status", ...)`.
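The poll-timer race fix described above follows a common Go pattern: when the timer case wins the `select`, check the result channel once more without blocking before counting the cycle. A minimal sketch under simplifying assumptions (`pollOnce` is a hypothetical name, the tick is a plain `struct{}` channel rather than a real timer, and the result is a string):

```go
package main

import "fmt"

// pollOnce waits for either a poll tick or the child's result. If the tick
// wins the select while a result is also ready, the inner non-blocking
// receive still picks up the result, so a finished child is never counted
// as one more poll cycle.
func pollOnce(tick <-chan struct{}, resultCh <-chan string) (result string, done bool) {
	select {
	case r := <-resultCh:
		return r, true
	case <-tick:
		select {
		case r := <-resultCh:
			return r, true
		default:
			return "", false // child still running: count this cycle
		}
	}
}

func main() {
	tick := make(chan struct{}, 1)
	resultCh := make(chan string, 1)

	// Both channels are ready, so select may pick either case.
	tick <- struct{}{}
	resultCh <- "success"

	r, done := pollOnce(tick, resultCh)
	fmt.Println(r, done) // success true, whichever case was chosen
}
```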
- Library API hardening for v1.0 (#102, #103, #104, #106, #109):
  - Typed enum-like strings for `CheckStatus` and `SuggestionKind` so consumers can switch-exhaust. Existing constants (`SuggestionRetryPattern`, etc.) retain their underlying string values.
  - `tracker.WithVersionInfo(version, commit)` functional option replaces the CLI-only `DoctorConfig.TrackerVersion` / `TrackerCommit` fields.
  - `DiagnoseConfig.LogWriter` / `AuditConfig.LogWriter`: optional `io.Writer` for non-fatal parse warnings. Nil is treated as `io.Discard` so library callers no longer see stray warnings on `os.Stderr`. The `tracker` CLI sets this to `io.Discard` for user-facing commands. `Doctor` has no warnings to suppress so it deliberately does not carry a `LogWriter` field.
  - `Doctor`, `Diagnose`, `DiagnoseMostRecent`, `Audit`, `Simulate` now accept `context.Context`, honored by provider probes and binary version lookups. `getBinaryVersion` now uses `exec.CommandContext` with a 5-second timeout, matching `getDippinVersion`.
  - Provider probe error bodies are now sanitized (API keys and bearer tokens stripped) before they land in `CheckDetail.Message`.
  - NDJSON handler closures (pipeline, agent, LLM trace) now `recover()` from panics in the underlying writer so a misbehaving sink cannot crash the caller goroutine. Panic suppression is per-`NDJSONWriter` instance (not package-level), so one misbehaving sink cannot silence unrelated writers in the same process.
  - `Diagnose` now streams `activity.jsonl` with `bufio.Scanner` instead of `os.ReadFile` → `strings.Split`, matching `LoadActivityLog` and avoiding a memory spike on large runs. Scanner errors (1 MB line-length overflow, I/O) and `ctx.Err()` now propagate out of `Diagnose` as a real error; partial reports are never returned as success, so automation with deadlines can distinguish complete from truncated analysis.
- Workflow params via `${params.*}` with CLI/library overrides (closes #81): top-level Dippin `vars` now map to graph attrs under `params.<key>`, making them available in agent prompts, tool commands, and edge conditions through `${params.key}` interpolation. Added repeatable `--param key=value` on the CLI plus `tracker.Config.Params` for library callers; overrides hard-fail on unknown keys at startup and run summaries print effective overridden params. New lint rules DIP120 (undeclared `${params.*}` reference) and DIP121 (declared but unused var).
- Per-human-gate timeout / timeout_action in `.dip` (closes #112): the dippin-lang v0.21.0 IR exposes `HumanConfig.Timeout` and `HumanConfig.TimeoutAction`; the adapter copies them into `node.Attrs["timeout"]` / `node.Attrs["timeout_action"]` where `pipeline/handlers/human.go` already consumed them. The `examples/human_gate_test_suite.dip` Makefile lint skip is removed.
- Workflow-level budget ceilings from `.dip` (closes #67): dippin-lang v0.21.0 adds `WorkflowDefaults.MaxTotalTokens`, `WorkflowDefaults.MaxCostCents`, and `WorkflowDefaults.MaxWallTime`. The adapter now maps them to `graph.Attrs["max_total_tokens"]` / `["max_cost_cents"]` / `["max_wall_time"]`, and `tracker.ResolveBudgetLimits` uses them as a fallback when `Config.Budget` and the matching `--max-*` CLI flags are zero. Explicit config values still win. Wired through both the library engine builder and the CLI's console/TUI engine builders.
- TUI pre-populates subgraph children in the sidebar (closes #118): subgraph reference nodes previously appeared as opaque single rows until child `stage_started` events arrived. `buildNodeList` now accepts the `subgraphs` map and recursively flattens child graphs with prefixed IDs (`Parent/Child/...`), preserving user-set labels and parallel/fan-in flags. Lazy insertion remains as a fallback with a cycle guard for self-referential subgraph maps.
- Agent quality-of-life improvements from SWE-bench work:
  - Turn-budget checkpoints: optional guidance messages injected at configurable fractions of the turn budget (50%, 75%) to reduce thrashing on hard instances.
  - Two-phase verify-after-edit: focused test first, broad regression test second, with a configurable repair retry budget. Models the pattern top SWE-bench agents use.
  - Tool polish: `grep` gets context lines, noise-dir filtering, and truncated-match count; `read` gets `offset` / `limit` for paged access; `edit` shows nearby context on a miss.
  - Process safety: tool subprocess groups are killed after the shell command completes, preventing orphan zombies on timeouts.
  - SWE-bench harness: agent event logging + transcript capture; checkpoint and verify config threaded into `agent-runner`.
  - Config defaults promoted: `DefaultConfig()` now uses `MaxTokens: 16384`, auto-continue on truncation, and `LoopDetectionThreshold: 4`, values measured effective in SWE-bench Lite (59.0% → 70.3% baseline shift).
  - New CLI flag `--artifact-dir` overrides the node state directory.
- dippin-lang dependency bumped from `v0.20.0` → `v0.21.0`. Picks up three upstream fixes tracked as dippin-lang#18/#20/#21 (PRs #22/#23) plus release issue #25. `PinnedDippinVersion` constant updated to match. Closes tracker#75 transitively: dippin lint now recognizes `${ctx.node.<id>.*}` scoped reads as valid without tracker-side changes.
- BREAKING (library):
  - `tracker.Doctor(cfg)` → `tracker.Doctor(ctx, cfg, opts...)`.
  - `tracker.Diagnose(runDir)` → `tracker.Diagnose(ctx, runDir, opts...)`.
  - `tracker.DiagnoseMostRecent(workdir)` → `tracker.DiagnoseMostRecent(ctx, workdir, opts...)`.
  - `tracker.Audit(runDir)` → `tracker.Audit(ctx, runDir)`. (No config struct: Audit emits no suppressible warnings. Use `ListRuns` + `AuditConfig{LogWriter}` for bulk enumeration.)
  - `tracker.Simulate(source)` → `tracker.Simulate(ctx, source)`.
  - `tracker.ListRuns(workdir)` now accepts optional `...AuditConfig`.
  - `tracker.NDJSONEvent` → `tracker.StreamEvent`. Wire-format JSON tags unchanged.
  - `NDJSONWriter.Write` now returns `error` so callers can detect a broken stream. First failure is still logged to `os.Stderr` once (unchanged behavior); subsequent failures are surfaced via the return value.
  - `DoctorConfig.TrackerVersion` and `DoctorConfig.TrackerCommit` removed; use `tracker.WithVersionInfo(version, commit)` instead.
  - `CheckResult.Status` and `CheckDetail.Status` are now typed as `tracker.CheckStatus` (underlying string). Untyped string literal comparisons (`status == "ok"`) keep working.
  - `Suggestion.Kind` is now typed as `tracker.SuggestionKind` (underlying string).
- `tracker diagnose` suggestion order is now deterministic (alphabetical by node ID). Previously suggestions printed in Go map-iteration order, which varied between runs.
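The usual way to get stable output from a Go map is to collect and sort the keys before printing. A minimal sketch, assuming suggestions are keyed by node ID (`orderedNodeIDs` and the map shape are hypothetical, not the CLI's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// orderedNodeIDs returns the keys of a per-node suggestion map in
// alphabetical order, so output no longer depends on map iteration order.
func orderedNodeIDs(byNode map[string]string) []string {
	ids := make([]string, 0, len(byNode))
	for id := range byNode {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	suggestions := map[string]string{
		"Verify": "go_test",
		"Build":  "retry_pattern",
		"Plan":   "no_output",
	}
	// Prints Build, Plan, Verify on every run.
	for _, id := range orderedNodeIDs(suggestions) {
		fmt.Printf("%s: %s\n", id, suggestions[id])
	}
}
```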
- OpenAI Responses API: `function_call_output` and `function_call` items now always serialize required fields (closes #114). Previously the shared `openaiInput` struct used `omitempty` on every field, so a tool returning an empty-string result produced `{"type":"function_call_output","call_id":"..."}` with no `output` field, and a no-argument tool call produced `function_call` with no `arguments`. OpenAI's endpoint tolerated this, but OpenRouter's strict Zod validator rejected the requests with `invalid_prompt` / `invalid_union` errors, symptomatic on GLM, Qwen, and Kimi via OpenRouter. Fixed by replacing the `omitempty`-tagged single struct with a `MarshalJSON` method that emits only fields valid per item type, with required fields always present. Reported by @Nopik.
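The shape of that fix can be illustrated with a per-type `MarshalJSON`. This is a sketch only: `inputItem` and its fields are hypothetical stand-ins for the adapter's struct, and the item types shown are limited to the two named in the entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inputItem is a hypothetical stand-in for the adapter's input struct.
type inputItem struct {
	Type      string
	CallID    string
	Name      string
	Arguments string
	Output    string
}

// MarshalJSON emits only the fields valid for each item type. Required
// fields carry no omitempty tag, so an empty string still serializes.
func (it inputItem) MarshalJSON() ([]byte, error) {
	switch it.Type {
	case "function_call_output":
		return json.Marshal(struct {
			Type   string `json:"type"`
			CallID string `json:"call_id"`
			Output string `json:"output"`
		}{it.Type, it.CallID, it.Output})
	case "function_call":
		return json.Marshal(struct {
			Type      string `json:"type"`
			CallID    string `json:"call_id"`
			Name      string `json:"name"`
			Arguments string `json:"arguments"`
		}{it.Type, it.CallID, it.Name, it.Arguments})
	default:
		return nil, fmt.Errorf("unsupported item type %q", it.Type)
	}
}

func main() {
	// A tool that returned an empty string still produces an "output" field.
	b, err := json.Marshal(inputItem{Type: "function_call_output", CallID: "c1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Dispatching on the item type keeps each wire shape explicit, which is easier to check against a strict validator than one struct with `omitempty` on every field.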
- CLI↔library feature parity: Phase 1 (NDJSON) + Phase 2 (#76, PR #101). Four CLI commands (`diagnose`, `audit`, `doctor`, `simulate`) and the NDJSON event writer are now public library APIs. Library consumers can reuse the CLI's behavior without shelling to a binary and parsing printed output.
  - `tracker.NewNDJSONWriter(io.Writer)`: public NDJSON event writer producing the same wire format as `tracker --json`. Factory methods `PipelineHandler`, `AgentHandler`, `TraceObserver` return handlers that plug into `Config.EventHandler`, `Config.AgentEvents`, and the LLM trace hook. Closes Phase 1.
  - `tracker.Diagnose(runDir)` / `tracker.DiagnoseMostRecent(workDir)`: structured `*DiagnoseReport` with node failures, budget halt, and typed suggestions (`Kind: "retry_pattern" | "escalate_limit" | "no_output" | "shell_command" | "go_test" | "suspicious_timing" | "budget"`).
  - `tracker.Audit(runDir)`: structured `*AuditReport` with timeline, retries, errors, and recommendations.
  - `tracker.ListRuns(workDir)`: sorted `[]RunSummary` for enumerating past runs (newest first).
  - `tracker.Doctor(cfg)`: structured `*DoctorReport` for preflight health checks. `ProbeProviders` defaults to false; set true to make real API calls for auth verification. `CheckDetail.Status` has four values: `"ok"`, `"warn"`, `"error"`, and `"hint"` (informational sub-items such as optional providers not configured).
  - `tracker.PinnedDippinVersion`: exported constant exposing the dippin-lang version pinned in `go.mod`.
  - `tracker.Simulate(source)`: structured `*SimulateReport` with nodes, edges, execution plan, graph attributes, and unreachable-node list.
  - `tracker.ResolveRunDir(workDir, runID)` / `tracker.MostRecentRunID(workDir)`: exposed run-directory resolution helpers.
  - `tracker.ActivityEntry` / `tracker.LoadActivityLog(runDir)` / `tracker.ParseActivityLine(line)` / `tracker.SortActivityByTime(entries)`: shared activity.jsonl parsing used by CLI and library.
- SWE-bench harness (`cmd/tracker-swebench`): a new orchestrator binary that evaluates tracker's agent against the SWE-bench dataset. Includes a Dockerfile and build script for the base image, container lifecycle management with SIGTERM handling and orphan cleanup, dataset JSONL parsing, results writer with resumability, container resource limits (CPU/memory) and `--platform` pinning, secure `--env-file` for API keys (replacing `-e` flags), instance-ID validation + scoped container names, integration test for the dataset-to-results pipeline, and an in-container `agent-runner` binary that captures all changes via `git diff` (including new files).
- `WithExtraHeaders` option for Anthropic and OpenAI adapters: injects custom HTTP headers (e.g., `cf-aig-token`) for gateway auth. Used by the swebench harness to forward `CF_AIG_TOKEN` from the host through the container to the agent-runner.
- `classifyStatus` now correctly returns `"fail"` for budget-halted runs (runs with a `budget_exceeded` activity event were previously mis-classified as `"success"`).
- `NDJSONWriter.AgentHandler` now preserves the original `agent.Event.Timestamp` instead of re-stamping with `time.Now()`, preventing event reordering in the NDJSON stream.
- `simBFSNodeOrder` now sorts orphan nodes by ID before appending, making `SimulateReport.Nodes` ordering deterministic.
- `ResolveRunDir` now always returns an absolute path via `filepath.Abs`, matching its documented contract.
- `MostRecentRunID` no longer writes to `os.Stderr` from a library function; invalid checkpoint directories are silently skipped.
- `checkWorkdirLib` now correctly propagates `warn` details to the section-level `Status` field.
- `checkProvidersLib` now propagates individual provider `error` details to the section-level `Status` (was always `"ok"` when any provider was configured).
- `getDippinVersion` now uses `exec.CommandContext` with a 5-second timeout to prevent hangs on unresponsive dippin binaries.
- `PinnedDippinVersion` constant updated to `v0.20.0` to match the `go.mod` requirement.
- `checkPipelineFileLib` no longer warns when the pipeline file has a `.dot` extension (both `.dip` and `.dot` are valid input formats).
- Fixed ineffectual assignment to `suffix` in `cmd/tracker/doctor.go` `maybeFixGitignore`.
- `checkDiskSpaceLib` moved to platform-specific files (`tracker_doctor_unix.go` / `tracker_doctor_windows.go`) to avoid a Windows build failure from `syscall.Statfs`.
- `enrichFromEntryNF` and `updateFailureTimingNF` now guard against zero timestamps to prevent incorrect duration calculations in `DiagnoseReport`.
- `claude-sonnet-4-6` added to the LLM model catalog: the model was in `pricing.go` but missing from `catalog.go`, causing `GetModelInfo` to return nil and cost reporting to show `$0.00` for the swebench harness default model.
- ACP backend: `validatePathInWorkDir` now resolves symlinks on both `path` and `workDir`. On macOS `/var` is a symlink to `/private/var`, which was causing path validation to reject files inside `t.TempDir()`.
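Taken together with the later `..` hardening recorded further up in this changelog, the validation order can be sketched as: reject `..` segments first, then resolve symlinks on both sides, then compare. The code below is an illustration under assumptions, not the ACP backend's implementation; `hasDotDotSegment` and `checkInWorkDir` are hypothetical names, and `filepath.EvalSymlinks` requires the target to exist.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// hasDotDotSegment reports whether p contains a ".." segment when split on
// both "/" and "\", so Windows-style paths are caught on any platform.
func hasDotDotSegment(p string) bool {
	isSep := func(r rune) bool { return r == '/' || r == '\\' }
	for _, seg := range strings.FieldsFunc(p, isSep) {
		if seg == ".." {
			return true
		}
	}
	return false
}

// checkInWorkDir (hypothetical) rejects ".." before touching the filesystem,
// then resolves symlinks on BOTH the work dir and the target so that aliases
// such as /var -> /private/var compare equal.
func checkInWorkDir(workDir, rel string) error {
	if hasDotDotSegment(rel) {
		return errors.New("path contains a .. segment")
	}
	realWork, err := filepath.EvalSymlinks(workDir)
	if err != nil {
		return err
	}
	realPath, err := filepath.EvalSymlinks(filepath.Join(workDir, rel))
	if err != nil {
		return err
	}
	relToWork, err := filepath.Rel(realWork, realPath)
	if err != nil {
		return err
	}
	if relToWork == ".." || strings.HasPrefix(relToWork, ".."+string(filepath.Separator)) {
		return errors.New("path resolves outside the work dir")
	}
	return nil
}

func main() {
	fmt.Println(hasDotDotSegment(`src\..\secret`))    // true
	fmt.Println(hasDotDotSegment("src/a..b/file.go")) // false
	fmt.Println(checkInWorkDir(".", "../etc/passwd"))
}
```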
- `cmd/tracker/diagnose.go`, `audit.go`, `doctor.go`, `simulate.go` are now thin printers over the new library APIs. CLI stdout and `--json` wire format are byte-identical. Closes Phase 2 of #76.
- `dippin-lang` dependency bumped from `v0.19.1` → `v0.20.0`. CI installs the matching CLI version (was stale at `v0.10.0`). `examples/human_gate_test_suite.dip` renamed `default_choice:` → `default:` to match the IR field. The file is temporarily skipped from `make lint` because v0.20.0's stricter parser rejects `timeout:` / `timeout_action:` on human nodes; tracker supports those attrs at the node level but dippin-lang's `HumanConfig` IR doesn't expose them yet. Tracked upstream at dippin-lang#18.
- Structured reflection prompt on tool failure (issue #93): when a tool call returns an error, the agent session now automatically injects a user-role reflection message before the next LLM turn. The prompt asks the model to identify what went wrong, what assumption was incorrect, and what minimal change will fix it, matching the pattern used by top SWE-bench agents (~10-15% recovery improvement). The feature is enabled by default (`ReflectOnError: true` in `DefaultConfig()`) and capped at three consecutive reflection turns to prevent infinite loops; the counter resets after any clean (no-error) turn. Pipeline authors can opt individual nodes out via `reflect_on_error: false` in their `.dip` file.
- Verify-after-edit loop with auto-test (closes #94): agent sessions can now automatically run tests after any turn that includes file-edit tool calls (`write`, `edit`, `apply_patch`, `notebook_edit`). Modelled on top SWE-bench agent behaviour (~15-20% improvement on benchmark), this transparent inner loop catches regressions before the LLM moves on.
  - `SessionConfig.VerifyAfterEdit bool`: opt-in flag (default: false).
  - `SessionConfig.VerifyCommand string`: explicit command; if empty, auto-detection runs: `go.mod` → `go test ./...`, `Cargo.toml` → `cargo test`, `package.json` → `npm test`, `Makefile` with `test:` target → `make test`, `pytest.ini` / `pyproject.toml` `[tool.pytest]` → `pytest`.
  - `SessionConfig.MaxVerifyRetries int`: max verify→repair cycles per edit turn (default: 2). After exhaustion the session proceeds without blocking.
  - Repair turns do NOT count toward `MaxTurns`; they are a transparent sub-loop.
  - Verification output is capped at 4 KB (tail kept, since the most relevant errors appear at the end).
  - Pipeline nodes wire the feature via `verify_after_edit`, `verify_command`, and `max_verify_retries` node attributes. `verify_command` can also be set at graph level as a default for all nodes.
  - New file `agent/verify.go`; 8 new tests in `agent/verify_test.go` and `agent/session_test.go`.
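The auto-detection mapping listed for `SessionConfig.VerifyCommand` can be sketched as a marker-file lookup. This is an illustration only: `detectVerifyCommand` is a hypothetical name, the project root is modelled as an in-memory map of file name to content, and treating the listed order as a precedence order is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// detectVerifyCommand picks a test command from marker files in the project
// root. files maps a file name to its content (an in-memory stand-in for the
// working directory). An empty result means nothing was detected.
func detectVerifyCommand(files map[string]string) string {
	has := func(name string) bool {
		_, ok := files[name]
		return ok
	}
	switch {
	case has("go.mod"):
		return "go test ./..."
	case has("Cargo.toml"):
		return "cargo test"
	case has("package.json"):
		return "npm test"
	case hasMakeTestTarget(files["Makefile"]):
		return "make test"
	case has("pytest.ini"), strings.Contains(files["pyproject.toml"], "[tool.pytest"):
		return "pytest"
	}
	return ""
}

// hasMakeTestTarget reports whether a Makefile declares a "test:" target.
func hasMakeTestTarget(makefile string) bool {
	for _, line := range strings.Split(makefile, "\n") {
		if strings.HasPrefix(line, "test:") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(detectVerifyCommand(map[string]string{"go.mod": "module demo"}))
	fmt.Println(detectVerifyCommand(map[string]string{
		"Makefile": "build:\n\tgo build\ntest:\n\tgo test ./...\n",
	}))
}
```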
- Library API for workflow catalog and resolution (partial #76, Phase 1): library consumers can now list, open, and resolve built-in workflows without shelling out to the CLI.
  - `tracker.Workflows() []WorkflowInfo` returns every embedded workflow sorted by name.
  - `tracker.LookupWorkflow(name) (WorkflowInfo, bool)` looks up a single built-in by bare name.
  - `tracker.OpenWorkflow(name) ([]byte, WorkflowInfo, error)` returns the raw `.dip` source for a built-in.
  - `tracker.ResolveSource(name, workDir) (source, WorkflowInfo, error)` mirrors the CLI's bare-name resolution (filesystem first, then embedded) and returns the actual source bytes.
  - `tracker.ResolveCheckpoint(workDir, runID) (path, error)` resolves a run ID (or unique prefix) to its `checkpoint.json` path under `.tracker/runs/<runID>/`.
  - `tracker.Config.ResumeRunID` lets library consumers set `cfg.ResumeRunID = "abc123"` and `NewEngine` resolves it to `CheckpointDir` automatically, equivalent to the CLI's `-r` / `--resume` flag. An explicit `CheckpointDir` on the same config still wins as a manual override.
  - Embedded workflow files moved from `cmd/tracker/workflows/` to top-level `workflows/` so they can be shared by both the tracker library and the CLI binary. The CLI continues to embed them via thin wrappers over the library functions.
- `ExportBundle(runDir, outPath string) error` library API and `--export-bundle` CLI flag (issue #77, Layer 2): after a run completes, `ExportBundle` calls `git bundle create <outPath> --all` against the artifact run directory to produce a single portable `.bundle` file capturing every commit and tag (including `checkpoint/*` tags) produced by `WithGitArtifacts`. The bundle can be cloned on any machine with `git clone <bundle>` and inspected with `git log`. `Result.ArtifactRunDir` is now populated when `Config.ArtifactDir` is set, giving callers a direct path to the run directory. `Result.BundlePath` is available for callers that populate it after calling `ExportBundle`. The CLI `--export-bundle <path>` flag invokes `ExportBundle` as a post-run step; failures print a warning and do not affect the run's exit code. No new dependencies: implemented with `os/exec`.
- `WithGitArtifacts(bool)` engine option (issue #77, Layer 1): when enabled alongside `WithArtifactDir`, the artifact run directory is initialized as a (non-bare) git repository at run start and a commit is created after every terminal-outcome node, including success, fail, retry-exhausted, goal-gate fallback, and goal-gate unsatisfied paths. Commits carry a structured message (`node(<id>): <handler> outcome=<status>`) plus duration, edge, and token/cost metadata. `git log` gives a human-readable audit trail of execution order. Successful node advances also create lightweight checkpoint tags (`checkpoint/<runID>/<nodeID>`) enabling future replay support. On checkpoint resume, `Init()` detects an existing HEAD and skips the "run started" commit so replay doesn't add noise. All git operations are best-effort: git failures emit `EventWarning` events and do not crash the engine. Requires `git` in PATH; silently no-ops if `artifactDir` is unset or git is missing.
- `tracker doctor` robustness fixes (PR #83 review round 2):
  - Writability probes now use `os.CreateTemp` instead of fixed filenames (`.tracker_test_write`, `.tracker_write_probe`); probes can't collide with real user files and are always cleaned up.
  - `checkProviders` no longer emits ✗ lines for unconfigured providers when at least one provider is already configured. Missing providers are shown as an informational hint line (e.g. "not configured: OpenAI, Gemini (optional)"). The ✗ lines appear only when zero providers are configured.
  - `checkGitignore` parses the `.gitignore` file line-by-line with exact (trimmed, slash-stripped) comparison instead of `strings.Contains` to prevent false positives (`runsheet` → `runs`, `my.tracker.bak` → `.tracker`).
  - Removed spurious `TRACKER_ARTIFACT_DIR` check: that env var is not wired into any CLI code path; checking it was misleading.
  - Disk space threshold confirmed at 10 GB (was already correct in code and CHANGELOG; the initial PR description saying 100 MB was wrong and has been corrected).
  - `resolveProviderBaseURL` in `doctor.go` was a duplicate of the canonical function. The duplicate is removed; `doctor.go` now calls the exported `tracker.ResolveProviderBaseURL`. The Gemini gateway suffix is corrected to `/google-ai-studio` (was `/gemini`).
  - `parseDoctorFlags` now validates `--backend` against the allowed set (`native`, `claude-code`, `acp`), consistent with `parseRunFlags`.
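The difference between substring matching and the line-by-line exact comparison described for `checkGitignore` is easy to see in a small sketch. `gitignoreHasEntry` is a hypothetical name, and skipping blank and comment lines is an assumption added for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// gitignoreHasEntry reports whether content contains want as a whole line,
// after trimming whitespace and stripping leading/trailing slashes on both
// sides. Blank lines and "#" comments are skipped (an assumption here).
func gitignoreHasEntry(content, want string) bool {
	want = strings.Trim(strings.TrimSpace(want), "/")
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.Trim(line, "/") == want {
			return true
		}
	}
	return false
}

func main() {
	content := "runsheet\nmy.tracker.bak\n"

	// Substring matching gives a false positive...
	fmt.Println(strings.Contains(content, "runs")) // true
	// ...while exact line comparison does not.
	fmt.Println(gitignoreHasEntry(content, "runs/"))                 // false
	fmt.Println(gitignoreHasEntry(".tracker/\n/runs\n", ".tracker")) // true
}
```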
- Per-node backend selection now overrides global `--backend` flag (issue #70): A node with `backend: native` always uses the native LLM client even when `--backend claude-code` is set globally, enabling mixed-backend pipelines (e.g. some nodes on claude-code subscription, others on OpenAI native API). The `selectBackend` priority is now documented: per-node attr > global flag > default native. The registry also registers the CodergenHandler when per-node backend attrs are present in the graph, even if the global default is native and no `--backend` flag is passed. Error messages for missing native client when using `--backend claude-code` now include actionable guidance.
- Start/exit node handler overwrite broadened fix: `ensureStartExitNodes` previously checked only the `prompt` attribute to decide whether to preserve a node's handler, which meant tool nodes (`tool_command`) and human nodes (`mode`) designated as start/exit would still have their handlers silently overwritten. The helper now bases the decision on the resolved `Handler` field: any handler other than `codergen` is always preserved; only a bare `codergen` node with no `prompt` gets the passthrough. This fixes cases like `parallel` with `parallel_targets`, `parallel.fan_in` with `fan_in_sources`, `conditional`, `subgraph`, `stack.manager_loop`, and `wait.human` nodes used as start/exit. Closes #69.
- Cloudflare AI Gateway support (`TRACKER_GATEWAY_URL` env var, `--gateway-url` CLI flag): set one gateway root URL and tracker routes every provider through Cloudflare's AI Gateway (Anthropic, OpenAI, Gemini, OpenAI-compat), avoiding 429 rate limits and enabling gateway-side analytics, caching, and model routing. The new `ResolveProviderBaseURL(provider)` helper resolves the per-provider base URL with priority `<PROVIDER>_BASE_URL` > `TRACKER_GATEWAY_URL` + provider suffix > empty (SDK default), so per-provider env var overrides still work. Closes #64.
- `tracker doctor` comprehensive preflight checks (closes #61): `tracker doctor` now runs a structured series of checks with clear pass/warn/fail status, actionable fix messages, and documented exit codes (0=all pass, 1=any failure, 2=warnings only). New checks include:
  - Per-provider API key validation with format hints (key prefix, length)
  - `--probe` flag for live auth validation (makes a minimal 1-token API call per configured provider; offline-safe by default). The probe adapters honor `<PROVIDER>_BASE_URL` env vars (and `TRACKER_GATEWAY_URL`) so probing through a Cloudflare gateway works.
  - `dippin` binary version detection; `checkVersionCompat` compares the installed CLI's major.minor against the `go.mod`-pinned version (v0.18.0) and warns on divergence.
  - `.ai/` subdirectory writability check (note: `TRACKER_ARTIFACT_DIR` env var is not checked; it is not wired into the CLI and was removed to avoid misleading output)
  - Disk space warning (warn if < 10 GB free; threshold confirmed in code; the initial PR description that said 100 MB was incorrect)
  - `.gitignore` check for `.tracker/`, `runs/`, and `.ai/` entries (line-by-line exact match, no more false positives from substrings like `runsheet`)
  - Environment variable warnings for dangerous override keys (`TRACKER_PASS_ENV`, `TRACKER_PASS_API_KEYS`)
  - `--backend claude-code` awareness: hard-fails (exit 1) if the `claude` CLI is not found; without `--backend` the missing binary is a warning only.
  - `tracker doctor [pipeline.dip]`: optional positional arg validates the pipeline file with full lint (same as `tracker validate`)
  - Human-readable composite result lines per check group (providers, binaries, dirs)
- `-w` / `--workdir` and `--backend` flags on `tracker doctor` so `tracker -w /path doctor` and `tracker --backend claude-code doctor` work as expected.
- OpenAI-Compat provider now has a real `--probe` implementation (previously silently skipped).
- Probe default models updated to current catalog entries: Anthropic → `claude-haiku-4-5`, Gemini → `gemini-2.0-flash`.
- Exit code 2 is emitted when doctor finishes with warnings but no hard failures (was always 0).
- `DoctorWarningsError` sentinel returned from `runDoctorWithConfig`; `main.go` maps it to `os.Exit(2)`.
- Webhook-based human gates for headless execution (Closes #63, Closes #86): new `tracker.Config.WebhookGate` library field and matching CLI flags wire a `WebhookInterviewer` that POSTs gate prompts to a user-configured webhook URL and blocks on a callback. The interviewer starts a local HTTP server on a configurable address, tracks pending gates by UUID with per-gate shared-secret tokens (`X-Tracker-Gate-Token`) to authenticate inbound callbacks (mismatches return 401), supports a per-gate timeout with configurable action (`fail` / `success`), optional `Authorization` header for outbound requests, server-side HTTP timeouts (`ReadHeaderTimeout` 10s / `ReadTimeout` 30s / `WriteTimeout` 30s / `IdleTimeout` 60s), 64 KB callback body cap via `http.MaxBytesReader`, wildcard-address rewrite (`0.0.0.0` / `[::]` → `127.0.0.1`) so the outbound payload carries a dialable callback URL, and an explicit `Cancel()` that closes the server and unblocks pending gates. Implements both `FreeformInterviewer` and `LabeledFreeformInterviewer` so it drops into existing pipeline flows unchanged. CLI flags added: `--webhook-url` (required to enable), `--gate-callback-addr` (default `:8789`), `--gate-timeout` (default `10m`), `--gate-timeout-action` (`fail` / `success`), `--webhook-auth` (outbound `Authorization` header). Mutual exclusion with `--autopilot` and `--auto-approve` is enforced at parse time. Validation rejects invalid `--gate-timeout-action` values at parse time.
- Per-node context scoping (`PipelineContext.ScopeToNode`): after each node's handler completes, the engine copies every key written during that node's execution into a `node.<nodeID>.<key>` namespace. Downstream nodes can read `node.MyAgent.last_response` to get a specific upstream node's output without being affected by later writes to the bare `last_response` key. Bare keys retain their last-writer-wins global semantics for full backward compatibility. New convenience method `GetScoped(nodeID, key)`. Closes #32.
- `pipeline.ContextKeyNodePrefix` constant (`"node."`), the namespace prefix for per-node scoped keys.
- `Result.Cost` on the library API with per-provider rollup (`map[string]llm.ProviderCost`) and `TotalUSD`. Populated from the `llm.TokenTracker` middleware and priced via `llm.EstimateCost`. Closes #62.
- `pipeline.BudgetGuard` enforcing `MaxTotalTokens`, `MaxCostCents`, and `MaxWallTime` limits. Halts the run with `pipeline.OutcomeBudgetExceeded` when any dimension trips. Closes #17.
- New `tracker.Config.Budget` field (type `pipeline.BudgetLimits`) for library consumers.
- New CLI flags on `tracker run`: `--max-tokens`, `--max-cost` (cents), `--max-wall-time`.
- New pipeline events `cost_updated` (streaming per-node cost snapshots) and `budget_exceeded` (fired on halt). Both carry a `CostSnapshot` payload with `TotalTokens`, `TotalCostUSD`, `ProviderTotals`, and `WallElapsed`.
- `tracker diagnose` surfaces a "Budget halt detected" section when a run halts on budget.
- `UsageSummary.ProviderTotals` (per-provider token and cost rollup) on `pipeline.Trace.AggregateUsage()` output.
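The per-node context scoping entry in this list (`PipelineContext.ScopeToNode` and `GetScoped`) can be pictured with a toy context that tracks which keys a node wrote. Everything below is a hypothetical model for illustration: the type, its fields, and the method signatures are assumptions, and only the `node.<nodeID>.<key>` layout and last-writer-wins bare keys come from the entry.

```go
package main

import "fmt"

const nodePrefix = "node." // mirrors the documented ContextKeyNodePrefix value

// scopedContext is a toy stand-in for the pipeline context.
type scopedContext struct {
	values map[string]string
	dirty  map[string]bool // keys written since the last ScopeToNode call
}

func newScopedContext() *scopedContext {
	return &scopedContext{values: map[string]string{}, dirty: map[string]bool{}}
}

// Set writes a bare key (last writer wins) and marks it dirty.
func (c *scopedContext) Set(key, value string) {
	c.values[key] = value
	c.dirty[key] = true
}

// ScopeToNode copies every key written during the node's execution into the
// node.<nodeID>.<key> namespace, then clears the dirty set.
func (c *scopedContext) ScopeToNode(nodeID string) {
	for key := range c.dirty {
		c.values[nodePrefix+nodeID+"."+key] = c.values[key]
	}
	c.dirty = map[string]bool{}
}

// GetScoped reads a key as it was written by one specific node.
func (c *scopedContext) GetScoped(nodeID, key string) (string, bool) {
	v, ok := c.values[nodePrefix+nodeID+"."+key]
	return v, ok
}

func main() {
	ctx := newScopedContext()

	ctx.Set("last_response", "draft A")
	ctx.ScopeToNode("MyAgent")

	ctx.Set("last_response", "draft B")
	ctx.ScopeToNode("Reviewer")

	scoped, _ := ctx.GetScoped("MyAgent", "last_response")
	fmt.Println(scoped)                      // draft A
	fmt.Println(ctx.values["last_response"]) // draft B
}
```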
- Reading budget limits from `.dip` workflow attrs is blocked on dippin-lang IR support; tracked in #67.
- Turn-limit exhaustion treated as success: Agents that exhausted their turn limit (or entered a tool call loop) were silently treated as `OutcomeSuccess`, allowing pipelines to advance past nodes that wrote zero files. Now returns `OutcomeFail` so the engine routes through explicit `when ctx.outcome = fail` edges (or stops via strict-failure-edge when no failure edge exists).
- Loop detection produces distinct diagnostic: `turn_limit_msg` context key now distinguishes "agent entered tool call loop" from "agent exhausted turn limit" for clearer `tracker diagnose` output.
- `ContextKeyTurnLimitMsg` constant: New `pipeline.ContextKeyTurnLimitMsg` context key for turn-limit and loop-detection diagnostics. Added to `reservedContextKeys()` for linter recognition.
- Turn-limit and loop-detection tests: `TestCodergenHandlerMaxTurnsExhaustedIsFail`, `TestCodergenHandlerMaxTurnsWithAutoStatusSuccess`, `TestCodergenHandlerMaxTurnsWithAutoStatusFail`, `TestCodergenHandlerLoopDetectedMessage`.
- Thinking signature dropped in streaming: The Anthropic SSE handler now captures `signature_delta` events. Previously, thinking block signatures were silently lost during streaming, causing multi-turn sessions with extended thinking (Opus 4.6) to crash with `messages.N.content: Input should be a valid list` when the API rejected the signature-less thinking block on the next turn.
- Redacted thinking blocks dropped in streaming: The SSE handler now captures `redacted_thinking` content blocks and round-trips them through the `StreamAccumulator`. Previously, these opaque blocks were silently dropped, breaking conversation continuity.
- Nil message content serialized as `null`: `translateMessage` now initializes content as an empty slice so JSON serializes to `[]` instead of `null` when all content parts are skipped.
- Comprehensive human gate test suite: `examples/human_gate_test_suite.dip` exercises all 4 gate modes (choice, yes_no, freeform, interview) plus timeout, default_choice, ctx.outcome routing, hybrid freeform, and interview cancel. 100 simulated paths, all reaching Exit.
- Backend selection precedence test: Verifies node attr overrides global `--backend` CLI flag.
- dippin-lang v0.18.0: Updated from v0.17.0. Adds `flatten` package for inlining subgraph refs into a single flat workflow.
- human_gate_showcase.dip: EchoFreeform agent no longer asks follow-up questions that conflict with the next gate's choices.
- `mode: yes_no` human gate outcome mapping: Yes now returns `OutcomeSuccess`, No returns `OutcomeFail`. Previously, `yes_no` fell through to choice mode which always returned `OutcomeSuccess` regardless of selection, causing `ctx.outcome = fail` conditions to never match. Pipelines using `mode: yes_no` with `ctx.outcome` edge conditions now route correctly.
- `executeYesNo` handler: Dedicated handler for `mode: yes_no` human gates. Presents fixed "Yes"/"No" choices and maps selection to outcome status. Comprehensive test coverage for all four human gate modes (choice, yes_no, freeform, interview).
- ACP (Agent Client Protocol) backend: Third execution backend alongside native and claude-code. Spawns ACP-compatible coding agents as subprocesses via JSON-RPC 2.0 over stdio using `github.com/coder/acp-go-sdk`. Per-node selection via `backend: acp` + `acp_agent` params in .dip files, global override via `--backend acp` CLI flag.
- ACP agent routing: Provider-based binary mapping (`anthropic` → `claude-agent-acp`, `openai` → `codex-acp`, `gemini` → `gemini --acp`). The `acp_agent` node attribute overrides provider-based selection.
- ACP model bridging: `mapModelToBridge` maps tracker model names (e.g. `claude-sonnet-4-6`) to bridge model IDs via substring matching against `NewSession` advertised models.
- ACP environment scoping: API keys and base URLs stripped from subprocess environment by default so agents use native auth (subscription/OAuth). Override with `TRACKER_PASS_API_KEYS=1`.
- ACP terminal management: Full `CreateTerminal`, `TerminalOutput`, `KillTerminalCommand`, `ReleaseTerminal` implementation with process group isolation (`Setpgid`) and goroutine-safe output buffering.
- ACP file operations: `ReadTextFile` and `WriteTextFile` handlers scoped to the node's working directory.
- `ACPConfig` type: Backend-specific config carrying explicit agent binary name, extracted from `params.acp_agent` in .dip files.
- `--backend acp` CLI flag: Routes all agent nodes through ACP without per-node attrs.
- ACP data race on empty response check: `handler.mu` now locked before reading `textParts`/`toolCount` after prompt completes.
- ACP terminal output data race: Replaced `bytes.Buffer` with `syncBuffer` (mutex-protected writer) for subprocess stdout/stderr.
- ACP protocol version validation: `InitializeResponse.ProtocolVersion` checked against `ProtocolVersionNumber` with warning on mismatch.
- ACP empty Cwd fallback: `os.Getwd()` used when `WorkingDir` is empty, preventing ACP SDK validation failure.
- ACP process kill safety: `Pid > 0` guard before `syscall.Kill(-pid, SIGKILL)` at all 3 call sites to prevent killing pid 0 process group.
- `TRACKER_PASS_API_KEYS` truthiness: Changed from `!= ""` to `== "1"` so `"false"` and `"0"` correctly strip keys.
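The `TRACKER_PASS_API_KEYS` truthiness fix can be illustrated with a short sketch. The helper names (`passAPIKeys`, `scopeEnv`) and the exact list of stripped variables are assumptions for illustration, not tracker's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys is an assumed list of variables to strip from the subprocess env.
var sensitiveKeys = []string{"ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"}

// passAPIKeys reports whether keys should be forwarded. Only the exact value
// "1" opts in, so "false", "0", and the empty string all strip keys.
func passAPIKeys(value string) bool {
	return value == "1"
}

// scopeEnv returns env unchanged when passThrough is set; otherwise it returns
// a copy with every sensitive variable removed.
func scopeEnv(env []string, passThrough bool) []string {
	if passThrough {
		return env
	}
	out := make([]string, 0, len(env))
	for _, kv := range env {
		name, _, _ := strings.Cut(kv, "=")
		strip := false
		for _, key := range sensitiveKeys {
			if name == key {
				strip = true
				break
			}
		}
		if !strip {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "OPENAI_API_KEY=sk-example"}
	for _, setting := range []string{"", "0", "false", "1"} {
		scoped := scopeEnv(env, passAPIKeys(setting))
		fmt.Printf("TRACKER_PASS_API_KEYS=%q keeps %d of %d variables\n", setting, len(scoped), len(env))
	}
}
```

Under the earlier `!= ""` check, a value such as `"false"` would have counted as opt-in, which is the behavior the entry describes as fixed.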
- Per-node response context keys: Codergen and human handlers now write `response.<nodeID>` alongside `last_response`/`human_response`, enabling downstream nodes to reference specific upstream outputs instead of only the most recent. (#24)
- Parallel concurrency limits: `max_concurrency` attr on parallel nodes limits concurrent branch goroutines via semaphore. Context-aware acquisition aborts on cancellation. (#27)
- Parallel branch timeout: `branch_timeout` attr on parallel nodes sets per-branch context deadline. Slow branches fail without blocking fan-in. (#27)
- Human gate timeout: `timeout` attr on human nodes with `timeout_action` (default/fail) and `default_choice` fallback. Applied to freeform, choice, and interview modes. (#30)
- Edge adjacency indexes: `OutgoingEdges`/`IncomingEdges` now use O(1) map lookup via adjacency indexes built by `AddEdge`, with O(E) fallback for graphs built without `AddEdge`. Returns defensive copies. (#31)
- Format constants: `FormatDip` and `FormatDOT` typed constants for pipeline format identification. (#9)
- Pipeline package documentation: `pipeline/doc.go` with package overview and dual-format documentation. (#12)
- P0: Goal-gate infinite fallback loop: `FallbackTaken` guard persisted in checkpoint prevents one-shot fallback/escalation from looping. Separate fallback routing path in `handleExitNode` doesn't increment retry counts. (#15)
- P0: Parallel branch context loss on fan-in: `PipelineContext.DiffFrom()` captures side effects from parallel branches. (#20)
- Adapter nil pointer guards: Nil checks for IR nodes, edges, and all 6 pointer config types in `extractNodeAttrs`. Also guards in `synthesizeImplicitEdges` and `buildFanInSourceMap`. (#38)
- Adapter sentinel errors: `ErrNilWorkflow`, `ErrMissingStart`, `ErrMissingExit`, `ErrUnknownNodeKind`, `ErrUnknownConfig` with `%w` wrapping for `errors.Is` support. (#33)
- Deterministic map iteration: `extractSubgraphAttrs` and `serializeStylesheet` sort keys before iteration via `slices.Sorted(maps.Keys(...))`. (#8)
- Workflow.Version mapping: `ir.Workflow.Version` now mapped to `g.Attrs["version"]`. (#25)
- Validation bypass removed: Deleted `DippinValidated` field; all 5 structural validation checks always run for defense-in-depth. (#4)
- Library stderr cleanup: Replaced `fmt.Fprintf(os.Stderr, ...)` with `log.Printf(...)` in library code (tracker.go, condition.go, autopilot handlers). (#7)
- Case-insensitive auto_status: `parseAutoStatus` now matches STATUS prefix case-insensitively and skips STATUS lines inside code fences. (#23)
- Word-boundary fidelity truncation: `truncateAtWordBoundary` cuts at whitespace (`unicode.IsSpace`) instead of mid-word, with `...` suffix and named `DefaultTruncateLimit` constant. (#34)
- Condition parser hardening: Support `==` operator (space-delimited), strip surrounding double quotes from values in `=`/`==`/`!=` comparisons. (#21)
- Consensus pipeline parallelized: `consensus_task.dip` now uses parallel fan-out/fan-in for DoD, Planning, and Review phases. (#26)
- CLI format detection default: Unknown extensions now default to `.dip` instead of `.dot`, with case-insensitive extension matching. (#9)
- Empty API response retry: Empty API responses (0 output tokens, 0 tool calls) now trigger `OutcomeRetry` instead of hard-failing. (#23)
- POSIX build constraint: `//go:build !windows` on `agent/exec/local.go`. (#28)
- ConsoleInterviewer IsYesNo priority: Yes/no check now runs before option list check, matching TUI behavior. (#48 review)
- Test rename: `TestListBuiltinWorkflowsReturnsThree` → `ReturnsFour`. (#48 review)
- Retry backoff jitter: `ExponentialBackoff` and `LinearBackoff` now apply ±25% random jitter to prevent thundering herd when multiple pipelines retry simultaneously. (#29)
- Code cleanup: Unexported `NodeKindToShape`, removed `make([]*Edge, 0)`, replaced custom `contains` helper with `strings.Contains`, replaced bubble sort with `slices.SortFunc`. (#10)
- DOT format support: `ParseDOT` is deprecated. Use `.dip` format with `FromDippinIR` instead. DOT support will be removed in v1.0. (#12)
- Interview mode for human gates: New `mode: interview` on human nodes enables structured multi-field form collection. An upstream agent generates markdown questions; the interview handler parses them into individual fields (select with inline options, yes/no confirm, freeform textarea). Answers are stored as JSON at a configurable context key and as a markdown summary at `human_response`. Supports retry pre-fill, cancellation with partial answers, and 0-question fallback to freeform.
- Interview question parser: `ParseQuestions()` extracts structured questions from agent markdown: numbered items, bulleted questions, imperative prompts. Trailing parentheticals like `(option1, option2)` become select field options. Yes/no patterns auto-detected. Fenced code blocks skipped.
- TUI interview modal: Fullscreen one-question-at-a-time form with progress bar, answered summary, selection feedback (filled dot + checkmark), elaboration textareas (Tab), submit (Ctrl+S), cancel (Esc), and PgUp/PgDn jump navigation. Pre-fills from previous answers on retry.
- Interview autopilot support: `AutopilotInterviewer`, `ClaudeCodeAutopilotInterviewer`, and `AutopilotTUIInterviewer` all implement `AskInterview`. LLM-backed autopilot sends all questions in a single prompt, parses JSON response, retries once on parse failure, hard-fails on double failure.
- Console interview support: `ConsoleInterviewer.AskInterview` presents questions one at a time with option selection by name or number, blank-line skip, and previous-answer hints on retry.
- `deep_review` built-in workflow: Interview-driven codebase review pipeline with 3 structured interview gates (scope, findings, priority), parallel analysis (correctness, security, design), and remediation plan generation. Run with `tracker deep_review`.
- `interview-loop.dip` subgraph: Reusable interview loop pattern (ask → answer → assess → loop) in `examples/subgraphs/`. Parameterized with `topic` and `focus` for embedding via `subgraph` nodes.
- Structured JSON question format: `ParseStructuredQuestions()` parses JSON questions from agent output with validation. Handles code fences, preamble text, and extracts `{"questions": [...]}` objects. Falls back to markdown heuristic parsing. "Other" option variants are auto-filtered since the UI always provides its own.
- One-question-at-a-time TUI: Interview form shows one question with full context, progress bar, answered summary, and remaining count. Selection feedback with filled dot and checkmark. Enter confirms and advances.
- `response_format` support: Agent nodes can set `response_format: json_object` or `response_format: json_schema` with `response_schema:` to force structured output at the LLM API level. Plumbed from `.dip` files through dippin IR → adapter → codergen → agent session → all three providers (Anthropic, OpenAI, Gemini).
- Agent `params` map: Generic key-value pass-through from `.dip` files via `AgentConfig.Params` (dippin-lang v0.16.0). Enables runtime features like `backend: claude-code` without IR schema changes.
- Empty API response diagnostics: Anthropic adapter logs raw response body, HTTP status, stop_reason, model, and request-id when API returns 0 output tokens. Session layer retries completely empty responses with diagnostic event emission.
- EngineResult.Usage: Pipeline runs now expose aggregated token counts and cost via `EngineResult.Usage` (`*UsageSummary`). Downstream consumers can read `TotalInputTokens`, `TotalOutputTokens`, `TotalTokens`, `TotalCostUSD`, and `SessionCount` directly from the result.
- Per-node token tracking in SessionStats: `InputTokens`, `OutputTokens`, `TotalTokens`, `CostUSD`, `ReasoningTokens`, `CacheReadTokens`, `CacheWriteTokens` fields on `SessionStats` in trace entries.
- Parallel branch stats aggregation: Parallel handler now collects and aggregates `SessionStats` from branch outcomes into its own trace entry.
- Consistent JSON tags: All fields on `SessionStats`, `TraceEntry`, and `Trace` now have `json:"snake_case"` tags for consistent serialization.
- Interview cancellation returns OutcomeFail: Canceled interviews now return `fail` status instead of `success`, allowing pipeline edges to route canceled interviews differently from completed ones.
- ClaudeCode autopilot hard-fails on parse error: `ClaudeCodeAutopilotInterviewer.AskInterview` now retries once on JSON parse failure and hard-fails on double failure, matching the native autopilot behavior. Previously, it silently fell back to first-option defaults.
- SerializeInterviewResult enforced: Panics on marshal failure instead of silently returning empty string, preventing downstream deserialization corruption.
- Goroutine leak in autopilot flash: `flashDecision` goroutine now exits immediately when the caller unblocks via a `done` channel, instead of sleeping for the full 2-second timer. Includes `defer`/`recover` for panic safety per CLAUDE.md.
- Mode 1 tea.Cmd propagation: All three TUI runner types (choice, freeform, interview) now propagate `tea.Cmd` from `content.Update()` instead of discarding it.
- Context leak in retry loop: `ClaudeCodeAutopilotInterviewer.AskInterview` uses explicit `cancel()` calls instead of `defer cancel()` inside a for loop, preventing context timer goroutine leaks on retry.
- Empty API response guard: Agent sessions that receive completely empty responses (0 content parts, 0 output tokens, no prior tool calls) now retry with a continuation prompt instead of silently succeeding with empty `last_response`. Codergen handler also fails the node when the session produces empty text with zero tool calls.
- Start/exit agent nodes preserved: `ensureStartExitNodes` no longer overwrites the `codergen` handler on agent nodes designated as start or exit. Agent start/exit nodes now execute their LLM prompts instead of being silently replaced with no-op passthroughs. (Closes #42)
- DecisionDetail token mapping: `TokenInput`/`TokenOutput` in pipeline events now correctly map from `InputTokens`/`OutputTokens` instead of `CacheHits`/`CacheMisses`.
- Native backend double-counting: Token usage from the native backend is no longer reported twice to the `TokenTracker`.
- Cancel/fail EndTime: Cancelled and retry-exhausted runs now set `trace.EndTime` so the run summary shows duration.
- failResult atomicity: `failResult()` now accepts a `*Trace` parameter and sets both `Trace` and `Usage` internally, preventing silent data loss.
- Built-in pipeline prompts: Removed trivial placeholder prompts from Start/Done nodes in built-in workflows that were causing unnecessary LLM calls.
- TUI: Progress bar with ETA: Amber ASCII bar (`━━━──────`) in the status bar shows completed/total nodes. ETA appears after 2+ real LLM nodes complete, based on rolling average of node durations.
- TUI: Desktop notification: Fires OS-native notification on pipeline completion (macOS `osascript`, Linux `notify-send`). Disable with `TRACKER_NO_NOTIFY=1`.
- TUI: Log verbosity cycling (`v`): Cycle through All → Tools → Errors → Reasoning. View-level filter only; all lines are always stored (append-only per CLAUDE.md).
- TUI: Zen mode (`z`): Hide sidebar, agent log gets full terminal width. Status bar and modal gates still work.
- TUI: Help overlay (`?`): Modal showing all keyboard shortcuts in a styled two-column table.
- TUI: Agent log search (`/`): Inline search bar with real-time highlighting. `n`/`N` jump between matches. Search intersects with verbosity filter.
- TUI: Per-node cost tracking: Shows cost badge on completed nodes in the sidebar. Uses delta snapshots from `TokenTracker`. Parallel branches show `~` prefix (approximate). Max subscription shows "usage" not "cost".
- TUI: Node drill-down (`Enter`): Arrow keys navigate the node list, Enter focuses the log on that node, Esc returns to full view.
- TUI: Copy to clipboard (`y`): Copies visible (filtered) log text. Uses `pbcopy`/`xclip`. Error message includes diagnostic on failure.
- TUI: Status bar flash: "Copied!" confirmation that auto-clears after 2 seconds.
- Claude-code autopilot: New `ClaudeCodeAutopilotInterviewer` routes autopilot gate decisions through the `claude` CLI subprocess instead of direct API calls. No API key needed for `--autopilot` with `--backend claude-code`.
- `--auto-approve` works with TUI: No longer forces `--no-tui`. Gates auto-dismiss in the dashboard.
- Claude-code env: API keys stripped: `buildEnv()` strips `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY` from the subprocess environment so the `claude` CLI uses Max/Pro subscription auth instead of consuming API credits. Override with `TRACKER_PASS_API_KEYS=1`.
- Lazy LLM client: `buildLLMClient()` failure is non-fatal with `--backend claude-code`. The native client is only required when something actually needs it (native backend nodes, native autopilot).
- Claude-code backend handles all providers: With `--backend claude-code`, nodes with `provider: openai` or `provider: gemini` also route through the claude CLI. Non-Anthropic model names are stripped so the CLI uses its default.
- Max subscription cost labeling: Header, sidebar, and exit summary show "~$X.XX usage" instead of "$X.XX" when all usage is from `claude-code` provider. Exit summary adds "(Max subscription — no actual charge)".
- Strict failure edges: When a node's outcome is "fail" and all outgoing edges are unconditional, the pipeline now stops instead of silently continuing. Pipelines that intentionally handle failure must use explicit `when ctx.outcome = fail` edges.
- Status bar hints: Updated to show all new shortcuts (`v filter z zen / search ? help q quit`).
- TUI: Sidebar connector alignment: Connectors (`│`) now align with node lamps when selection mode is active.
- TUI: Scroll follows selection: Up/Down navigation scrolls the node list viewport to keep the selected node visible.
- Search: `formatMatchStatus` bug: Rune arithmetic broke for 10+ matches. Now uses `fmt.Sprintf`.
- Search: Match consistency with filters: Search matches against the filtered view, not the full line buffer.
- Verbosity: Separators preserved: Node separator lines pass through all verbosity filters for structural context.
- Zen mode: `relayout()` fix: Terminal resize in zen mode now gives the agent log full width.
- Exit hang: `runTUI()` waits at most 5 seconds for the pipeline goroutine after the TUI closes.
- Notification zombie: `SendNotification` uses `cmd.Run()` in a goroutine instead of `cmd.Start()` without `Wait()`.
- Claude Code subprocess killed after 10 seconds: `exec.CommandContext` + `WaitDelay` created a race where Go's process management sent SIGKILL to the Claude Code subprocess after exactly 10 seconds, despite no context cancellation. Switched to plain `exec.Command`.
- Claude Code auth failure from stripped environment: The minimal env allowlist prevented Claude Code from finding its OAuth token / config directory. Now passes the full parent environment.
- NDJSON unmarshal error on subagent results: Claude Code's subagent tool results return `content` as an array of blocks, not a string. The parser now handles both formats.
- Autopilot runs inside the TUI: `--autopilot` no longer forces `--no-tui`. Gate decisions flash in a modal for 2 seconds showing "AUTOPILOT" header, the prompt, and the chosen option in green. Press Enter to dismiss early.
- Backend and autopilot tags in TUI header: Orange tag for `claude-code`, purple tag for autopilot persona, always visible next to the pipeline name.
- "Agent backend:" startup message: Prints the active backend before the TUI starts (visible in `--no-tui` mode).
- Claude Code backend: Pluggable `AgentBackend` interface with `--backend claude-code` flag. Spawns the `claude` CLI as a subprocess, parses NDJSON output, and maps exit codes to pipeline outcomes. Per-node via `backend: claude-code` in `.dip` files, or global via CLI flag. Includes environment scoping, token tracking, and retryable init.
- `tracker update`: Self-update command downloads the latest GitHub release, verifies SHA256 checksum, extracts the binary, smoke-tests it, and atomically replaces the current binary with a `.bak` rollback. Detects install method (Homebrew → advises `brew upgrade`, go install → advises `go install @latest`, binary → self-replaces).
- Non-blocking update check: On every `tracker run`, a background goroutine checks for new releases (24h file-based cache). Prints a one-line hint to stderr if an update is available. Disabled in CI (`CI` env) or with `TRACKER_NO_UPDATE_CHECK`.
- Upgraded dippin-lang dependency v0.10.0 → v0.12.0 (preferred_label fix, immediately_after assertions, tool command lint, subgraph validation, test coverage)
- Tightened 5 dippin test assertions with `immediately_after` for stricter edge verification
- PickNextMilestone silent skip: Flexible milestone header matching now handles `## Milestone 1: Title`, `### Milestone 1 — Setup`, and other LLM formatting variations. Fails loudly if no milestones found or extraction produces an empty file.
- Removed `eval` of LLM-generated verify commands: TestMilestone no longer evals commands extracted from milestone specs; this was arbitrary code execution from free-form LLM text. Verification is now the Implement agent's responsibility.
- TestMilestone known_failures parsing: Strip comments and blank lines, use `go test -skip` instead of unsupported `(?!` negative lookahead.
- PickBest winner parsing hardened: Uses `grep -ioE 'claude|codex|gemini'` regardless of markdown formatting.
- Provider errors hard-fail per CLAUDE.md (autopilot review fixes)
- Default autopilot model picks cheapest from configured provider
- Autopilot forces `--no-tui`, `matchChoice` uses longest-match, `decide()` returns errors
- `--autopilot <persona>`: Replace all human gates with LLM-backed decisions. Four personas encode different risk tolerances:
  - lax: Bias toward forward progress. Approves plans, marks done on escalation, accepts reviews.
  - mid: Balanced engineering judgment. The default persona if none specified.
  - hard: High quality bar. Pushes back on gaps, demands fixes, retries before accepting.
  - mentor: Approves forward progress but writes detailed constructive feedback.
- `--auto-approve`: Deterministic auto-approval of all human gates. No LLM calls; always picks the default or first option. For testing pipeline flow and CI.
- Uses the pipeline's existing LLM client with low temperature (0.1) for consistent decisions. Structured JSON output with fallback-to-default on error.
- Signature collision in retry detection: Failure signatures now use null byte separator instead of pipe, preventing false "identical" matches when error strings contain `|`.
- Duration label clarity: Shows "Duration (last):" instead of "Duration:" when a node had multiple retries, so users know the value is the last attempt's duration, not total.
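The signature collision fixed above can be reproduced in a few lines. The `signature` helper and the choice of fields being joined are hypothetical; the changelog does not say which fields make up a failure signature:

```go
package main

import (
	"fmt"
	"strings"
)

// signature joins the parts of a failure into one comparable string.
func signature(sep string, parts ...string) string {
	return strings.Join(parts, sep)
}

func main() {
	// Two different failures whose fields contain a pipe character.
	a := []string{"exit 1", "a|b"}
	b := []string{"exit 1|a", "b"}

	// With "|" as separator, both collapse to "exit 1|a|b".
	fmt.Println(signature("|", a...) == signature("|", b...)) // true

	// A null byte cannot appear in ordinary error text, so the
	// boundaries between fields stay unambiguous.
	fmt.Println(signature("\x00", a...) == signature("\x00", b...)) // false
}
```

Any separator that can occur inside a field has this problem; a null byte is simply a delimiter that ordinary error text should not contain.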
- Deterministic failure detection in `tracker diagnose`: When a tool node fails multiple times with identical errors, diagnose now flags it as a deterministic bug: "Failed 5 times with identical errors — this is a deterministic bug in the command, not a transient failure. Retrying won't help. Fix the tool command in the .dip file and re-run." Distinguishes deterministic failures (same error every time) from flaky failures (varying errors across retries).
- Retry count in diagnose output: Failed nodes now show "Attempts: N failures (all identical — deterministic)" in the diagnosis, so the retry pattern is visible at a glance without reading suggestions.
- README rewritten: Added v0.10.0 features (workflows, init, bare names), mermaid diagrams for build_product milestone loop and architecture layers, full CLI reference section, development section with `dippin test`.
- CLAUDE.md updated: Fixed stale `EscalateToHuman` reference in edge routing rules, added `tracker workflows`/`tracker init` docs and bare name resolution section.
- `suggested_next_nodes` string literal: Extracted `ContextKeySuggestedNextNodes` constant in `pipeline/context.go`, eliminating 6 scattered string literals across engine and handler code.
- `enrichFromActivity` cognitive complexity (34 → 18): Extracted `enrichFromEntry()` helper for per-line processing.
- `printDiagnoseSuggestions` cyclomatic complexity (16 → 8): Extracted `suggestionsForFailure()` helper. All functions now pass complexity thresholds.
- Embedded built-in workflows: The 3 flagship pipelines (`ask_and_execute`, `build_product`, `build_product_with_superspec`) are now embedded in the binary via `go:embed`. Users who install via `brew` or `go install` can run them without cloning the repo.
- `tracker workflows`: Lists all built-in workflows with their display names and goals.
- `tracker init <workflow>`: Copies a built-in workflow to the current directory for customization. Refuses to overwrite existing files.
- Bare name resolution: `tracker build_product`, `tracker validate build_product`, and `tracker simulate build_product` all work with bare workflow names. Local `.dip` files always take precedence over built-ins.
- `make sync-workflows` / `make check-workflows`: Makefile targets to keep embedded copies in sync with `examples/`. CI enforces sync.
- Split `EscalateToHuman` into two context-specific gates in `build_product.dip`:
  - `EscalateMilestone` (mid-build): offers mark done (override test, continue to next milestone), retry (re-implement from scratch), accept (skip to cleanup), abandon. Defaults to "mark done".
  - `EscalateReview` (post-build): offers accept (ship it), retry (back to Decompose), abandon. Defaults to "accept".
- Escalation gates now have `prompt:` blocks with rich context explaining each option (requires dippin-lang v0.9.0+).
- TestMilestone early-exit bug: Previously, the attempt counter was checked before running tests. A milestone that was genuinely fixed on attempt 4 would escalate instead of succeeding. Tests now run first; the counter is only checked on failure.
- Milestone escalation was a dead end: `EscalateToHuman` had no edge back into the build loop. Choosing "accept" ended the entire build instead of continuing to the next milestone. `EscalateMilestone -> MarkMilestoneDone` now enables "mark done and move on."
- 23 dippin simulation tests for `build_product.dip` covering every edge from both escalation gates, all human gate label selections, fix loop mechanics, and cross-review routing. Uses dippin-lang v0.9.0 features: `preferred_label`, `immediately_after`, and `prompt:` blocks on human gates.
- 18 Go unit tests for the embedded workflow system: catalog lookup, resolution order (filesystem > local .dip > embedded > error), flag parsing for `workflows`/`init`, init file creation and overwrite protection.
- `tracker diagnose [runID]`: Deep failure analysis for pipeline runs. Reads per-node status files and activity logs to surface tool stdout/stderr, error messages, and timing anomalies. Provides actionable suggestions (e.g., stale fix_attempts counter, suspiciously fast execution, missing tools). Without a run ID, analyzes the most recent run.
- `tracker doctor`: Preflight health check verifying LLM provider API keys (masked in output), dippin binary availability, and working directory access. Shows actionable hints for every failure.
- Provider status in `tracker version`: Shows which LLM providers have API keys configured, or prompts `tracker setup` if none are found.
- VCS-aware local builds: `go install` builds now show the git commit hash and build timestamp via Go's embedded VCS metadata, instead of `unknown`. GoReleaser ldflags still take precedence for release builds.
- Freeform "other" option in review hybrid: ReviewHybridContent now includes an "other (provide feedback)" option with a textarea, so users can provide custom retry instructions at labeled escalation gates, not just pick from predefined labels.
- Runtime error surfacing in TUI: The activity log now shows `FAILED:` and `RETRYING:` messages inline when nodes fail or retry. Previously, tool node failures only updated the sidebar icon with no details visible.
- ReviewHybridContent phantom cursor: `totalOptions()` returned `len(labels)+1`, creating an unreachable dead-end cursor position. Now correctly bounded to label count + 1 (for "other").
- Glamour rendering in review hybrid: The prompt label portion was rendered with plain lipgloss bold, bypassing glamour. Now the full prompt (label + context) goes through glamour so headings, code blocks, and lists render correctly in the viewport.
- Actionable "no providers" error: The bare `error: create LLM client: no providers configured` message is replaced with specific env var names and a `tracker setup` hint.
- ReviewHybridContent phantom cursor position: `totalOptions()` returned `len(labels)+1`, creating an unreachable "other" slot with no textarea; the cursor could land on a dead-end position that couldn't be submitted. Now correctly bounded to label count only.
- RadioHeight off-by-one in review hybrid: Viewport height calculation reserved space for a non-existent "other" option line, wasting a terminal row.
- Subgraph Loading: CLI now loads and executes subgraph references from `.dip` files. Path resolution tries relative to parent file, with `.dip` extension auto-appended, recursive loading with cycle detection
- Hybrid Radio+Freeform Gate: Human gates with labeled outgoing edges present a radio list of labels plus an "other" option for custom freeform feedback
- Split-Pane Review View: Long human gate prompts (20+ lines) use a fullscreen split-pane with glamour-rendered scrollable viewport and textarea
- Upfront Subgraph Validation: Every subgraph node is validated at load time — missing refs, empty refs, and circular refs all fail immediately with clear messages
- Subgraph handler was never wired: The CLI had SubgraphHandler and WithSubgraphs but never called either — subgraph nodes always failed at runtime with "subgraph not found"
- Child registry used wrong graph for human gates: RegistryFactory now overrides WithInterviewer with the child graph so human gates inside subgraphs see the correct edge labels
- Circular subgraph refs caused runtime stack overflow: Now detected at load time via absolute-path cycle detection
- Concurrent subgraph executions shared mutable state: InjectParamsIntoGraph now deep-clones Attrs, Edges, and NodeOrder instead of sharing pointers
- Gate deadlocks on cancel: Ctrl+C and Esc close reply channels on all gate types (Choice, Freeform, Hybrid, Review)
- Labels hidden by long prompt: Labeled gates always use hybrid radio view regardless of prompt length
- Activity log indicator pushed off viewport: Fixed terminal row budget calculation
- 67 root-level analysis markdown files removed: Cleaned repo of stale LLM analysis artifacts
- Decision Audit Trail: Engine emits structured decision events to activity.jsonl
  - `decision_edge`: which edge was selected, at what priority level, with context snapshot
  - `decision_condition`: every condition evaluated with match result and context values
  - `decision_outcome`: node outcome status, context updates, token counts
  - `decision_restart`: restart count, cleared nodes, context snapshot
- Skipped Node State: Unvisited nodes show ⊘ (dim) when pipeline completes
- Topological Node Ordering: TUI sidebar uses execution order (Kahn's algorithm), not declaration order or BFS
- Complexity Enforcement: Makefile targets and pre-commit hooks enforce cyclomatic ≤ 15, cognitive ≤ 25, file size ≤ 500 LOC
- Pre-commit Quality Gates: Format, vet, build, test, race detector, coverage, dippin lint — all enforced on every commit
- Pipeline Test Scenarios: `.test.json` files for all three core pipelines with happy path and failure scenarios
- CLAUDE.md: Project rules, versioning policy, and architecture gotchas for AI-assisted development
- Subgraph Event Propagation: Child pipeline engines emit events visible to the parent TUI
- Per-Branch Parallel Config: Parallel fan-out nodes can override target node attributes per branch
- Per-Node Working Directory: `working_dir` attribute on agent and tool nodes for git worktree isolation
- Variable Interpolation: Full `${namespace.key}` syntax: `ctx.*`, `params.*`, `graph.*` namespaces
- Pipeline Examples: `ask_and_execute.dip`, `build_product.dip`, `build_product_with_superspec.dip`
- Major complexity refactoring: 35 cyclomatic violations → 0, 30 cognitive violations → 0, 7 oversized files → 0
  - `engine.go` (1002 lines, cyclomatic 61) → 4 files, max cyclomatic 12
  - `main.go` (1228 lines) → 8 focused files, max 378 lines
  - All 3 LLM adapters, codergen handler, parallel handler, condition evaluator, dippin adapter decomposed
- dippin-lang upgraded to v0.8.0 (explain, unused, graph, test commands; DIP121/DIP122 lint rules; exhaustive condition detection; model catalog with verified pricing)
- GoReleaser: quality gates in before hooks, grouped changelog (Features/Fixes/Other)
- CI workflow: full gate suite (format, vet, build, test, race, coverage, lint, doctor, complexity)
- TUI activity log: rewritten — per-node streams, line-level styling (no glamour), append-only with 10k line cap
- TUI human input: bubbles/textarea with wrapping, multiline, Ctrl+S submit, Esc cancel
- Build product pipeline: opus fix agent with 50 turns, per-milestone circuit breaker (3 attempts then escalate), known test failures support
- OpenAI SSE error handling: `error` and `response.failed` events parsed and surfaced as typed errors (was silently dropped)
- Non-retryable provider errors: quota, auth, model not found now crash immediately (was `OutcomeRetry`)
- Empty agent responses: zero-output sessions return `OutcomeFail` (was `OutcomeSuccess`)
- Parallel handler: navigates to join node via `suggested_next_nodes`; dispatches only branch targets; panic recovery in goroutines; emits stage events per branch
- Condition evaluator: resolves `ctx.*`, `context.*`, `internal.*` prefixes; handles infix negation; warns on unresolved variables
- Variable expansion: single-pass prevents infinite loops; malformed tokens skipped instead of stopping all expansion
- Freeform human gates: match response text against edge labels for routing
- Thinking spinner: emitted from agent events (with nodeID) not global LLM trace
- Activity log viewport: counts terminal rows, reserves indicator line, stable rendering
- Pipeline routing: removed unconditional fallbacks that caused infinite loops; merge conflicts escalate to human; ReadSpec/Decompose gated on success
- Provider naming: `gemini` not `google` everywhere
- Checkpoint: save failures use correct event type; per-node edge selections for deterministic resume
- All 25 example pipelines: grade A on `dippin doctor` (was 10 F's)
(See GitHub release for v0.7.0 changelog)
See GitHub releases for earlier versions.