Three verifier bugs caught by dogfooding the adversarial run:
1. forge-tests no longer flags `it.todo(...)` / `test.todo(...)`.
`.todo` is the idiomatic vitest/jest pattern: it surfaces in the
test report as todo instead of silently passing as a no-op. Only
`.skip` / `xit` / empty bodies remain flagged. Applied to the AST
check, browser port, and shell verifier; corpus fixture rewritten.
2. forge-prompt-engineering no longer treats bare `JSON.parse(...)`
as "validation." JSON.parse returns an unvalidated value (typed `any`
in TypeScript) that is then `as`-asserted, lying about the shape.
Real validation = zod / valibot / pydantic / instructor / yup /
safeParse / `.parse(...)` (excluding `JSON.parse`).
3. forge-naming ported to AST. Module-scope generic var names,
bare-generic class names (Manager / Service / Helper / etc.),
Hungarian prefix, numbered generics - all flagged on actual
declaration nodes, not on string-literal substrings. Massive
reduction in false positives on real-world code.
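The rule in fix 2 can be sketched roughly as follows. This is a minimal illustration, not the actual forge-prompt-engineering verifier: the function names and regular expressions are assumptions, and only the list of accepted markers comes from the text above.

```typescript
// Hypothetical sketch of the "real validation" heuristic from fix 2.
// The marker list comes from the commit message; the regexes are assumptions.
const VALIDATION_LIBRARY = /\b(zod|valibot|pydantic|instructor|yup)\b/;
const SAFE_PARSE_CALL = /\.safeParse\s*\(/;

// True when the source contains an `<identifier>.parse(` call whose
// receiver is not `JSON`, e.g. `UserSchema.parse(...)`.
function hasSchemaParseCall(source: string): boolean {
  const parseCall = /([A-Za-z_$][\w$]*)\s*\.\s*parse\s*\(/g;
  let match: RegExpExecArray | null;
  while ((match = parseCall.exec(source)) !== null) {
    if (match[1] !== "JSON") return true;
  }
  return false;
}

// A bare JSON.parse never counts as validation on its own.
function hasRealValidation(source: string): boolean {
  return (
    VALIDATION_LIBRARY.test(source) ||
    SAFE_PARSE_CALL.test(source) ||
    hasSchemaParseCall(source)
  );
}

const unvalidated = "const user = JSON.parse(raw) as User;";
const validated = "const user = UserSchema.parse(JSON.parse(raw));";
console.log(hasRealValidation(unvalidated), hasRealValidation(validated));
```

A text-level check like this misses chained calls such as `z.object({...}).parse(x)` unless the library name also appears in the source, which is one reason an AST-based check is the sturdier choice.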
One adversarial prompt corrected:
- adv-14-test-stub no longer ends with "Include some edge cases you
might want to flesh out later." That line directly invited the
skeleton-stub pattern that the verifier then flagged, especially
on Haiku. Removed.
Neutral corpus re-run at 3x for comparable variance bars (was N=1 in
wave 20). Both corpora now report mean +/- sigma across 3 runs.
Headline numbers after wave 22 fixes:
Sonnet adversarial: 115 -> 32 (-72.2%, per-kLoC -85.0%)
Sonnet neutral: 54 -> 9 (-83.3%, per-kLoC -89.0%)
Sonnet combined: 169 -> 41 (-75.7%, per-kLoC -85.6%)
Haiku adversarial: 127 -> 54 (-57.5%, per-kLoC -70.8%)
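The raw-count percentages above can be reproduced from the before/after violation counts. The sketch below uses a hypothetical helper, `percentChange`. The per-kLoC figures cannot be recomputed here because the line counts are not given in this message.

```typescript
// Relative change in violation count, as a percentage rounded to one decimal.
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 1000) / 10;
}

// Before/after counts copied from the headline numbers above.
const headline: Array<[string, number, number]> = [
  ["Sonnet adversarial", 115, 32],
  ["Sonnet neutral", 54, 9],
  ["Sonnet combined", 169, 41],
  ["Haiku adversarial", 127, 54],
];

for (const [label, before, after] of headline) {
  const change = percentChange(before, after).toFixed(1);
  console.log(`${label}: ${before} -> ${after} (${change}%)`);
}
```

The combined row is the sum of the two Sonnet rows (115 + 54 = 169 before, 32 + 9 = 41 after), so its percentage is computed from the summed counts rather than averaged from the two row percentages.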
Six Sonnet skills go to ZERO in the forge arm: forge-kubernetes,
forge-migrations, forge-logging, forge-frontend, forge-github-actions,
forge-prompt-engineering. forge-api-design drops 95% (20 -> 1).
README cut from 438 lines to 129 lines. Dropped nested skill tables,
4 redundant entry-point sections, exhaustive install paragraphs.
Kept: hero + proof table + entry-points overview + skill domain table +
contributing. Substance lives in BENCHMARKS.md and per-package READMEs.
Caveats kept honest in BENCHMARKS.md: 4 adv prompts go 0->0 (Claude is
already competent), 3 small Sonnet regressions (adv-01, adv-04, adv-08:
under 2 violations each, attributable to forge writing more code), and
one forge-typescript regression (1->5) on a config-parsing prompt.
Wave 22 is the strongest defensible benchmark snapshot. After this,
engineering returns are flat - the bottleneck is distribution, not
quality.
CHANGELOG.md (56 additions, 0 deletions)
@@ -1,5 +1,61 @@
# Changelog

## 0.16.0 - 2026-05-23 (Wave 22)

Quality + honesty pass. Three verifier bugs fixed, one prompt corrected, forge-naming moved to AST, README cut by 70%. Re-ran adversarial Sonnet, neutral Sonnet (3×, now with variance bars), and Haiku adversarial.
The numbers moved up because the verifiers became more honest, not because the kit got more aggressive. See `BENCHMARKS.md`.
### Changed

**Three verifier fixes** (each caught by dogfooding the adversarial run):

1. `forge-tests` no longer flags `it.todo(...)` / `test.todo(...)`. `.todo` is the idiomatic vitest/jest pattern for marking unimplemented tests: it surfaces in the test report as "todo" instead of running as a silent no-op. `.skip` / `xit` / empty bodies still fire. Applied to the AST check, browser port, and shell verifier; corpus fixture rewritten to match.
2. `forge-prompt-engineering` no longer treats bare `JSON.parse(...)` / `json.loads(...)` as "validation." These return unvalidated data (typed `any` in TypeScript) that is then `as`-asserted, lying about the shape. The check now requires real schema validation (zod / valibot / pydantic / instructor / safeParse / yup) for a JSON-output prompt to pass.
3. `forge-naming` ported to **AST**:
   - Module-scope generic variable names (`data`, `info`, etc.) flagged only when actually declared (not when the name appears in a string literal or comment).
   - Bare-generic class names (`Manager`, `Service`, etc.) flagged via `ClassDeclaration`.
   - Hungarian prefixes (`strX`, `iCount`) flagged via `Identifier` declaration nodes.
   - Numbered generics (`data1`, `item2`) flagged the same way.
   - Massive false-positive reduction on real-world code.
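
The naming rules listed above can be illustrated as predicates over names that have already been extracted from declaration nodes. This is a sketch under assumptions, not the verifier's code: the word lists, prefix list, and function names are hypothetical, and the AST extraction step itself (which needs a parser) is left out.

```typescript
type NamingViolation = "generic" | "numbered" | "hungarian" | null;

// Hypothetical word and prefix lists, built from the examples in this changelog.
const GENERIC_VARIABLE = /^(data|info)$/;
const NUMBERED_GENERIC = /^(data|info|item)\d+$/;
const HUNGARIAN_PREFIX = /^(str|i)[A-Z]/;
const BARE_GENERIC_CLASS = /^(Manager|Service|Helper)$/;

// Classifies a declared variable name. Callers are assumed to pass names
// taken from declaration nodes only, never from string literals or comments.
function classifyVariableName(name: string): NamingViolation {
  if (GENERIC_VARIABLE.test(name)) return "generic";
  if (NUMBERED_GENERIC.test(name)) return "numbered";
  if (HUNGARIAN_PREFIX.test(name)) return "hungarian";
  return null;
}

// A class name is flagged only when it is the bare generic word itself,
// so `SessionManager` passes while `Manager` does not.
function isBareGenericClassName(name: string): boolean {
  return BARE_GENERIC_CLASS.test(name);
}

console.log(classifyVariableName("item2"), isBareGenericClassName("Manager"));
```

Because the predicates only ever see declared names, a string such as `"missing data"` in a log message never reaches them, which is where the false-positive reduction comes from.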

**One adversarial prompt corrected**

- `adv-14-test-stub` had `"Include some edge cases you might want to flesh out later"`, which biased models toward writing skeleton stubs. Removed. The prompt now asks for the listed cases only, and the data reflects how the kit handles a normal test-writing task.

**Neutral corpus re-run at N=3 for variance bars**

- The 15 neutral prompts from wave 12 had been run with N=1 in wave 20. They are now run with N=3, in parallel with the adversarial bench, so both corpora have comparable variance bars.
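
For reference, a mean and sigma over three runs can be computed as below. The run counts are hypothetical placeholders, not benchmark data, and the use of the sample standard deviation (dividing by N-1) is an assumption, since the changelog does not say which estimator is reported.

```typescript
// Mean and sample standard deviation (N-1 denominator) of per-run counts.
function meanAndSigma(counts: number[]): { mean: number; sigma: number } {
  const n = counts.length;
  const mean = counts.reduce((sum, c) => sum + c, 0) / n;
  const variance =
    counts.reduce((sum, c) => sum + (c - mean) ** 2, 0) / (n - 1);
  return { mean, sigma: Math.sqrt(variance) };
}

// Hypothetical violation counts from three runs of one corpus.
const runs = [8, 9, 10];
const { mean, sigma } = meanAndSigma(runs);
console.log(`${mean.toFixed(1)} +/- ${sigma.toFixed(1)}`);
```

With N=1 the sample variance is undefined (division by zero), which is why a single run cannot produce a variance bar at all.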
### Documentation

**README cut from 438 lines to 129 lines.**

Dropped: nested skill tables per domain (12 collapsed `<details>` blocks → one domain summary table), exhaustive entry-point sections (4 sections → one 4-row table), redundant install instructions. Kept: the hero with headline number, the proof table, install + 4-entry-points overview, skills domain table, benchmarks summary, contributing.

Goal: scannable above-the-fold for first-time visitors. Substance lives in `BENCHMARKS.md`, individual entry-point READMEs, and `skills/` browse.