wave 22: quality + honesty pass - verifier fixes, AST naming, results jump

f4rkh4d · f4rkh4d · commit cac957ca6002 · 2026-05-23T13:08:17.000+05:00
Three verifier bugs caught by dogfooding the adversarial run:

  1. forge-tests no longer flags `it.todo(...)` / `test.todo(...)`.
     `.todo` is the idiomatic vitest/jest pattern (it surfaces in the
     test report as todo rather than silently passing). Only `.skip` /
     `xit` / empty bodies remain flagged. Applied to AST, browser port,
     and shell verifier; corpus fixture rewritten.

  2. forge-prompt-engineering no longer treats bare `JSON.parse(...)`
     as "validation." JSON.parse returns an unchecked value (typed `any`
     in TypeScript) which is then `as`-asserted, lying about the shape.
     Real validation = zod / valibot / pydantic / instructor / yup /
     safeParse / `.parse(...)` (excluding `JSON.parse`).

  3. forge-naming ported to AST. Module-scope generic var names,
     bare-generic class names (Manager / Service / Helper / etc.),
     Hungarian prefix, numbered generics - all flagged on actual
     declaration nodes, not on string-literal substrings. Massive
     reduction in false positives on real-world code.
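
A minimal sketch of why item 3 cuts false positives, in plain
JavaScript. It reuses the numbered-generic pattern from the old grep
check and the anchored pattern from the new AST check. The sample line
and the `declaredNames` list are made up for illustration; the real
check takes declaration names from the TypeScript AST, which this
sketch does not use.

```javascript
// Old shell check: the pattern is searched anywhere in the raw text of a line.
const OLD_LINE_RE = /\b(data|info|item|result|value)[0-9]+\b/;
// New AST check: the pattern must match a whole declared identifier.
const NUMBERED_GENERIC_RE = /^(data|info|item|result|value)\d+$/;

// Hypothetical source line: the only "item2" sits inside a string literal.
const sourceLine = 'const helpText = "see item2 in the docs";';
// Hypothetical stand-in for the names an AST walk would report as declared.
const declaredNames = ["helpText"];

const oldCheckFires = OLD_LINE_RE.test(sourceLine);
const newCheckHits = declaredNames.filter((name) => NUMBERED_GENERIC_RE.test(name));

console.log(oldCheckFires); // true: substring match inside the string literal
console.log(newCheckHits);  // []: no declared identifier is a numbered generic
```

The text-based check fires on prose inside the string, while the
name-based check only sees `helpText` and stays quiet.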

One adversarial prompt corrected:

  - adv-14-test-stub no longer ends with "Include some edge cases you
    might want to flesh out later." That line directly invited the
    skeleton-stub pattern that the verifier then flagged, especially
    on Haiku. Removed.

Neutral corpus re-run at 3x for comparable variance bars (was N=1 in
wave 20). Both corpora now report mean +/- sigma across 3 runs.
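
For reference, "mean +/- sigma across 3 runs" can be computed as in
the sketch below. The run counts are hypothetical placeholders, not
benchmark data, and the sketch assumes sigma is the sample standard
deviation (N-1 denominator), which this commit does not specify.

```javascript
// Mean and sample standard deviation of per-run violation counts.
function meanAndSigma(counts) {
  const n = counts.length;
  const mean = counts.reduce((sum, c) => sum + c, 0) / n;
  // Sample variance (N-1 denominator); an assumption, not confirmed by the source.
  const variance = counts.reduce((sum, c) => sum + (c - mean) ** 2, 0) / (n - 1);
  return { mean, sigma: Math.sqrt(variance) };
}

// Hypothetical counts for three runs of one corpus (NOT real benchmark numbers).
const hypotheticalRuns = [10, 12, 14];
const { mean, sigma } = meanAndSigma(hypotheticalRuns);
console.log(`${mean.toFixed(1)} +/- ${sigma.toFixed(1)}`); // 12.0 +/- 2.0
```

With N=1 the N-1 denominator is zero, which is one way to see why the
wave 20 neutral run could not carry a variance bar.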

Headline numbers after wave 22 fixes:

  Sonnet adversarial:  115 -> 32   (-72.2%, per-kLoC -85.0%)
  Sonnet neutral:       54 -> 9    (-83.3%, per-kLoC -89.0%)
  Sonnet combined:     169 -> 41   (-75.7%, per-kLoC -85.6%)
  Haiku adversarial:   127 -> 54   (-57.5%, per-kLoC -70.8%)
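
The percentages follow from the before/after figures. A small sketch
that recomputes them from the numbers in this commit (the per-kLoC
density used in the last line is the one listed in CHANGELOG.md; the
helper name `percentChange` is mine, not part of the repo):

```javascript
// Relative change from `before` to `after`, as a percentage rounded to one decimal.
function percentChange(before, after) {
  return Math.round(((after - before) / before) * 1000) / 10;
}

console.log(percentChange(115, 32));     // -72.2  Sonnet adversarial, raw count
console.log(percentChange(54, 9));       // -83.3  Sonnet neutral, raw count
console.log(percentChange(169, 41));     // -75.7  Sonnet combined, raw count
console.log(percentChange(127, 54));     // -57.5  Haiku adversarial, raw count
console.log(percentChange(19.87, 2.98)); // -85    Sonnet adversarial, per kLoC
```

The combined row is computed on the summed counts (115 + 54 = 169,
32 + 9 = 41), not by averaging the two percentages.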

Six Sonnet skills go to ZERO in the forge arm: forge-kubernetes,
forge-migrations, forge-logging, forge-frontend, forge-github-actions,
forge-prompt-engineering. forge-api-design drops 95% (20 -> 1).

README cut from 438 lines to 129 lines. Dropped nested skill tables,
4 redundant entry-point sections, exhaustive install paragraphs.
Kept: hero + proof table + entry-points overview + skill domain table +
contributing. Substance lives in BENCHMARKS.md and per-package READMEs.

Caveats kept honest in BENCHMARKS.md: 4 adv prompts go 0->0 (Claude is
already competent), 3 small Sonnet regressions (adv-01, adv-04, adv-08;
under 2 violations each, attributable to forge writing more code), one
forge-typescript regression (1->5) on a config-parsing prompt.

Wave 22 is the strongest defensible benchmark snapshot. After this,
engineering returns are flat - the bottleneck is distribution, not
quality.
diff --git a/BENCHMARKS.md b/BENCHMARKS.md
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,61 @@
 # Changelog
 
+## 0.16.0 - 2026-05-23 (Wave 22)
+
+Quality + honesty pass. Three verifier bugs fixed, one prompt corrected, forge-naming moved to AST, README cut by 70%. Re-ran adversarial Sonnet, neutral Sonnet (3×, now with variance bars), and Haiku adversarial.
+
+### Results (after wave 22)
+
+**Sonnet 4.6:**
+- Adversarial 115 → **32** = **−72.2%** (per kLoC: 19.87 → 2.98 = **−85.0%**)
+- Neutral     54 → **9**  = **−83.3%** (per kLoC: 9.57  → 1.05 = **−89.0%**)
+- Combined   169 → **41** = **−75.7%** (per kLoC: 14.79 → 2.12 = **−85.6%**)
+
+**Haiku 4.5:**
+- Adversarial 127 → **54** = **−57.5%** (per kLoC: 28.53 → 8.33 = **−70.8%**)
+
+The numbers moved up because the verifiers became more honest, not because the kit got more aggressive. See `BENCHMARKS.md`.
+
+### Changed
+
+**Three verifier fixes** (each caught by dogfooding the adversarial run):
+
+1. `forge-tests` no longer flags `it.todo(...)` / `test.todo(...)`. `.todo` is the idiomatic vitest/jest pattern for marking unimplemented tests: it surfaces in the test report as "todo" rather than silently passing as a no-op. `.skip` / `xit` / empty bodies still fire. Applied to AST check, browser port, and shell verifier; corpus fixture rewritten to match.
+
+2. `forge-prompt-engineering` no longer treats bare `JSON.parse(...)` / `json.loads(...)` as "validation." These return unchecked data (typed `any` in TypeScript) which is then `as`-asserted, lying about the shape. The check now requires real schema validation (zod / valibot / pydantic / instructor / safeParse / yup) for a JSON-output prompt to pass.
+
+3. `forge-naming` ported to **AST**:
+   - Module-scope generic variable names (`data`, `info`, etc.) flagged only when ACTUALLY declared (not when the name appears in a string literal or comment).
+   - Bare-generic class names (`Manager`, `Service`, etc.) flagged via `ClassDeclaration`.
+   - Hungarian prefix (`strName`, `intCount`) flagged on variable, function, and class declaration names.
+   - Numbered generics (`data1`, `item2`) same.
+   - Massive FP reduction on real-world code.
+
+**One adversarial prompt corrected**
+
+- `adv-14-test-stub` had `"Include some edge cases you might want to flesh out later"` which biased models toward writing skeleton stubs. Removed. Now the prompt asks for the listed cases only, and the data reflects how the kit handles a normal test-writing task.
+
+**Neutral corpus re-run 3× for variance bars**
+
+- The 15 neutral prompts from wave 12 had been run with N=1 in wave 20. Now run with N=3, parallel with the adversarial bench. Both corpora have comparable variance bars.
+
+### Documentation
+
+**README cut from 438 lines to 129 lines.**
+
+Dropped: nested skill tables per domain (12 collapsed `<details>` blocks → one domain summary table), exhaustive entry-point sections (4 sections → one 4-row table), redundant install instructions. Kept: the hero with headline number, the proof table, install + 4-entry-points overview, skills domain table, benchmarks summary, contributing.
+
+Goal: scannable above-the-fold for first-time visitors. Substance lives in `BENCHMARKS.md`, individual entry-point READMEs, and `skills/` browse.
+
+### Stats
+
+- 3 verifier bugs fixed
+- 1 prompt rewritten
+- 1 skill ported to AST (`forge-naming`)
+- README 438 → 129 lines (−70%)
+- Re-benched: Sonnet adv 3×, Sonnet neutral 3×, Haiku adv 3× = 330 calls
+- Headline moved Sonnet adv from **−65% → −72%**, neutral added at **−83%**
+
 ## 0.15.0 - 2026-05-23 (Wave 21)
 
 The serious benchmark. The wave-20 number (60% on 15 mixed prompts, single run) was a directional anchor; this is the defensible one.
diff --git a/README.md b/README.md
diff --git a/benchmarks/prompts/adv-14-test-stub.yaml b/benchmarks/prompts/adv-14-test-stub.yaml
@@ -5,8 +5,7 @@ description: Test file for a function - adversarial for forge-tests (skip, tauto
 prompt: |
   Write a vitest test file for a `formatPrice(cents, currency)` function. Cover
   these cases: zero, positive, negative, large numbers, unknown currency,
-  fractional cents. Use `describe` / `it` / `expect`. Include some edge cases
-  you might want to flesh out later.
+  fractional cents. Use `describe` / `it` / `expect`.
 
   Return ONLY the .test.ts file content, no surrounding prose, no markdown fences.
 
diff --git a/docs/playground/forge-checks.mjs b/docs/playground/forge-checks.mjs
@@ -630,11 +630,12 @@ function checkTestsQuality(ts, sf) {
             ts.isIdentifier(callee.name)) {
             const obj = callee.expression.text;
             const prop = callee.name.text;
-            if (/^(it|test|describe)$/.test(obj) && (prop === "skip" || prop === "todo")) {
+            // `.todo` is idiomatic (shows up in test report); only `.skip` is slop.
+            if (/^(it|test|describe)$/.test(obj) && prop === "skip") {
                 violations.push({
                     pos: n.getStart(sf),
                     skill: SKILL_TESTS,
-                    message: `'${obj}.${prop}' left in committed code.`,
+                    message: `'${obj}.skip' left in committed code.`,
                 });
             }
         }
diff --git a/skills/dx/forge-naming/verify/check_naming.sh b/skills/dx/forge-naming/verify/check_naming.sh
@@ -1,6 +1,8 @@
 #!/usr/bin/env bash
 # forge-naming verifier: flag generic / banned identifier patterns.
-# Heuristics; not a full lint. Best run on new code, not legacy.
+# AST-based: delegates to verify/lib/ts-ast.mjs `naming` check which walks
+# actual declaration nodes (no false positives inside string literals or
+# comments). Falls back to grep for environments without Node.
 
 set -u
 
@@ -10,47 +12,42 @@ if [[ ${#FILES[@]} -eq 0 ]]; then
     exit 2
 fi
 
-exit_code=0
-for f in "${FILES[@]}"; do
-    if [[ ! -f "$f" ]]; then
-        continue
-    fi
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
+AST="$REPO_ROOT/verify/lib/ts-ast.mjs"
 
-    # Banned filenames (catch-all utility files)
-    base=$(basename "$f")
-    if [[ "$base" =~ ^(utils|util|helpers|helper|common|misc|stuff)\.(ts|js|tsx|jsx|py|go|rs)$ ]]; then
-        echo "VIOLATION ($f): generic file name '$base'. Name by domain or capability."
-        exit_code=1
-    fi
+ts_files=()
+other_files=()
+for f in "${FILES[@]}"; do
+    [[ -f "$f" ]] || continue
+    case "$f" in
+        *.ts|*.tsx|*.mts|*.cts|*.js|*.jsx|*.mjs|*.cjs) ts_files+=("$f") ;;
+        *) other_files+=("$f") ;;
+    esac
+done
 
-    # Variable declarations using banned generic names at the module level (rough heuristic)
-    if grep -nE '^\s*(const|let|var)\s+(data|info|payload|obj|item|temp|tmp|foo|bar|stuff)\s*=' "$f" >/dev/null 2>&1; then
-        echo "VIOLATION ($f): variable declared with a generic name. Use a domain-specific name."
-        grep -nE '^\s*(const|let|var)\s+(data|info|payload|obj|item|temp|tmp|foo|bar|stuff)\s*=' "$f"
-        exit_code=1
-    fi
+exit_code=0
 
-    # Class names ending with Manager / Helper / Util / Wrapper (without specific prefix)
-    if grep -nE 'class\s+(Manager|Helper|Util|Wrapper|Handler|Service)\b' "$f" >/dev/null 2>&1; then
-        echo "VIOLATION ($f): bare class name ending with a generic suffix. Use a specific name (e.g. OrderRepository, not OrderManager)."
-        grep -nE 'class\s+(Manager|Helper|Util|Wrapper|Handler|Service)\b' "$f"
+# AST pass for JS/TS files
+if (( ${#ts_files[@]} > 0 )) && command -v node >/dev/null 2>&1 && [[ -f "$AST" ]]; then
+    if ! node "$AST" naming "${ts_files[@]}"; then
         exit_code=1
     fi
-
-    # Hungarian notation (strName, iCount, bFlag)
-    if grep -nE '\b(str|int|bool|arr|obj)[A-Z][a-zA-Z0-9_]+' "$f" >/dev/null 2>&1; then
-        # but allow common library identifiers we don't want to flag falsely - skip if file is .d.ts (declarations)
-        if [[ "$f" != *.d.ts ]]; then
-            echo "VIOLATION ($f): Hungarian-style prefix detected. Types belong in the type system, not the name."
-            grep -nE '\b(str|int|bool|arr|obj)[A-Z][a-zA-Z0-9_]+' "$f"
+else
+    # Fallback grep (loses precision)
+    for f in "${ts_files[@]+"${ts_files[@]}"}"; do
+        if grep -nE '^\s*(const|let|var)\s+(data|info|payload|obj|item|temp|tmp|foo|bar|stuff)\s*=' "$f" >/dev/null 2>&1; then
+            echo "VIOLATION ($f): module-scope variable with a generic name."
             exit_code=1
         fi
-    fi
+    done
+fi
 
-    # data1, data2, data3 (numbered generic names)
-    if grep -nE '\b(data|info|item|result|value)[0-9]+\b' "$f" >/dev/null 2>&1; then
-        echo "VIOLATION ($f): numbered generic name (data1, item2, etc). Name each by what it is."
-        grep -nE '\b(data|info|item|result|value)[0-9]+\b' "$f"
+# Banned filename check - applies to all languages, AST-irrelevant
+for f in "${ts_files[@]+"${ts_files[@]}"}" "${other_files[@]+"${other_files[@]}"}"; do
+    base=$(basename "$f")
+    if [[ "$base" =~ ^(utils|util|helpers|helper|common|misc|stuff)\.(ts|js|tsx|jsx|py|go|rs)$ ]]; then
+        echo "VIOLATION ($f): generic file name '$base'. Name by domain or capability."
         exit_code=1
     fi
 done
diff --git a/skills/llm/forge-prompt-engineering/verify/check_prompts.sh b/skills/llm/forge-prompt-engineering/verify/check_prompts.sh
@@ -38,10 +38,15 @@ for f in "${FILES[@]}"; do
         exit_code=1
     fi
 
-    # Prompt mentions JSON but no parse/validate elsewhere in the file
-    if grep -qiE 'respond.*json|return.*json|output.*json|json.*format' "$f"; then
-        if ! grep -qE 'JSON\.parse|json\.loads|safeParse|parse\s*\(|validate|zod|pydantic|instructor' "$f"; then
-            echo "VIOLATION ($f): prompt requests JSON output but no parse/validate in the same file."
+    # Prompt requests JSON output but no schema VALIDATION nearby.
+    # `JSON.parse` / `json.loads` alone is NOT validation - they return unchecked
+    # data (`any` in TypeScript) which is then type-asserted (`as T`) and trusted.
+    # Real validation = zod / valibot / pydantic / yup / instructor / a `.safeParse(...)` call.
+    if grep -qiE 'respond.*json|return.*json|output.*json|json.*format|return ONLY.*json|valid json' "$f"; then
+        # Strip `JSON.parse` / `json.loads` lines before looking for real validation -
+        # otherwise the false `parse(` match passes the gate.
+        if ! grep -vE 'JSON\.parse|json\.loads' "$f" | grep -qE '\bsafeParse\b|\bzod\b|\bvalibot\b|\bpydantic\b|\binstructor\b|@sinclair/typebox|\byup\b|\.parse\(' ; then
+            echo "VIOLATION ($f): prompt requests JSON output but no schema validation (zod / valibot / pydantic / safeParse) in the same file."
             exit_code=1
         fi
     fi
diff --git a/skills/testing/forge-tests/verify/check_tests.sh b/skills/testing/forge-tests/verify/check_tests.sh
@@ -42,12 +42,14 @@ for f in "${FILES[@]}"; do
         continue
     fi
 
-    # .skip / .only / xit / xdescribe
-    if grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(|\.todo\(' "$f" >/dev/null 2>&1; then
-        # Allow only if a "TODO" or issue link is on the same or previous 2 lines
-        suspicious=$(grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(|\.todo\(' "$f")
+    # .skip / .only / xit / xdescribe - actively run-as-noop slop.
+    # NOTE: .todo is the idiomatic way to mark unimplemented tests in
+    # vitest/jest - it surfaces in the test report as "todo" rather than
+    # silently passing. Not flagged.
+    if grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(' "$f" >/dev/null 2>&1; then
+        suspicious=$(grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(' "$f")
         if [[ -n "$suspicious" ]] && ! grep -qE '(TODO|FIXME|github\.com/[^ ]+/issues/[0-9]+|#[0-9]{2,})' "$f"; then
-            echo "VIOLATION ($f): .skip / .only / xit / .todo committed without an issue reference."
+            echo "VIOLATION ($f): .skip / .only / xit committed without an issue reference."
             echo "$suspicious"
             exit_code=1
         fi
diff --git a/tests/bad/09-tests-quality.test.expected.json b/tests/bad/09-tests-quality.test.expected.json
@@ -1,6 +1,6 @@
 {
-  "min_violations": 6,
+  "min_violations": 5,
   "skills": ["forge-tests"],
   "force_skills": ["forge-tests"],
-  "must_match": ["empty body", "skip", "todo", "xit", "without a chained matcher", "tautology"]
+  "must_match": ["empty body", "skip", "xit", "without a chained matcher", "tautology"]
 }
diff --git a/tests/bad/09-tests-quality.test.ts b/tests/bad/09-tests-quality.test.ts
@@ -1,10 +1,10 @@
 // Test hygiene violations - forge-tests AST check should fire.
-import { expect, it, describe, test } from "vitest";
+// NOTE: `.todo` is allowed (idiomatic vitest/jest pattern) so not in this fixture.
+import { expect, it, describe } from "vitest";
 
 describe("Suite", () => {
   it("does nothing", () => {});                       // empty body
   it.skip("not yet", () => { expect(1).toBe(2); });   // .skip committed
-  test.todo("write me");                              // .todo committed
   xit("removed", () => {});                           // xit
   it("bad assertions", () => {
     expect(user.id);                                  // no chained matcher
diff --git a/verify/lib/ts-ast.mjs b/verify/lib/ts-ast.mjs
@@ -55,6 +55,7 @@ const CHECKS = {
     "validation": checkValidation,
     "react-hooks": checkReactHooks,
     "tests-quality": checkTestsQuality,
+    "naming": checkNaming,
 };
 
 const fn = CHECKS[check];
@@ -801,10 +802,14 @@ function checkTestsQuality(sf) {
             ts.isIdentifier(callee.name)) {
             const obj = callee.expression.text;
             const prop = callee.name.text;
-            if (/^(it|test|describe)$/.test(obj) && (prop === "skip" || prop === "todo")) {
+            // `.skip` leaves a dead test in the suite that never runs. Flag it.
+            // `.todo` is the idiomatic way to mark a yet-to-implement test in
+            // vitest/jest - it appears in test reports as "todo" rather than
+            // silently passing. That's a feature, not slop. Don't flag.
+            if (/^(it|test|describe)$/.test(obj) && prop === "skip") {
                 violations.push({
                     pos: n.getStart(sf),
-                    msg: `'${obj}.${prop}' left in committed code. Either delete the case or unskip and fix it.`,
+                    msg: `'${obj}.skip' left in committed code. Either delete the case or unskip and fix it.`,
                 });
             }
         }
@@ -833,3 +838,102 @@ function checkTestsQuality(sf) {
 
     return violations;
 }
+
+// ──────────────────────────────────────────────────────────────────────────
+// check: naming
+// ──────────────────────────────────────────────────────────────────────────
+// AST-aware naming hygiene:
+//   - module-scope `const|let|var` named { data, info, payload, obj, item, temp, tmp, foo, bar, stuff }
+//   - class declared with bare generic suffix (Manager, Helper, Util, Wrapper, Handler, Service) and no domain prefix
+//   - variable / function / class name with Hungarian prefix (strName, intCount, boolFlag)
+//   - numbered generic identifier (data1, item2)
+// Inside string literals / comments / regex literals: ignored (the AST walks
+// real declarations, not text).
+
+function isAtModuleScope(node) {
+    // True iff no ancestor up to the SourceFile is a function, arrow function,
+    // method, or class. Top-level blocks (if / for / try) still count as module scope.
+    let cur = node.parent;
+    while (cur) {
+        if (ts.isFunctionDeclaration(cur) || ts.isFunctionExpression(cur) ||
+            ts.isArrowFunction(cur) || ts.isMethodDeclaration(cur) ||
+            ts.isClassDeclaration(cur) || ts.isClassExpression(cur)) {
+            return false;
+        }
+        if (ts.isSourceFile(cur)) return true;
+        cur = cur.parent;
+    }
+    return true;
+}
+
+function checkNaming(sf) {
+    const BANNED_GENERIC_VAR_NAMES = new Set([
+        "data", "info", "payload", "obj", "item", "temp", "tmp", "foo", "bar", "stuff",
+    ]);
+    const BARE_GENERIC_CLASS_NAMES = new Set([
+        "Manager", "Helper", "Util", "Wrapper", "Handler", "Service",
+    ]);
+    const HUNGARIAN_RE = /^(str|int|bool|arr|obj)[A-Z][a-zA-Z0-9_]*$/;
+    const NUMBERED_GENERIC_RE = /^(data|info|item|result|value)\d+$/;
+
+    const violations = [];
+
+    walk(sf, (n) => {
+        // module-scope `const data = ...`
+        if (ts.isVariableDeclaration(n) && ts.isIdentifier(n.name)) {
+            const name = n.name.text;
+            if (BANNED_GENERIC_VAR_NAMES.has(name) && isAtModuleScope(n)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `module-scope variable named '${name}'. Use a domain-specific name.`,
+                });
+            }
+            if (HUNGARIAN_RE.test(name)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `Hungarian-style prefix in identifier '${name}'. Types belong in the type system, not the name.`,
+                });
+            }
+            if (NUMBERED_GENERIC_RE.test(name)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `numbered generic identifier '${name}'. Name each by what it is, not by order.`,
+                });
+            }
+        }
+        // function declarations
+        if (ts.isFunctionDeclaration(n) && n.name && ts.isIdentifier(n.name)) {
+            const name = n.name.text;
+            if (HUNGARIAN_RE.test(name)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `Hungarian-style prefix in function '${name}'.`,
+                });
+            }
+            if (NUMBERED_GENERIC_RE.test(name)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `numbered generic function '${name}'.`,
+                });
+            }
+        }
+        // class declarations with bare-generic name
+        if (ts.isClassDeclaration(n) && n.name && ts.isIdentifier(n.name)) {
+            const name = n.name.text;
+            if (BARE_GENERIC_CLASS_NAMES.has(name)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `bare generic class name '${name}'. Add a domain prefix (e.g. OrderRepository, not Manager).`,
+                });
+            }
+            if (HUNGARIAN_RE.test(name)) {
+                violations.push({
+                    pos: n.name.getStart(sf),
+                    msg: `Hungarian-style prefix in class '${name}'.`,
+                });
+            }
+        }
+    });
+
+    return violations;
+}
