Commit cac957c

wave 22: quality + honesty pass - verifier fixes, AST naming, results jump
Three verifier bugs caught by dogfooding the adversarial run:

1. forge-tests no longer flags `it.todo(...)` / `test.todo(...)`. `.todo` is the idiomatic vitest/jest pattern (surfaces in test report as todo, doesn't no-op). Only `.skip` / `xit` / empty bodies remain flagged. Applied to AST, browser port, and shell verifier; corpus fixture rewritten.
2. forge-prompt-engineering no longer treats bare `JSON.parse(...)` as "validation." JSON.parse yields unknown which is then `as`-asserted, lying about the shape. Real validation = zod / valibot / pydantic / instructor / yup / safeParse / `.parse(...)` (excluding `JSON.parse`).
3. forge-naming ported to AST. Module-scope generic var names, bare-generic class names (Manager / Service / Helper / etc.), Hungarian prefix, numbered generics - all flagged on actual declaration nodes, not on string-literal substrings. Massive reduction in false positives on real-world code.

One adversarial prompt corrected:

- adv-14-test-stub no longer ends with "Include some edge cases you might want to flesh out later." That line directly invited the skeleton-stub pattern that the verifier then flagged, especially on Haiku. Removed.

Neutral corpus re-run at 3x for comparable variance bars (was N=1 in wave 20). Both corpora now report mean +/- sigma across 3 runs.

Headline numbers after wave 22 fixes:

Sonnet adversarial: 115 -> 32 (-72.2%, per-kLoC -85.0%)
Sonnet neutral: 54 -> 9 (-83.3%, per-kLoC -89.0%)
Sonnet combined: 169 -> 41 (-75.7%, per-kLoC -85.6%)
Haiku adversarial: 127 -> 54 (-57.5%, per-kLoC -70.8%)

Six Sonnet skills go to ZERO in the forge arm: forge-kubernetes, forge-migrations, forge-logging, forge-frontend, forge-github-actions, forge-prompt-engineering. forge-api-design drops 95% (20 -> 1).

README cut from 438 lines to 129 lines. Dropped nested skill tables, 4 redundant entry-point sections, exhaustive install paragraphs. Kept: hero + proof table + entry-points overview + skill domain table + contributing. Substance lives in BENCHMARKS.md and per-package READMEs.

Caveats kept honest in BENCHMARKS.md: 4 adv prompts go 0->0 (Claude is already competent), 3 small Sonnet regressions (adv-01, adv-04, adv-08 under 2 violations each, attributable to forge writing more code), one forge-typescript regression (1->5) on a config-parsing prompt.

Wave 22 is the strongest defensible benchmark snapshot. After this, engineering returns are flat - the bottleneck is distribution, not quality.
1 parent d0baff4 commit cac957c
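
The headline percentages in the commit message are plain relative changes in raw violation counts. A minimal sketch of that arithmetic follows; `pct_change` is a hypothetical helper written for this illustration and is not part of the repository.

```bash
#!/usr/bin/env bash
set -u

# pct_change BEFORE AFTER
# Prints the relative change from BEFORE to AFTER as a signed percentage
# with one decimal place. Hypothetical helper, not part of the repository.
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%+.1f%%\n", (after - before) / before * 100 }'
}

pct_change 115 32    # Sonnet adversarial -> -72.2%
pct_change 54 9      # Sonnet neutral     -> -83.3%
pct_change 169 41    # Sonnet combined    -> -75.7%
pct_change 127 54    # Haiku adversarial  -> -57.5%
```

The per-kLoC figures use the same formula on violations per thousand lines of generated code, so they can shrink faster than the raw counts when the forge arm writes more code.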

11 files changed

Lines changed: 397 additions & 541 deletions

BENCHMARKS.md

Lines changed: 101 additions & 105 deletions
Large diffs are not rendered by default.

CHANGELOG.md

Lines changed: 56 additions & 0 deletions
@@ -1,5 +1,61 @@
 # Changelog
 
+## 0.16.0 - 2026-05-23 (Wave 22)
+
+Quality + honesty pass. Three verifier bugs fixed, one prompt corrected, forge-naming moved to AST, README cut by 70%. Re-ran adversarial Sonnet, neutral Sonnet (3×, newly variance-bared), and Haiku adversarial.
+
+### Results (after wave 22)
+
+**Sonnet 4.6:**
+- Adversarial 115 → **32** = **−72.2%** (per kLoC: 19.87 → 2.98 = **−85.0%**)
+- Neutral 54 → **9** = **−83.3%** (per kLoC: 9.57 → 1.05 = **−89.0%**)
+- Combined 169 → **41** = **−75.7%** (per kLoC: 14.79 → 2.12 = **−85.6%**)
+
+**Haiku 4.5:**
+- Adversarial 127 → **54** = **−57.5%** (per kLoC: 28.53 → 8.33 = **−70.8%**)
+
+The numbers moved up because the verifiers became more honest, not because the kit got more aggressive. See `BENCHMARKS.md`.
+
+### Changed
+
+**Three verifier fixes** (each caught by dogfooding the adversarial run):
+
+1. `forge-tests` no longer flags `it.todo(...)` / `test.todo(...)`. `.todo` is the idiomatic vitest/jest pattern for marking unimplemented tests - it surfaces in the test report as "todo," doesn't run as no-op. `.skip` / `xit` / empty bodies still fire. Applied to AST check, browser port, and shell verifier; corpus fixture rewritten to match.
+
+2. `forge-prompt-engineering` no longer treats bare `JSON.parse(...)` / `json.loads(...)` as "validation." These yield `unknown` which is then `as`-asserted, lying about the shape. The check now requires real schema validation (zod / valibot / pydantic / instructor / safeParse / yup) for a JSON-output prompt to pass.
+
+3. `forge-naming` ported to **AST**:
+   - Module-scope generic variable names (`data`, `info`, etc.) flagged only when ACTUALLY declared (not when the name appears in a string literal or comment).
+   - Bare-generic class names (`Manager`, `Service`, etc.) flagged via `ClassDeclaration`.
+   - Hungarian prefix (`strX`, `iCount`) flagged via `Identifier` declaration nodes.
+   - Numbered generics (`data1`, `item2`) same.
+   - Massive FP reduction on real-world code.
+
+**One adversarial prompt corrected**
+
+- `adv-14-test-stub` had `"Include some edge cases you might want to flesh out later"` which biased models toward writing skeleton stubs. Removed. Now the prompt asks for the listed cases only, and the data reflects how the kit handles a normal test-writing task.
+
+**Neutral corpus re-run at 3× variance**
+
+- The 15 neutral prompts from wave 12 had been run with N=1 in wave 20. Now run with N=3, parallel with the adversarial bench. Both corpora have comparable variance bars.
+
+### Documentation
+
+**README cut from 438 lines to 129 lines.**
+
+Dropped: nested skill tables per domain (12 collapsed `<details>` blocks → one domain summary table), exhaustive entry-point sections (4 sections → one 4-row table), redundant install instructions. Kept: the hero with headline number, the proof table, install + 4-entry-points overview, skills domain table, benchmarks summary, contributing.
+
+Goal: scannable above-the-fold for first-time visitors. Substance lives in `BENCHMARKS.md`, individual entry-point READMEs, and `skills/` browse.
+
+### Stats
+
+- 3 verifier bugs fixed
+- 1 prompt rewritten
+- 1 skill ported to AST (`forge-naming`)
+- README 438 → 129 lines (−70%)
+- Re-benched: Sonnet adv 3×, Sonnet neutral 3×, Haiku adv 3× = 330 calls
+- Headline moved Sonnet adv from **−65% → −72%**, neutral added at **−83%**
+
 ## 0.15.0 - 2026-05-23 (Wave 21)
 
 The serious benchmark. The wave-20 number (60% on 15 mixed prompts, single run) was a directional anchor; this is the defensible one.

README.md

Lines changed: 79 additions & 383 deletions
Large diffs are not rendered by default.

benchmarks/prompts/adv-14-test-stub.yaml

Lines changed: 1 addition & 2 deletions
@@ -5,8 +5,7 @@ description: Test file for a function - adversarial for forge-tests (skip, tauto
 prompt: |
   Write a vitest test file for a `formatPrice(cents, currency)` function. Cover
   these cases: zero, positive, negative, large numbers, unknown currency,
-  fractional cents. Use `describe` / `it` / `expect`. Include some edge cases
-  you might want to flesh out later.
+  fractional cents. Use `describe` / `it` / `expect`.
 
   Return ONLY the .test.ts file content, no surrounding prose, no markdown fences.
 

docs/playground/forge-checks.mjs

Lines changed: 3 additions & 2 deletions
@@ -630,11 +630,12 @@ function checkTestsQuality(ts, sf) {
           ts.isIdentifier(callee.name)) {
         const obj = callee.expression.text;
         const prop = callee.name.text;
-        if (/^(it|test|describe)$/.test(obj) && (prop === "skip" || prop === "todo")) {
+        // `.todo` is idiomatic (shows up in test report); only `.skip` is slop.
+        if (/^(it|test|describe)$/.test(obj) && prop === "skip") {
           violations.push({
             pos: n.getStart(sf),
             skill: SKILL_TESTS,
-            message: `'${obj}.${prop}' left in committed code.`,
+            message: `'${obj}.skip' left in committed code.`,
           });
         }
       }

skills/dx/forge-naming/verify/check_naming.sh

Lines changed: 31 additions & 34 deletions
@@ -1,6 +1,8 @@
 #!/usr/bin/env bash
 # forge-naming verifier: flag generic / banned identifier patterns.
-# Heuristics; not a full lint. Best run on new code, not legacy.
+# AST-based: delegates to verify/lib/ts-ast.mjs `naming` check which walks
+# actual declaration nodes (no false positives inside string literals or
+# comments). Falls back to grep for environments without Node.
 
 set -u
 
@@ -10,47 +12,42 @@ if [[ ${#FILES[@]} -eq 0 ]]; then
   exit 2
 fi
 
-exit_code=0
-for f in "${FILES[@]}"; do
-  if [[ ! -f "$f" ]]; then
-    continue
-  fi
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
+AST="$REPO_ROOT/verify/lib/ts-ast.mjs"
 
-  # Banned filenames (catch-all utility files)
-  base=$(basename "$f")
-  if [[ "$base" =~ ^(utils|util|helpers|helper|common|misc|stuff)\.(ts|js|tsx|jsx|py|go|rs)$ ]]; then
-    echo "VIOLATION ($f): generic file name '$base'. Name by domain or capability."
-    exit_code=1
-  fi
+ts_files=()
+other_files=()
+for f in "${FILES[@]}"; do
+  [[ -f "$f" ]] || continue
+  case "$f" in
+    *.ts|*.tsx|*.mts|*.cts|*.js|*.jsx|*.mjs|*.cjs) ts_files+=("$f") ;;
+    *) other_files+=("$f") ;;
+  esac
+done
 
-  # Variable declarations using banned generic names at the module level (rough heuristic)
-  if grep -nE '^\s*(const|let|var)\s+(data|info|payload|obj|item|temp|tmp|foo|bar|stuff)\s*=' "$f" >/dev/null 2>&1; then
-    echo "VIOLATION ($f): variable declared with a generic name. Use a domain-specific name."
-    grep -nE '^\s*(const|let|var)\s+(data|info|payload|obj|item|temp|tmp|foo|bar|stuff)\s*=' "$f"
-    exit_code=1
-  fi
+exit_code=0
 
-  # Class names ending with Manager / Helper / Util / Wrapper (without specific prefix)
-  if grep -nE 'class\s+(Manager|Helper|Util|Wrapper|Handler|Service)\b' "$f" >/dev/null 2>&1; then
-    echo "VIOLATION ($f): bare class name ending with a generic suffix. Use a specific name (e.g. OrderRepository, not OrderManager)."
-    grep -nE 'class\s+(Manager|Helper|Util|Wrapper|Handler|Service)\b' "$f"
+# AST pass for JS/TS files
+if (( ${#ts_files[@]} > 0 )) && command -v node >/dev/null 2>&1 && [[ -f "$AST" ]]; then
+  if ! node "$AST" naming "${ts_files[@]}"; then
     exit_code=1
   fi
-
-  # Hungarian notation (strName, iCount, bFlag)
-  if grep -nE '\b(str|int|bool|arr|obj)[A-Z][a-zA-Z0-9_]+' "$f" >/dev/null 2>&1; then
-    # but allow common library identifiers we don't want to flag falsely - skip if file is .d.ts (declarations)
-    if [[ "$f" != *.d.ts ]]; then
-      echo "VIOLATION ($f): Hungarian-style prefix detected. Types belong in the type system, not the name."
-      grep -nE '\b(str|int|bool|arr|obj)[A-Z][a-zA-Z0-9_]+' "$f"
+else
+  # Fallback grep (loses precision)
+  for f in "${ts_files[@]+"${ts_files[@]}"}"; do
+    if grep -nE '^\s*(const|let|var)\s+(data|info|payload|obj|item|temp|tmp|foo|bar|stuff)\s*=' "$f" >/dev/null 2>&1; then
+      echo "VIOLATION ($f): module-scope variable with a generic name."
       exit_code=1
     fi
-  fi
+  done
+fi
 
-  # data1, data2, data3 (numbered generic names)
-  if grep -nE '\b(data|info|item|result|value)[0-9]+\b' "$f" >/dev/null 2>&1; then
-    echo "VIOLATION ($f): numbered generic name (data1, item2, etc). Name each by what it is."
-    grep -nE '\b(data|info|item|result|value)[0-9]+\b' "$f"
+# Banned filename check - applies to all languages, AST-irrelevant
+for f in "${ts_files[@]+"${ts_files[@]}"}" "${other_files[@]+"${other_files[@]}"}"; do
+  base=$(basename "$f")
+  if [[ "$base" =~ ^(utils|util|helpers|helper|common|misc|stuff)\.(ts|js|tsx|jsx|py|go|rs)$ ]]; then
+    echo "VIOLATION ($f): generic file name '$base'. Name by domain or capability."
     exit_code=1
   fi
 done
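
The new script routes files by extension before choosing a check, then applies the filename rule to everything. A minimal sketch of those two decisions in isolation follows. The patterns are copied from the diff above; the function names `route_file` and `has_banned_basename` and the sample paths are hypothetical and exist only for this illustration.

```bash
#!/usr/bin/env bash
set -u

# route_file PATH
# Prints "ast" for extensions the script sends to the AST pass, else "other".
route_file() {
  case "$1" in
    *.ts|*.tsx|*.mts|*.cts|*.js|*.jsx|*.mjs|*.cjs) echo "ast" ;;
    *) echo "other" ;;
  esac
}

# has_banned_basename PATH
# Succeeds when the basename is one of the catch-all names in the diff's regex.
has_banned_basename() {
  local base
  base=$(basename "$1")
  [[ "$base" =~ ^(utils|util|helpers|helper|common|misc|stuff)\.(ts|js|tsx|jsx|py|go|rs)$ ]]
}

# Illustrative paths only; none of these need to exist on disk.
for path in src/orders/pricing.ts lib/utils.py scripts/helpers.mjs; do
  if has_banned_basename "$path"; then verdict="banned"; else verdict="ok"; fi
  echo "$path -> $(route_file "$path"), $verdict"
done
```

With the regex as written, `scripts/helpers.mjs` is routed to the AST pass but is not reported as a banned filename, because `mjs` is not in the extension alternation. Whether that is intended is not stated in the commit.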

skills/llm/forge-prompt-engineering/verify/check_prompts.sh

Lines changed: 9 additions & 4 deletions
@@ -38,10 +38,15 @@ for f in "${FILES[@]}"; do
     exit_code=1
   fi
 
-  # Prompt mentions JSON but no parse/validate elsewhere in the file
-  if grep -qiE 'respond.*json|return.*json|output.*json|json.*format' "$f"; then
-    if ! grep -qE 'JSON\.parse|json\.loads|safeParse|parse\s*\(|validate|zod|pydantic|instructor' "$f"; then
-      echo "VIOLATION ($f): prompt requests JSON output but no parse/validate in the same file."
+  # Prompt requests JSON output but no schema VALIDATION nearby.
+  # `JSON.parse` / `json.loads` alone is NOT validation - they yield `unknown`
+  # which is then type-asserted (`as T`) and trusted. Real validation = zod /
+  # valibot / pydantic / yup / instructor / a `.safeParse(...)` call.
+  if grep -qiE 'respond.*json|return.*json|output.*json|json.*format|return ONLY.*json|valid json' "$f"; then
+    # Strip `JSON.parse` / `json.loads` lines before looking for real validation -
+    # otherwise the false `parse(` match passes the gate.
+    if ! grep -vE 'JSON\.parse|json\.loads' "$f" | grep -qE '\bsafeParse\b|\bzod\b|\bvalibot\b|\bpydantic\b|\binstructor\b|@sinclair/typebox|\byup\b|\.parse\(' ; then
+      echo "VIOLATION ($f): prompt requests JSON output but no schema validation (zod / valibot / pydantic / safeParse) in the same file."
       exit_code=1
     fi
   fi
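
The two-stage `grep` in this hunk works line by line: lines mentioning `JSON.parse` or `json.loads` are dropped first, and only the remaining lines are searched for a validator. A small sketch of that gate on throwaway files follows. The patterns are copied from the diff; the function name `has_schema_validation`, the file names, and the file contents are hypothetical.

```bash
#!/usr/bin/env bash
set -u

# has_schema_validation FILE
# Succeeds when FILE mentions a validator on a line that does not also
# contain JSON.parse / json.loads (same pipeline as the hunk above).
has_schema_validation() {
  grep -vE 'JSON\.parse|json\.loads' "$1" \
    | grep -qE '\bsafeParse\b|\bzod\b|\bvalibot\b|\bpydantic\b|\binstructor\b|@sinclair/typebox|\byup\b|\.parse\('
}

report() {
  if has_schema_validation "$1"; then
    echo "$(basename "$1"): pass"
  else
    echo "$(basename "$1"): VIOLATION"
  fi
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Bare parse plus a type assertion: the only line is stripped, nothing matches.
printf 'const reply = JSON.parse(raw) as Invoice;\n' > "$workdir/asserted.ts"

# A validator import on its own line survives the strip and matches \bzod\b.
printf 'import { z } from "zod";\nconst reply = Invoice.parse(JSON.parse(raw));\n' \
  > "$workdir/validated.ts"

# Validator call sharing a line with JSON.parse, and no other mention of it.
printf 'const reply = Invoice.parse(JSON.parse(raw));\n' > "$workdir/same-line.ts"

report "$workdir/asserted.ts"    # VIOLATION
report "$workdir/validated.ts"   # pass
report "$workdir/same-line.ts"   # VIOLATION
```

The third case shows a consequence of stripping whole lines: a schema `.parse(` call written on the same line as `JSON.parse` is only credited if the validator is also mentioned elsewhere in the file, for example in an import.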

skills/testing/forge-tests/verify/check_tests.sh

Lines changed: 7 additions & 5 deletions
@@ -42,12 +42,14 @@ for f in "${FILES[@]}"; do
     continue
   fi
 
-  # .skip / .only / xit / xdescribe
-  if grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(|\.todo\(' "$f" >/dev/null 2>&1; then
-    # Allow only if a "TODO" or issue link is on the same or previous 2 lines
-    suspicious=$(grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(|\.todo\(' "$f")
+  # .skip / .only / xit / xdescribe - actively run-as-noop slop.
+  # NOTE: .todo is the idiomatic way to mark unimplemented tests in
+  # vitest/jest - it surfaces in the test report as "todo" rather than
+  # silently passing. Not flagged.
+  if grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(' "$f" >/dev/null 2>&1; then
+    suspicious=$(grep -nE '\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\(' "$f")
     if [[ -n "$suspicious" ]] && ! grep -qE '(TODO|FIXME|github\.com/[^ ]+/issues/[0-9]+|#[0-9]{2,})' "$f"; then
-      echo "VIOLATION ($f): .skip / .only / xit / .todo committed without an issue reference."
+      echo "VIOLATION ($f): .skip / .only / xit committed without an issue reference."
       echo "$suspicious"
       exit_code=1
     fi
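
To see the effect of dropping `\.todo\(` from the pattern, the gate in this hunk can be exercised on small sample files. The following is a sketch under assumptions: both regexes are copied from the diff, while the function name `check_disabled_tests` and the sample file contents are hypothetical.

```bash
#!/usr/bin/env bash
set -u

DISABLED_TEST_PATTERN='\b(it|test|describe)\.(skip|only)\(|\bx(it|describe|test)\('
ISSUE_REF_PATTERN='(TODO|FIXME|github\.com/[^ ]+/issues/[0-9]+|#[0-9]{2,})'

# check_disabled_tests FILE
# Returns 1 and prints the offending lines when FILE contains a disabled or
# focused test and no TODO / FIXME / issue reference anywhere in the file.
check_disabled_tests() {
  local file=$1 suspicious
  suspicious=$(grep -nE "$DISABLED_TEST_PATTERN" "$file") || return 0
  if grep -qE "$ISSUE_REF_PATTERN" "$file"; then
    return 0
  fi
  echo "VIOLATION ($file): .skip / .only / xit committed without an issue reference."
  echo "$suspicious"
  return 1
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'test.todo("write me");\n' > "$workdir/todo-only.test.ts"
printf 'it.skip("not yet", () => {});\nxit("removed", () => {});\n' \
  > "$workdir/skipped.test.ts"

check_disabled_tests "$workdir/todo-only.test.ts" && echo "todo-only: clean"
check_disabled_tests "$workdir/skipped.test.ts" || echo "skipped: flagged"
```

Note that the issue-reference check is file-wide, as in the original script: a single `TODO` or `#123` anywhere in the file silences every match in that file.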
Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
-  "min_violations": 6,
+  "min_violations": 5,
   "skills": ["forge-tests"],
   "force_skills": ["forge-tests"],
-  "must_match": ["empty body", "skip", "todo", "xit", "without a chained matcher", "tautology"]
+  "must_match": ["empty body", "skip", "xit", "without a chained matcher", "tautology"]
 }

tests/bad/09-tests-quality.test.ts

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
 // Test hygiene violations - forge-tests AST check should fire.
-import { expect, it, describe, test } from "vitest";
+// NOTE: `.todo` is allowed (idiomatic vitest/jest pattern) so not in this fixture.
+import { expect, it, describe } from "vitest";
 
 describe("Suite", () => {
   it("does nothing", () => {}); // empty body
   it.skip("not yet", () => { expect(1).toBe(2); }); // .skip committed
-  test.todo("write me"); // .todo committed
   xit("removed", () => {}); // xit
   it("bad assertions", () => {
     expect(user.id); // no chained matcher
