| name | skillify |
|---|---|
| version | 1.1.0 |
| description | The meta skill. Turn any raw feature into a properly-skilled, tested, resolvable unit of agent capability. Cross-modal eval is the recommended Phase 3 quality gate: 3 frontier models from different providers critique the output, you iterate to quality, THEN write tests that lock in the proven-good behavior. |
| triggers | |
| tools | |
| mutating | true |

Relationship to `/cross-modal-review`: That skill is the manual mid-flow "second opinion" gate (one model reviews work product before commit). This skill's Phase 3 below uses `gbrain eval cross-modal` instead: three different-provider frontier models score-and-iterate on a documented dimension list before tests cement behavior. Use `/cross-modal-review` for ad-hoc second opinions; use Phase 3 here when skillifying a feature.

A feature is "properly skilled" when all 11 checklist items pass. Item 3 (cross-modal eval) is informational in v1.1.0 — it does not gate the skillpack-check audit, but a missing or stale receipt is surfaced so the user knows where the gate stands.
□ 1. SKILL.md — skill file with frontmatter + contract + phases
□ 2. Code — deterministic script if applicable
□ 3. Cross-modal eval — 3 frontier models from 3 providers; informational
□ 4. Unit tests — cover every branch of deterministic logic
□ 5. Integration tests — exercise live endpoints
□ 6. LLM evals — quality/correctness cases for LLM-involving steps
□ 7. Resolver trigger — entry in skills/RESOLVER.md with real user trigger phrases
□ 8. Resolver eval — test that triggers route to this skill
□ 9. Check-resolvable — DRY + MECE audit, no orphans
□ 10. E2E test — smoke test: trigger → side effect
□ 11. Brain filing — if it writes pages, entry in brain/RESOLVER.md
Before skillifying, check:
- Will this be invoked 2+ times? (One-off work ≠ skill)
- Is there >20 lines of logic? (Trivial helpers don't need full infrastructure)
- Does it have a clear trigger phrase a user would actually say?
If no to all three, it's a script, not a skill. Move on.
Feature: [name]
Code: [path]
Missing items: [check each of the 11]
---
name: my-skill
version: 1.0.0
description: |
One paragraph. What it does, when to use it.
triggers:
- "trigger phrase users actually say"
- "another real trigger"
tools:
- exec
- read
- write
mutating: false # true if it writes to brain/disk
---

Body must include: Contract (what it guarantees), Phases (step-by-step), Output Format (what it produces).
Extract deterministic code into scripts/*.ts.
Tests lock in behavior. If the behavior is mediocre, tests lock in mediocrity. Cross-modal eval proves the quality bar FIRST, then tests cement it.
Choose the input that exercises the skill's hardest documented use case. If unsure: use the primary trigger example from SKILL.md, or the most complex real-world input from the last 7 days of memory files.
Run the skill on the representative input. The OUTPUT FILE is what gets evaluated.
gbrain eval cross-modal \
--task "What this skill is supposed to accomplish" \
--output skills/<slug>/SKILL.md

The command runs 3 frontier models from 3 different providers in parallel,
scores the OUTPUT against the TASK on 5 documented dimensions, and writes a
receipt under ~/.gbrain/.gbrain/eval-receipts/<slug>-<sha8>.json. The
sha-8 binds the receipt to the current SKILL.md content, so re-running after
edits writes a new receipt.
Default models (override per slot via --slot-a-model, --slot-b-model,
--slot-c-model):
| Slot | Default | Provider |
|---|---|---|
| A | openai:gpt-4o | OpenAI |
| B | anthropic:claude-opus-4-7 | Anthropic |
| C | google:gemini-1.5-pro | Google |

These MUST be frontier models from DIFFERENT providers. Using a single provider's family or budget models defeats the purpose — different families have less correlated blind spots. Refresh the list when a new model generation ships.
Pass criteria (BOTH must be true):
- Every dimension's mean across successful models ≥ 7.
- No single model scored any dimension < 5 (the floor).
Inconclusive: fewer than 2 of 3 models returned parseable scores. Receipt is still written (forensics) but the gate is not authoritative. Exit code 2; CI wrappers should treat this as "did not run cleanly", not "failed quality gate".
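
The pass, fail, and inconclusive rules above can be sketched as a small function. This is an illustration only: the `ModelResult` shape and the `gateVerdict` name are hypothetical, not the receipt schema or the command's real internals, and the sketch assumes every successful model reports the same dimensions.

```typescript
// Hypothetical shapes: the real receipt schema is not documented here.
type DimensionScores = Record<string, number>;

interface ModelResult {
  model: string;
  // null means the model's response could not be parsed into scores.
  scores: DimensionScores | null;
}

type Verdict = "pass" | "fail" | "inconclusive";

const MEAN_THRESHOLD = 7; // every dimension's mean must be >= 7
const FLOOR = 5; // no single model may score any dimension below 5
const MIN_SUCCESSFUL = 2; // fewer than 2 parseable results is inconclusive

function gateVerdict(results: ModelResult[]): Verdict {
  // Keep only the models that returned parseable scores.
  const successful: DimensionScores[] = [];
  for (const r of results) {
    if (r.scores !== null) successful.push(r.scores);
  }
  if (successful.length < MIN_SUCCESSFUL) return "inconclusive";

  // Assumption: all successful models scored the same set of dimensions.
  const dimensions = Object.keys(successful[0]);
  for (const dim of dimensions) {
    let sum = 0;
    for (const scores of successful) {
      const value = scores[dim];
      if (value < FLOOR) return "fail"; // floor violation
      sum += value;
    }
    if (sum / successful.length < MEAN_THRESHOLD) return "fail"; // mean too low
  }
  return "pass";
}
```

Fed the cycle 1 scores from the PR-summary example in this document (goal 6/7/6, depth 5/6/5, specificity 4/5/5), the sketch returns "fail", because the goal mean is already below 7.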
CYCLE 1:
Eval → scores + top 10 improvements
IF pass: → done, write tests
ELSE:
Apply top 10 improvements to the actual file
Log: which improvements applied, what changed
CYCLE 2:
Re-eval the FIXED output (same 3 models, same dimensions)
Compare: before/after scores per dimension (track delta)
IF pass: → done, write tests
ELSE: apply remaining improvements + new ones
CYCLE 3 (final):
Re-eval
IF pass: → ship
ELSE: → ship with KNOWN_GAPS section listing:
- Which dimensions are still below 7
- Which improvements couldn't be resolved
- Why (e.g., "would require architectural change")
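
The cycle structure above amounts to a bounded eval-and-fix loop. A minimal sketch follows, with the evaluation and the fix step injected as callbacks. `EvalResult`, `CycleOutcome`, and `runCycles` are hypothetical names for illustration and do not describe how `gbrain eval cross-modal` is implemented.

```typescript
// Hypothetical types and callbacks, for illustration only.
interface EvalResult {
  pass: boolean;
  failingDimensions: string[];
  improvements: string[]; // top suggestions, highest priority first
}

interface CycleOutcome {
  cyclesRun: number;
  passed: boolean;
  knownGaps: string[]; // dimensions still below the bar after the last cycle
}

function runCycles(
  evaluate: () => EvalResult,
  applyImprovements: (improvements: string[]) => void,
  maxCycles: number = 3,
): CycleOutcome {
  let last: EvalResult | null = null;
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    last = evaluate();
    if (last.pass) {
      // Quality is proven: stop iterating and move on to writing tests.
      return { cyclesRun: cycle, passed: true, knownGaps: [] };
    }
    // No fix pass after the final cycle: ship with KNOWN_GAPS instead.
    if (cycle < maxCycles) applyImprovements(last.improvements);
  }
  return {
    cyclesRun: maxCycles,
    passed: false,
    knownGaps: last === null ? [] : last.failingDimensions,
  };
}
```

The returned `knownGaps` list corresponds to the first item of the KNOWN_GAPS section; the unresolved improvements and the reasons still have to be written by whoever ran the loop.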
- Default `--cycles 3` in TTY, `--cycles 1` in non-TTY (limits scripted bulk spend in CI loops).
- The command prints an estimated max-cost-per-cycle from a small pricing constant before each run. Real cost varies with prompt size; treat the estimate as a ceiling for default `--max-tokens 4000`.
- A `--budget-usd N` hard cap is a v0.27.x follow-up TODO.

Models resolve through the gbrain AI gateway. Configure once with:
gbrain providers test # see what's configured
gbrain config # set keys

Or set env vars: OPENAI_API_KEY, ANTHROPIC_API_KEY,
GOOGLE_GENERATIVE_AI_API_KEY, TOGETHER_API_KEY, etc. The gateway reads
from ~/.gbrain/config.json plus process.env.
3 cycles × 3 models = 9 frontier calls max per run. With Opus-class +
GPT-4o-class + Gemini-1.5-Pro, expect $1–3 per full run on default
--max-tokens 4000. Receipts include the per-call model identifiers so
you can audit retroactively.
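
The worst-case call count follows directly from the cycle default and the three model slots. As a small sketch (function names are illustrative, not part of the CLI):

```typescript
// Defaults taken from the text above; names are illustrative only.
const MODELS_PER_CYCLE = 3;

// 3 cycles in an interactive terminal, 1 when scripted (non-TTY).
function defaultCycles(isTTY: boolean): number {
  return isTTY ? 3 : 1;
}

// Upper bound on frontier-model calls for one run.
function maxFrontierCalls(cycles: number, models: number = MODELS_PER_CYCLE): number {
  return cycles * models;
}
```

An interactive run therefore tops out at 9 calls, while a default non-TTY run in CI tops out at 3.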
Skip or shorten the eval loop when:
- Output is < 200 tokens (trivial, not worth 9 API calls).
- The skill is a thin wrapper around a single API call (one cycle is enough).

NOW that eval has proven quality, write tests that lock it in:
Unit tests: every branch of deterministic logic. Mock external calls.

Integration tests: hit real endpoints. Catch bugs mocks hide.

LLM evals: quality/correctness for LLM steps. Lighter than cross-modal eval; test specific behaviors.

- Add to skills/RESOLVER.md with trigger phrases users ACTUALLY type
- Resolver eval: feed triggers, assert correct routing
- Check-resolvable:
  - Skill reachable from skills/RESOLVER.md (not orphaned)
  - No MECE overlap with other skills
  - No DRY violations (shared logic in lib/, not copy-pasted)
  - No ambiguous trigger routing
- E2E smoke: full pipeline from trigger to side effect
- Brain filing: add to brain/RESOLVER.md if the skill writes brain pages
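
A resolver eval can be as simple as feeding each trigger back through the router and asserting that it lands on exactly one skill, its owner. The sketch below uses a hypothetical in-memory table and naive substring matching; the real format of skills/RESOLVER.md and the real matching rules are not specified here.

```typescript
// Hypothetical in-memory stand-in for skills/RESOLVER.md.
interface ResolverEntry {
  skill: string;
  triggers: string[];
}

function normalize(phrase: string): string {
  return phrase.trim().toLowerCase();
}

// Return every skill that has a trigger contained in the user's phrase.
function route(table: ResolverEntry[], phrase: string): string[] {
  const needle = normalize(phrase);
  const matches: string[] = [];
  for (const entry of table) {
    for (const trigger of entry.triggers) {
      if (needle.indexOf(normalize(trigger)) !== -1) {
        matches.push(entry.skill);
        break; // one match per skill is enough
      }
    }
  }
  return matches;
}

// A trigger passes only when it routes to exactly one skill: its owner.
// Returns a list of human-readable failures; empty means the resolver is clean.
function resolverEval(table: ResolverEntry[]): string[] {
  const failures: string[] = [];
  for (const entry of table) {
    for (const trigger of entry.triggers) {
      const matched = route(table, trigger);
      if (matched.length !== 1 || matched[0] !== entry.skill) {
        failures.push(`"${trigger}" -> [${matched.join(", ")}], expected ${entry.skill}`);
      }
    }
  }
  return failures;
}
```

Under this matching rule, a second skill whose trigger is a prefix of another skill's trigger shows up as a failure, which is the ambiguous-routing case the check-resolvable audit is meant to catch.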
bun test test/<skill>.test.ts # unit tests
gbrain skillify check skills/<slug>/scripts/<slug>.mjs --json | \
jq '.[] | .items[] | select(.name | contains("Cross-modal"))'
ls ~/.gbrain/.gbrain/eval-receipts/ # receipt landed
gbrain check-resolvable --json | jq .ok # resolver clean

Phase 0: Yes (invoked weekly, 50+ lines, clear trigger "summarize this PR")
Phase 1: Audit → SKILL.md missing, no tests, no resolver entry. Score: 1/11
Phase 2: Write SKILL.md + extract script to scripts/summarize-pr.ts
Phase 3: Cross-modal eval cycle 1 →
GPT-4o: goal=6, depth=5, specificity=4 → "misses file-level diffs"
Opus 4.7: goal=7, depth=6, specificity=5 → "no test plan in summary"
Gemini 1.5 Pro: goal=6, depth=5, specificity=5 → "template feels generic"
Aggregate: goal=6.3 FAIL, depth=5.3 FAIL
Top improvements: add file-level changes, include test plan, use PR context
→ Apply fixes → Cycle 2: goal=8, depth=7.5, specificity=7 → PASS
Phase 4: Write 12 unit tests locking in the improved behavior
Phase 5: Add "summarize this PR" trigger to skills/RESOLVER.md
Phase 6: E2E test: feed a real PR URL → verify brain page created
Phase 7: All green. Score: 11/11
NOT properly skilled until:
- All required items pass (1-2, 4-10; 11 only when applicable).
- Cross-modal eval (item 3) has a current receipt OR is explicitly waived with rationale (item 3 is informational; not blocking, but a missing receipt is visible in the audit).
- All tests pass (unit + integration + LLM evals).
- Resolver entry exists with real trigger phrases.
- Check-resolvable shows no orphans, overlaps, or DRY violations.
- Brain filing if applicable.
Skillify produces three durable artifacts per skill:
- The skill tree on disk. `skills/<slug>/SKILL.md`, `scripts/<slug>.mjs`, `routing-eval.jsonl`, plus a `test/<slug>.test.ts` skeleton. Generated by `gbrain skillify scaffold <name>` and refined by the human/agent into a real implementation.
- A cross-modal eval receipt at `~/.gbrain/.gbrain/eval-receipts/<slug>-<sha8>.json`. The sha-8 binds the receipt to the current `SKILL.md` content. `gbrain skillify check` surfaces the status (found/stale/missing) as informational.
- An audit verdict from `gbrain skillify check`: `properly skilled` | `close — create: <missing items>` | `needs skillify — run /skillify on <target>`. Score is `<passed>/<total>`. Required items gate the verdict; item 3 (cross-modal eval) is informational and never blocks PASS.

JSON output (gbrain skillify check --json) includes the same fields plus
the per-item detail string, so agents can route on the structured envelope
without parsing prose.
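
For agents that consume the JSON in TypeScript rather than through jq, the filter from the verification commands can be mirrored as below. Only the top-level array, `items`, and `name` are taken from that jq filter; the `detail` field name and the interface names are assumptions, so check real output before relying on them.

```typescript
// Field names other than `items` and `name` are assumptions for illustration.
interface CheckItem {
  name: string;
  detail?: string; // assumed name for the per-item detail string
}

interface CheckResult {
  items: CheckItem[];
}

// Rough equivalent of: jq '.[] | .items[] | select(.name | contains(NEEDLE))'
function selectItems(rawJson: string, needle: string): CheckItem[] {
  const results = JSON.parse(rawJson) as CheckResult[];
  const selected: CheckItem[] = [];
  for (const result of results) {
    for (const item of result.items) {
      if (item.name.indexOf(needle) !== -1) selected.push(item);
    }
  }
  return selected;
}
```

Calling `selectItems(output, "Cross-modal")` on captured `gbrain skillify check --json` output would return the cross-modal eval item, if the envelope matches the assumed shape.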
- ❌ Writing tests before cross-modal eval (locks in mediocrity)
- ❌ Using budget models for eval (C student grading A student)
- ❌ Using a single provider's family for all 3 slots (correlated blind spots)
- ❌ Skipping eval "because the output looks fine" (your judgment isn't 3 models)
- ❌ Eval without fix cycle (vanity metrics)
- ❌ Code with no SKILL.md (invisible to resolver)
- ❌ Tests that reimplement production code (masks real bugs)
- ❌ Resolver entry with internal jargon (must mirror real user language)
- ❌ Two skills doing the same thing (merge or kill one)
- ❌ Running cross-modal eval on trivial outputs (< 200 tokens, not worth 9 API calls)