| 1 | +# Session Report: Building hinton-problems via Agent Teams |
| 2 | + |
| 3 | +**Session ID:** `d8af4bb0-1435-4528-a5da-ac91c30b7bcb` |
| 4 | +**Project:** SutroYaro (the lead session was checked out there) |
| 5 | +**Output:** [cybertronai/hinton-problems](https://github.com/cybertronai/hinton-problems) — 53 stubs, all merged |
| 6 | +**Span:** 2026-05-01 21:52 → 2026-05-04 03:35 (~30 wall hours, with overnight idle gaps) |
| 7 | +**Source:** the full jsonl is at `~/.claude/projects/-Users-yadkonrad-dev-dev-year26-feb26-SutroYaro/d8af4bb0-...jsonl` (5.1 MB, 3,033 events) |
| 8 | + |
| 9 | +This report summarizes what the session log actually shows, in a form suitable for a team video. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## TL;DR for the video opener |
| 14 | + |
| 15 | +- 53 Hinton-paper stubs implemented in 30 wall hours, ~800k tokens, on Claude Opus 4.7 with the 1M-token context window. |
| 16 | +- The **SPEC was a single GitHub issue** ([#1](https://github.com/cybertronai/hinton-problems/issues/1)). |
| 17 | +- The **dispatcher was Claude Code's `agent-teams` primitive**: one team, eleven waves (0 through 10), fresh teammates per wave. |
| 18 | +- **One human prompt ("use parallel team of agents… DONT USE THE SKILL CRAP!")** turned a serial workflow into an 11-wave parallel build. |
| 19 | +- All work routed through GitHub: 18 issues, 15 PRs, audits via subagents, merges only on user approval. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## The actual chain of events |
| 24 | + |
| 25 | +| Time (UTC) | Event | |
| 26 | +|---|---| |
| 27 | +| 05-01 21:52 | Session opens in SutroYaro | |
| 28 | +| 05-01 21:53 | Yad invokes the `sutro-sync` skill — the only skill call in the session — to pull Telegram chat, Google Docs, and GitHub context | |
| 29 | +| 05-01 21:59 | Yad: *"lets focus on hinton, pull it into may26, ok, shall we pull hinto-porblems, make a branch and then try doing SPEC, branch, and then github issues / what do we think?"* — the SPEC-first idea is born | |
| 30 | +| 05-01 22:04 | Yad: *"Don't merge anything, we need to pull it, we need to open up a GitHub issue and create the spec as a GitHub issue saying that's what you will follow"* — SPEC = issue | |
| 31 | +| 05-01 22:13 | **Issue #1 opened**: *Spec: minimum implementation requirements for stub problems (v1)*. Authored by `agent-0bserver07 (Claude Code) on behalf of Yad`. Lists required files, 8-section README template, reproducibility rules, acceptance checklist. | |
| 32 | +| 05-02 05:21 | Yad: *"ok do u see Yaroslav's comment, can that help with pre-context for our waves of agents?"* — Yaroslav had commented on issue #1 overnight | |
| 33 | +| 05-02 05:48 | Yad: *"deploy all the waves one after another given Yaroslav's comment and our spec and the local repo of hintons problems, and do branches per waves"* | |
| 34 | +| 05-02 05:51 | Yad: ***"I need you to use parallel team of agents that claude code has built in, DONT USE THE SKILL CRAP! https://code.claude.com/docs/en/agent-teams"*** | |
| 35 | +| 05-02 05:51 | Lead dispatches a `claude-code-guide` subagent to read the agent-teams docs | |
| 36 | +| 05-02 05:53 | **`TeamCreate`** — team `hinton-impl` born. agent_type `orchestrator`. Description: *"Each teammate owns one stub, works in its own worktree at `/may26/hinton-problems-waves/`, pushes branch `impl/<slug>`, opens PR. Lead is the SutroYaro session; reviews PRs and merges only on user approval."* | |
| 37 | +| 05-02 05:55 → 06:07 | **Wave 0**: single-stub spike. `xor-builder` teammate spawned, builds, opens PR #3. Sanity check passes. | |
| 38 | +| 05-02 06:07 → 09:20 | **Wave 1**: 3 teammates (`n-bit-parity-builder`, `symmetry-builder`, `negation-builder`). All three open PRs. Then shut down via `SendMessage(shutdown_request)`. | |
| 39 | +| 05-02 09:21 → 13:46 | **Wave 2**: 5 teammates (binary-addition, encoder-3-parity, encoder-4-3-4, encoder-8-3-8, encoder-backprop-8-3-8). | |
| 40 | +| 05-02 13:49 → 15:37 | **Wave 3**: 6 teammates (encoder-40-10-40, shifter, grapheme-sememe, distributed-to-local-bottleneck, t-c-discrimination, recurrent-shift-register). | |
| 41 | +| 05-02 14:48 | Yad: *"why are there so many PRs? Weren't there supposed to be 5 waves?"* — turning point. From here, **multiple stubs per PR**, one PR per wave. | |
| 42 | +| 05-02 15:37 → 20:17 | **Waves 4 → 7**: 6 stubs each. PR titles read like a tour: *tier-B 1980s-90s classics*, *Helmholtz/MDL/Imax/fast-weights*, *TRBM/RTRBM/gated-RBM/RNN/factorial-VQ/eGLOM*, *MNIST cluster (FF + distillation + capsule precursor)*. | |
| 43 | +| 05-02 21:10 | **Wave 8**: 6 stubs (external-data + harder architectures). | |
| 44 | +| 05-03 04:20 | **Wave 9**: 5 stubs. 50/53 v1 done. | |
| 45 | +| 05-03 22:56 | **Wave 10**: final 3 stubs (AIR + matrix capsules). **v1 complete at 53/53.** | |
| 46 | +| 05-03 23:18 → 23:55 | Docs PRs: RESULTS.md, MkDocs site, **switch to mdBook (after Yad pushed back hard)**, 4-column catalog tables. | |
| 47 | +| 05-04 00:34 | Introduction page in mdBook style + Unlicense. | |
| 48 | +| 05-04 00:55 | **v1 gap analysis** issue opened — umbrella tracker for 25 partials + 1 non-replication. | |
| 49 | +| 05-04 01:08 | Yad: *"so whats left, coz we aending this sessions 800k"* — context budget watermark. | |
| 50 | +| 05-04 03:35 | Last event in the session log. | |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## The SPEC (issue #1) — the actual contract |
| 55 | + |
| 56 | +The contract between Yad and every teammate was a single GitHub issue. Not chat. Not a system prompt. An issue every PR linked back to. |
| 57 | + |
| 58 | +It defined: |
| 59 | +- **Required files** per stub: `<name>.py`, `README.md`, `make_<name>_gif.py`, `visualize_<name>.py`, `<name>.gif`, `viz/` |
| 60 | +- **8 README sections**: Header / Problem / Files / Running / Results / Visualizations / Deviations / Open questions |
| 61 | +- **Reproducibility rules**: seed exposed via CLI, all hyperparameters in Results, command in §Running reproduces the number |
| 62 | +- **Acceptance checklist** (8 checkboxes): reproduces under 5 min on a laptop / final accuracy with seed / GIF / weight viz / training curves / deviations section / open questions / no `NotImplementedError` |
| 63 | +- **Out of scope for v1**: energy metric (deferred to v2 ByteDMD), GPU / large-scale runs |
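The reproducibility rule "seed exposed via CLI" can be illustrated with a small sketch. It assumes an `argparse` entry point; the flag names `--seed` and `--epochs`, and the `run` function, are hypothetical and not taken from the spec or any stub.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the CLI flags of a hypothetical stub script."""
    parser = argparse.ArgumentParser(description="Hypothetical stub entry point")
    parser.add_argument("--seed", type=int, default=0, help="random seed for the run")
    parser.add_argument("--epochs", type=int, default=100, help="training epochs")
    return parser.parse_args(argv)


def run(seed, epochs):
    """Stand-in for training: returns values that depend only on the arguments."""
    rng = random.Random(seed)  # a local generator, so no global state leaks in
    return [rng.random() for _ in range(epochs)]
```

With this shape, the command written in a README's Running section can pin the seed, and two runs with the same seed give the same numbers.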
| 64 | + |
| 65 | +That's the entire contract. Every stub had to fit. |
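The required-files rule lends itself to a mechanical check. Below is a minimal sketch, assuming each stub lives in its own directory and that `<name>` is substituted verbatim into the file names. The function names are hypothetical and the session log does not show that such a script was used.

```python
from pathlib import Path


def required_paths(name: str) -> list[str]:
    """Paths the spec lists for a stub called `name`, relative to the stub directory."""
    return [
        f"{name}.py",
        "README.md",
        f"make_{name}_gif.py",
        f"visualize_{name}.py",
        f"{name}.gif",
        "viz",
    ]


def missing_paths(stub_dir: Path, name: str) -> list[str]:
    """Return the required paths that do not exist under `stub_dir`."""
    return [p for p in required_paths(name) if not (stub_dir / p).exists()]
```

An empty list from `missing_paths` would mean the file half of the checklist passes. The README sections and the 5-minute reproduction still need a human or a reviewer agent.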
| 66 | + |
| 67 | +--- |
| 68 | + |
| 69 | +## The orchestration model |
| 70 | + |
| 71 | +``` |
| 72 | + ┌──────────────────┐ |
| 73 | + │ hinton-impl team │ (TeamCreate, agent_type=orchestrator) |
| 74 | + └─────────┬────────┘ |
| 75 | + │ |
| 76 | + ┌────────────┼────────────┐ |
| 77 | + │ │ │ |
| 78 | + Wave 0/1/2… SendMessage Subagent dispatches |
| 79 | + │ │ |
| 80 | + ▼ ▼ |
| 81 | + ┌──────────┐ ┌──────────────┐ |
| 82 | + │ teammates │ │ Agent tool │ |
| 83 | + │ <slug>- │ │ (general- │ |
| 84 | + │ builder │ │ purpose, │ |
| 85 | + │ x53 │ │ Explore) │ |
| 86 | + └────┬─────┘ └──────┬───────┘ |
| 87 | + │ │ |
| 88 | + ▼ ▼ |
| 89 | + worktree branch PR audits, code reads |
| 90 | + impl/<slug> |
| 91 | + │ |
| 92 | + ▼ |
| 93 | + gh pr create |
| 94 | + │ |
| 95 | + ▼ |
| 96 | + PR review + merge (Yad approves) |
| 97 | + │ |
| 98 | + ▼ |
| 99 | + SendMessage(shutdown_request) |
| 100 | + │ |
| 101 | + ▼ |
| 102 | + Next wave starts fresh |
| 103 | +``` |
| 104 | + |
| 105 | +**Why fresh teammates per wave**: each teammate burns context as it builds and tests. Shutting down between waves keeps later waves running on full context windows. The lead persists; the workers turn over. |
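The turnover pattern can be sketched by analogy with a thread pool that is recreated for every wave. This is not the Claude Code API. `build_stub` is a hypothetical placeholder for a teammate's work, and the pool only stands in for the set of teammates.

```python
from concurrent.futures import ThreadPoolExecutor


def build_stub(slug: str) -> str:
    """Placeholder for a teammate's work; returns the branch name it would push."""
    return f"impl/{slug}"


def run_waves(waves: list[list[str]]) -> list[list[str]]:
    """Run each wave with a brand-new pool of workers, one wave after another."""
    results = []
    for stubs in waves:
        if not stubs:
            results.append([])
            continue
        # A new pool per wave stands in for fresh teammates. Leaving the
        # `with` block shuts the workers down before the next wave starts.
        with ThreadPoolExecutor(max_workers=len(stubs)) as pool:
            results.append(list(pool.map(build_stub, stubs)))
    return results
```

The caller of `run_waves` plays the lead's role: it outlives every pool and carries the results forward, while no worker state survives from one wave to the next.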
| 106 | + |
| 107 | +--- |
| 108 | + |
| 109 | +## What the session actually used |
| 110 | + |
| 111 | +### Tool calls (in the lead session) |
| 112 | + |
| 113 | +| Tool | Calls | What for | |
| 114 | +|---|---|---| |
| 115 | +| Bash | 191 | git, gh CLI, file ops, running tests | |
| 116 | +| Read | 124 | reading paper PDFs, stub code, READMEs | |
| 117 | +| Agent | **62** | subagent dispatches (see breakdown below) | |
| 118 | +| Write | 55 | new files (READMEs, scripts, configs) | |
| 119 | +| **SendMessage** | **53** | inter-teammate messaging (mostly wave shutdowns) | |
| 120 | +| TaskUpdate | 24 | shared task list maintenance | |
| 121 | +| TaskCreate | 22 | new tasks added to the team's list | |
| 122 | +| Edit | 10 | small in-place edits | |
| 123 | +| ToolSearch | 3 | loading deferred tool schemas | |
| 124 | +| WebFetch | 2 | external doc reads | |
| 125 | +| **Skill** | **1** | only `sutro-sync` at session start | |
| 126 | +| **TeamCreate** | **1** | the `hinton-impl` team itself | |
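A tally like the one above could be produced from the session jsonl with a short script. This is a minimal sketch, assuming each line is a JSON object whose `message.content` is a list of blocks, with tool invocations appearing as blocks that have `"type": "tool_use"` and a `name` field. That schema is an assumption and is not documented in this report.

```python
import json
from collections import Counter
from typing import Iterable


def count_tool_calls(lines: Iterable[str]) -> Counter:
    """Count tool_use blocks per tool name across JSONL event lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        message = event.get("message")
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content, e.g. a typed prompt
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "unknown")] += 1
    return counts
```

Passing an open file handle for the log as `lines` and calling `.most_common()` on the result would list tools by call count, in the same order as the table.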
| 127 | + |
| 128 | +### Subagent dispatches (Agent tool, n=62) |
| 129 | + |
| 130 | +| Type | Count | Use | |
| 131 | +|---|---|---| |
| 132 | +| `general-purpose` | 54 | per-stub builders ("Build xor stub for hinton-problems") | |
| 133 | +| `Explore` | 7 | PR audits, stub correctness checks, wave reviews | |
| 134 | +| `claude-code-guide` | 1 | researched the agent-teams docs just before `TeamCreate` (05-02 05:51) |
| 135 | + |
| 136 | +### GitHub artifacts produced |
| 137 | + |
| 138 | +- **18 issues created** (1 SPEC + 15 per-stub issues for early waves + 2 follow-up: v2 ByteDMD, v1 gap analysis) |
| 139 | +- **15 PRs created** (10 wave PRs + 5 docs PRs) |
| 140 | +- **6 PRs merged via `gh pr merge`** in-session (the rest were merged separately by Yad) |
| 141 | +- **24 git pushes** |
| 142 | + |
| 143 | +### Skills |
| 144 | + |
| 145 | +- **One skill call.** `sutro-sync`, used once at the very start to pull Telegram + Google Docs + GitHub context. |
| 146 | +- After that, **Yad explicitly told the lead to use `agent-teams` instead of skills**: *"DONT USE THE SKILL CRAP!"* |
| 147 | + |
| 148 | +That's the cleanest single data point about which mechanism is right for what: |
| 149 | +- Skill = a recipe for "do this set of steps once at start" |
| 150 | +- Agent team = parallel workers with shared task list |
| 151 | +- They serve different purposes; this session called each exactly once (one `Skill` call, one `TeamCreate`). |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## The waves at a glance |
| 156 | + |
| 157 | +| Wave | Stubs | Highlights | |
| 158 | +|---|---|---| |
| 159 | +| 0 | 1 | xor (sanity check, single teammate) | |
| 160 | +| 1 | 3 | n-bit-parity, symmetry, negation | |
| 161 | +| 2 | 5 | binary-addition + encoder family | |
| 162 | +| 3 | 6 | tier-B 1980s foundational (shifter, grapheme-sememe, etc.) | |
| 163 | +| 4 | 6 | tier-B 1980s-90s classics | |
| 164 | +| 5 | 6 | Helmholtz / MDL / Imax / fast-weights | |
| 165 | +| 6 | 6 | TRBM / RTRBM / gated-RBM / RNN / factorial-VQ / eGLOM | |
| 166 | +| 7 | 6 | MNIST cluster: FF + distillation + capsule precursor | |
| 167 | +| 8 | 6 | external-data + harder architectures | |
| 168 | +| 9 | 5 | final hard stubs (50/53 v1 done) | |
| 169 | +| 10 | 3 | AIR + matrix capsules — **v1 complete at 53/53** | |
| 170 | + |
| 171 | +Total: **53 stubs in 11 waves**. |
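The total can be checked against the per-wave counts in the table. A small sketch follows; the dictionary copies the "Stubs" column, and the variable and function names are illustrative.

```python
# Stubs per wave, copied from the table above (wave number -> stub count).
WAVE_SIZES = {0: 1, 1: 3, 2: 5, 3: 6, 4: 6, 5: 6, 6: 6, 7: 6, 8: 6, 9: 5, 10: 3}


def cumulative_totals(sizes: dict[int, int]) -> dict[int, int]:
    """Running total of stubs completed after each wave, in wave order."""
    total = 0
    running = {}
    for wave in sorted(sizes):
        total += sizes[wave]
        running[wave] = total
    return running
```

The running total reaches 50 after wave 9 and 53 after wave 10, matching the "50/53" and "53/53" milestones in the timeline.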
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## Yad's interaction pattern (the human side) |
| 176 | + |
| 177 | +70 typed prompts across 30 wall hours. Most of them were one of three types: |
| 178 | + |
| 179 | +**Type A — high-leverage direction (rare, big effects):** |
| 180 | +- *"shall we pull hinton-problems, make a branch and then try doing SPEC, branch, and then github issues"* — chose the SPEC-as-issue model |
| 181 | +- *"deploy multiple parallel workstreams to get things done under our supervision, by doing waves"* — chose the wave model |
| 182 | +- *"I need you to use parallel team of agents that claude code has built in, DONT USE THE SKILL CRAP!"* — chose agent-teams |
| 183 | +- *"why are there so many PRs? Weren't there supposed to be 5 waves?"* — collapsed per-stub PRs into per-wave PRs |
| 184 | +- *"why didnt u use mdbook like i asked?"* — pushed back when the lead drifted (MkDocs got swapped for mdBook) |
| 185 | + |
| 186 | +**Type B — status checks (frequent, low cost):** |
| 187 | +- *"status?"* / *"status, what is left?"* / *"whats up"* — appears ~10 times. The lead summarises and continues. |
| 188 | + |
| 189 | +**Type C — review and merge approvals:** |
| 190 | +- *"i dont wanna merge yet, lets do audits left and then finish the implementation other waves right?"* |
| 191 | +- *"ok wana comment on the PR that are ready to merge with evidence?"* |
| 192 | +- *"Should we get up an issue with the context of the partials and the no?"* → led to the v1 gap analysis issue |
| 193 | + |
| 194 | +The session also has frustrated moments. They are part of an honest report: when the lead drifted on docs tooling, Yad swore at it, and the lead course-corrected within minutes. Worth showing in a team video as the realistic version of "human in the loop." |
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +## What this session actually proves |
| 199 | + |
| 200 | +1. **A SPEC issue is enough of a contract.** No instruction file, no role prompt: just a versioned GitHub issue every PR points at. The acceptance checklist becomes the PR review template. |
| 201 | +2. **Waves with fresh teammates beat one long-running team.** The lead persists; workers turn over per wave. This is what kept the run inside 800k tokens. |
| 202 | +3. **`agent-teams` is the dispatcher; subagents are the workers.** This session used both: `TeamCreate` to spin up the team, then `Agent(subagent_type=general-purpose)` 54 times to actually build the stubs. Each teammate's work happened inside its subagent. |
| 203 | +4. **Audits via separate Explore subagents.** 7 of the 62 Agent dispatches were reviewers, not builders. Keeps review context separate from build context. |
| 204 | +5. **GitHub is the substrate.** Issues created the work, PRs delivered it, comments coordinated with Yaroslav, merges gated on Yad. No Slack. No call. |
| 205 | +6. **One human per session is enough.** 70 prompts, of which 5–6 set direction. The rest were status checks, merge approvals, and follow-ups that let the lead run. |
| 206 | + |
| 207 | +--- |
| 208 | + |
| 209 | +## Concrete numbers you can quote in the video |
| 210 | + |
| 211 | +- **53 / 53** Hinton-paper stubs implemented |
| 212 | +- **27 reproduce** paper claims, **25 partial** (gap documented), **1 honest non-replication** |
| 213 | +- **~30 wall hours**, with overnight idle gaps |
| 214 | +- **~800,000 tokens** on Claude Opus 4.7 (1M context) |
| 215 | +- **1 GitHub issue** as the SPEC |
| 216 | +- **1 `TeamCreate`**, **53 named teammates**, **11 waves** |
| 217 | +- **18 issues + 15 PRs** filed |
| 218 | +- **62 subagent dispatches** (54 builders + 7 auditors + 1 docs-research) |
| 219 | +- **191 bash, 124 reads, 55 writes, 53 SendMessages, 1 Skill** |
| 220 | +- **70 human prompts** total; ~6 set direction, ~10 were status checks, ~10 were merge approvals, the rest were follow-ups |
| 221 | +- **6 PRs merged in-session via `gh pr merge`**, 24 pushes |
| 222 | + |
| 223 | +--- |
| 224 | + |
| 225 | +## Suggested video shot list |
| 226 | + |
| 227 | +1. **Open on the SPEC issue** ([#1](https://github.com/cybertronai/hinton-problems/issues/1)) on screen. *"This is the entire contract."* |
| 228 | +2. **Cut to the GitHub PRs page** filtered to "wave" — show the 10 wave PRs. *"This is what came out of it."* |
| 229 | +3. **Show the agent-teams docs page** ([code.claude.com/docs/en/agent-teams](https://code.claude.com/docs/en/agent-teams)). *"This is the primitive that made parallel cheap."* |
| 230 | +4. **Show the `TeamCreate` call** (summarized in the 05-02 05:53 row of the timeline in this report). *"One call. One team."* |
| 231 | +5. **Walk through one wave** — pick wave 5 (Helmholtz/MDL/Imax/fast-weights, 6 stubs). Show the 6 teammate names, the 6 PRs, the merged commit. |
| 232 | +6. **Show a single per-stub README** (e.g., `encoder-4-2-4`) — show how it satisfies all 8 spec sections. |
| 233 | +7. **Show the v1 gap analysis issue** at the end. *"v1 = correctness. v1.5 = paper parity. v2 = energy."* |
| 234 | +8. **Close on the bottom-line numbers** (53 / 30 hr / 800k / 1 spec / 11 waves). |
| 235 | + |
| 236 | +--- |
| 237 | + |
| 238 | +*Generated from the live session log on 2026-05-04. Throwaway artifact — delete after the video is recorded.* |