Research workspace for the Sutro Group -- energy-efficient AI training, meeting weekly at South Park Commons (SF).
Docs site: https://cybertronai.github.io/SutroYaro/ License: Unlicense (Public Domain)
SutroYaro is the lab's memory and dispatcher: where you go to find out what's been tried and who's working on what right now, and where the agent-team patterns + sync infra live for spinning up new builds against the org's other repos.
Concretely it does five things:
- Lab memory across sessions. `DISCOVERIES.md` is the curated "what's proven" record. New agents and contributors read it before they spend tokens re-running settled experiments. Same for `docs/tasks/` (queue) and `docs/catchups/` (weekly snapshots).
- Agent-teams dispatcher. The hinton-problems and schmidhuber-problems builds (53+58 stubs, ~71 wall hours, ~1.8B tokens combined) were both dispatched from SutroYaro sessions. The reusable machinery — `.claude/skills/`, hooks, the agent-teams patterns, the JSONL token-counting methodology — lives here.
- Cross-repo index. The 8-repo cybertronai org map (`docs/related-repos.md`, `docs/research/active-threads-*`) only really makes sense from SutroYaro's vantage point.
- Public face for the Sutro Group. The mkdocs site at https://cybertronai.github.io/SutroYaro/ is what gets shown at meetings.
- Telegram + Google Docs + GitHub integration. `bin/tg-sync`, `src/sync_google_docs.py`, the daily catchups. No per-experiment repo would naturally own this.
What this repo is not (anymore):
- Not the active research front. Yaroslav's edge work moved to `ByteDMD/experiments/grid` and `simplified-dally-model`.
- Not the home for benchmark problems. Those have dedicated repos: `sparse-parity-challenge`, `hinton-problems`, `schmidhuber-problems`, `sutro-problems`.
- Not the metric definition. `ByteDMD` and `simplified-dally-model` own that.
See docs/related-repos.md for the full conceptual map. A reshuffle to align the codebase with this scoped role is tracked separately.
Clone the repo and open it with your coding agent. The workspace is structured so the agent can navigate it, run experiments, and report findings.
```bash
git clone https://github.com/cybertronai/SutroYaro.git
cd SutroYaro
# With Claude Code (recommended)
claude --dangerously-skip-permissions
# With Gemini CLI
gemini --yolo
# With Codex CLI
codex --full-auto
```
`--dangerously-skip-permissions` (and `--yolo`, `--full-auto`). These flags let the agent run Python, edit files, and call tools without asking for approval at each step. The workspace is designed for this mode because experiments require running code in tight loops. The agent still cannot push to main, cannot delete results, and cannot modify locked measurement files (see LAB.md rule #9). If you prefer manual review, omit the flag and approve each action. First-time users running on a shared or production machine should start without the flag.
For a step-by-step walkthrough — cloning, first experiment, running the harness, submitting a PR — see Getting Started.
Then ask the agent anything: "What is this about?", "How do I run experiments?", "What are the latest findings?"
The agent reads CLAUDE.md (or equivalent) to understand the workspace, syncs Telegram and Google Docs for context, and can run experiments autonomously. See the one-hour walkthrough video for a demo.
```bash
# Solve 20-bit sparse parity in 0.12s (SGD)
PYTHONPATH=src python3 -m sparse_parity.fast
# Solve it in 509 microseconds (GF(2))
PYTHONPATH=src python3 src/sparse_parity/experiments/exp_gf2.py
# Run the locked evaluation harness
PYTHONPATH=src python3 src/harness.py --method gf2 --n_bits 20 --k_sparse 3
# Run the eval environment (Gymnasium)
PYTHONPATH=src python3 src/sparse_parity/eval/run_eval.py
```

A structured workspace where multiple people use different AI tools (Claude Code, Gemini CLI, Codex, Antigravity) to run experiments on shared challenges. A locked evaluation harness ensures comparable results. Findings accumulate in DISCOVERIES.md.
Current challenge: sparse parity (learn XOR/parity from random {-1,+1} inputs), plus sparse sum and sparse AND.
36 experiments across 3 challenges. Energy metric: DMD (Data Movement Distance, Ding et al. arXiv:2312.14441), auto-measured via TrackedArray.
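How TrackedArray takes its measurements isn't shown in this README, but the `lru_tracker.py` / LRUStackTracker naming points at the standard reuse-distance (LRU stack distance) formulation: each repeated access to an address costs the number of distinct addresses touched since that address was last used. A minimal illustrative sketch of that bookkeeping (not the repo's tracker; the function name is ours):

```python
# Illustrative reuse-distance (LRU stack distance) counter: the quantity an
# LRU-stack tracker accumulates. Not the repo's TrackedArray/LRUStackTracker.
from collections import OrderedDict

def reuse_distances(accesses):
    """For each access, count distinct addresses touched since its last use."""
    stack = OrderedDict()                    # most-recently-used address last
    out = []
    for addr in accesses:
        if addr in stack:
            keys = list(stack)
            out.append(len(keys) - keys.index(addr) - 1)
            del stack[addr]
        else:
            out.append(float("inf"))         # first touch: cold miss
        stack[addr] = True                   # move addr to MRU position
    return out

print(reuse_distances(["a", "b", "c", "a"]))  # [inf, inf, inf, 2]
```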
| Method | Time | DMD | What it proves |
|---|---|---|---|
| KM-min (1 sample) | 0.050s | 3,087 | DMD leader. Parity influence is deterministic. |
| SMT Backtracking | 0.002s | 19,532 | Constraint satisfaction touches small slices. |
| KM Influence (5 samples) | 0.001s | 27,165 | O(n) not O(C(n,k)). Flip each bit, measure label change. |
| GF(2) Gaussian Elimination | 0.009s | 153,745 | Parity is linear over GF(2). |
| SGD (baseline) | 0.089s | est. | The neural net solves it, just the hard way. |
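For intuition on the KM-influence row: flipping an irrelevant bit never changes a parity label, while flipping a relevant one always does, so n targeted flips identify the k relevant bits without enumerating C(n,k) subsets. A minimal sketch, assuming query access to the label function (an assumption; how the repo's 5-sample variant obtains its labels isn't shown here):

```python
# Bit-flip influence sketch: only the k relevant bits flip the parity label.
# Setup and names are illustrative, not the repo's KM implementation.
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3
secret = rng.choice(n, size=k, replace=False)      # hidden relevant bits

def label(x):                                      # parity over {-1,+1} inputs
    return int(np.prod(x[secret]))

x = rng.choice([-1, 1], size=n)
y = label(x)
relevant = [i for i in range(n)
            if label(np.where(np.arange(n) == i, -x, x)) != y]  # flip bit i
print(sorted(relevant) == sorted(secret.tolist()))  # True
```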
All 4 local learning rules (Hebbian, Predictive Coding, Equilibrium Propagation, Target Propagation) failed at chance level. Parity requires k-th order interaction detection: no single bit, and no lower-order combination of bits, correlates with the label, so purely local update signals carry no usable information.
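In contrast, the GF(2) row exploits that structure directly: mapping inputs x in {-1,+1} to bits b = (1-x)/2 turns the parity label into an XOR of the k relevant bits, i.e. an unknown 0/1 vector a with B a = y (mod 2) over the sample matrix B, which roughly n random samples pin down. A minimal Gaussian-elimination sketch (illustrative only, not the repo's exp_gf2.py):

```python
# GF(2) sketch: recover the relevant-bit indicator a from samples by solving
# B @ a = y (mod 2). Illustrative, not the repo's exp_gf2.py.
import numpy as np

def solve_gf2(B, y):
    """Gaussian elimination mod 2: return a with B @ a = y (mod 2)."""
    B, y = B % 2, y % 2                                # work on copies
    m, n = B.shape
    pivots, row = [], 0
    for col in range(n):
        hit = next((r for r in range(row, m) if B[r, col]), None)
        if hit is None:
            continue                                   # free column
        B[[row, hit]], y[[row, hit]] = B[[hit, row]], y[[hit, row]]
        for r in range(m):
            if r != row and B[r, col]:
                B[r] ^= B[row]                         # eliminate above & below
                y[r] ^= y[row]
        pivots.append(col)
        row += 1
    a = np.zeros(n, dtype=int)
    a[pivots] = y[:row]                                # free variables stay 0
    return a

rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(40, 20))                  # 40 samples, 20 bits
secret = np.zeros(20, dtype=int)
secret[[2, 7, 11]] = 1                                 # k = 3 relevant bits
print(np.array_equal(solve_gf2(B, B @ secret % 2), secret))  # True w.h.p.
```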
Gymnasium-compatible environment that tests whether an AI agent can do energy-efficient ML research. The agent picks methods, observes energy metrics, gets graded on research quality.
```bash
PYTHONPATH=src python3 -c "
import gymnasium as gym; import sparse_parity.eval
env = gym.make('SutroYaro/SparseParity-v0', metric='dmc', budget=10)
obs, _ = env.reset()
obs, r, _, _, info = env.step(5) # try GF(2)
print(f'{info[\"method\"]}: DMC={info[\"dmc\"]}, reward={r:.2f}')
"- 3 challenges, 16 methods, 72-point discovery grading rubric (12 categories)
- Registry-based: add challenges and methods without editing env code (see the sketch after this list)
- Compute backends: local (default), Modal (GPU), remote HTTP
- Adapters: Anthropic tool-use, UK AISI Inspect
- See AGENT_EVAL.md for the full guide
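registry.py's actual API isn't documented here; as a purely hypothetical sketch of the pattern (every name below is an assumption), methods self-register so the env dispatches by lookup instead of hard-coded branches:

```python
# Hypothetical registry pattern (names are assumptions, not registry.py's API):
# registering a new method never requires touching the env's own code.
METHODS = {}

def register_method(name):
    def wrap(fn):
        METHODS[name] = fn
        return fn
    return wrap

@register_method("gf2")
def solve_gf2(X, y):
    ...  # Gaussian elimination over GF(2)

@register_method("km_influence")
def solve_km(X, y):
    ...  # bit-flip influence probing

# The env can then expose the registry's keys as its action space:
action_names = sorted(METHODS)
```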
| Date | Who | Title | Link |
|---|---|---|---|
| 2026-03-22 | Yad | Weekly catch-up, DMC experiments, parallel agents | YouTube |
| 2026-03-16 | Yaroslav | Meeting #9: roadmap, GF(2) verification, DMC intro | YouTube |
Transcripts and chapters: https://cybertronai.github.io/SutroYaro/sessions/
The agent syncs Telegram, Google Docs, and GitHub into a weekly summary with action items.
Latest: https://cybertronai.github.io/SutroYaro/catchups/2026-03-22/
```bash
# Single cycle with Claude Code
bin/run-agent --tool claude --max 10
# Single cycle with Gemini CLI
bin/run-agent --tool gemini --max 10
# Overnight: 10 cycles, 5 experiments each
bin/run-agent --loop 10 --max 5 --tool claude
# Any CLI via env var
AI_CMD="my-ai-tool -p" bin/run-agent --tool custom --max 5
```

Each cycle: fresh AI context, reads accumulated file state, runs experiments, logs results. If a cycle crashes, the next picks up from the files.
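The same pattern in miniature, for orientation (illustrative Python, not bin/run-agent's actual logic; the prompt text is an assumption):

```python
# File-state loop sketch: each cycle launches a fresh agent process, and all
# durable state lives in files (DISCOVERIES.md, research/log.jsonl, TODO.md),
# so a crashed cycle loses nothing the next cycle can't re-read.
import subprocess

CYCLES, MAX_PER_CYCLE = 10, 5
for cycle in range(CYCLES):
    proc = subprocess.run(
        ["claude", "--dangerously-skip-permissions", "-p",
         f"Follow AGENT.md; run at most {MAX_PER_CYCLE} experiments, then stop."]
    )
    # A nonzero exit (crash, timeout) is fine: no in-memory state is carried
    # over, so the next cycle resumes from whatever the files already record.
    print(f"cycle {cycle}: exit {proc.returncode}")
```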
```
AGENT.md              # What the AI agent follows (the loop)
AGENT_EVAL.md         # Guide for running the eval environment
LAB.md                # Human experiment protocol
DISCOVERIES.md        # Accumulated knowledge (91 proven facts)
TODO.md               # Hypothesis queue
src/
  harness.py          # Locked evaluation harness (5 methods, CLI)
  sparse_parity/
    fast.py           # Numpy solver (0.12s, optional tracker)
    tracker.py        # Legacy MemTracker
    lru_tracker.py    # DMD measurement (LRUStackTracker)
    tracked_numpy.py  # Auto-instrumented numpy wrapper
    cache_tracker.py  # Cache-aware energy model
    experiments/      # All 36 experiment scripts
    eval/             # Gymnasium RL environment
      env.py          # SutroYaro/SparseParity-v0
      baselines.py    # Random, Greedy, Oracle agents
      grader.py       # 72-point discovery grading (12 categories)
      registry.py     # Extensible challenge/method registry
      backends.py     # Local, Modal, Remote compute
      answer_key.json # 36 experiments as ground truth
      adapters/       # Anthropic, Inspect platform adapters
research/
  search_space.yaml   # Bounded mutation space per challenge
  questions.yaml      # Dependency graph of open questions
  log.jsonl           # Machine-readable experiment log
results/
  scoreboard.tsv      # Leaderboard with DMD column
  plots/              # DMD visualizations
  eval/               # Baseline evaluation results
docs/
  catchups/           # Weekly catch-up summaries
  sessions/           # Recorded session transcripts and chapters
  findings/           # One markdown report per experiment
  research/           # Survey, eval docs, literature
```
Safety mechanisms:
- Harness integrity: SHA256 verified before and after each run (sketched below)
- Metric isolation: agents cannot modify tracker.py, harness.py, or data.py
- Circuit breaker: halts if 5+ INVALID in last 20 experiments
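The integrity check is conceptually simple; a minimal sketch of the before/after hashing idea (the file list and the run_experiment placeholder are assumptions, not the repo's actual mechanism):

```python
# Harness-integrity sketch: hash the locked files before and after a run and
# reject the result if anything changed. Paths are illustrative assumptions.
import hashlib, pathlib

LOCKED = ["src/harness.py", "src/sparse_parity/tracker.py"]

def digest(paths):
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(pathlib.Path(p).read_bytes())
    return h.hexdigest()

def run_experiment():
    pass                      # placeholder for the agent's experiment code

before = digest(LOCKED)
run_experiment()
if digest(LOCKED) != before:
    raise RuntimeError("locked files modified during run: result INVALID")
```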
Multiple people run independent experiments, then merge via PR:
```bash
# Merge a contributor's results
bin/merge-findings path/to/their-log.jsonl
# Regenerate the scoreboard
bin/merge-findings research/log.jsonl --scoreboard
```

See CONTRIBUTING.md for the full guide. Three levels of effort:
- Low: drop raw results in `contributions/` (any format)
- Medium: write a findings doc using `findings/_template.md`
- High: code + results + findings following LAB.md
- Docs site: https://cybertronai.github.io/SutroYaro/
- Eval environment: docs
- Sessions: transcripts and chapters
- Weekly catch-ups: latest
- Telegram: https://t.me/sutro_group
- Main code repo: https://github.com/cybertronai/sutro
- Meetings: Mondays 18:00 at South Park Commons (380 Brannan St, SF)