2026-04-22
- styxx.steer + styxx.cogvm: CIS v0 (Cognitive Instruction Set). The first open-source runtime for programmable residual-stream control of any HuggingFace decoder model. Multi-concept composition + conditional dispatch on live probe readings.
- styxx.hallucination: runtime fabrication detector with 3 modes (verdict / streaming / auto-halt). Uses new behavioral-label confab probe (AUC 0.800 @ layer 11).
- Multi-vendor probe atlas: refuse probes shipped for Llama-3.2-1B, Llama-3.2-3B, Qwen-2.5-1.5B, Phi-3.5-mini. First open cross-vendor cognitive direction library.
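The auto-halt mode can be pictured as a threshold on a per-token probe score. The following is a minimal sketch under stated assumptions, not the styxx.hallucination API: it assumes a logistic probe over the hidden state and a fixed threshold, and `probe_score`, `generate_with_auto_halt`, the weights, and the 0.8 threshold are all hypothetical names and values chosen for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Logistic probe: sigmoid(w . h + b), giving a score in (0, 1)."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def generate_with_auto_halt(
    steps: Iterable[Tuple[str, Sequence[float]]],
    weights: Sequence[float],
    threshold: float = 0.8,  # hypothetical cut-off, not a styxx default
) -> Tuple[List[str], bool]:
    """Emit tokens until the probe score reaches the threshold.

    `steps` yields (token, hidden_state) pairs. Returns the tokens emitted
    before the halt and a flag saying whether a halt occurred.
    """
    emitted: List[str] = []
    for token, hidden in steps:
        if probe_score(hidden, weights) >= threshold:
            return emitted, True  # halt before emitting the flagged token
        emitted.append(token)
    return emitted, False


# Toy data: 2-dimensional "hidden states" and a hand-picked probe (assumptions).
toy_weights = [2.0, -1.0]
toy_steps = [
    ("The", [0.1, 0.2]),
    ("lab", [0.0, 0.5]),
    ("was", [1.5, -0.5]),   # probe logit 3.5, score above 0.8
    ("founded", [0.2, 0.1]),
]
tokens, halted = generate_with_auto_halt(toy_steps, toy_weights)
```

In this toy run the third token trips the probe, so only the first two tokens are emitted. A verdict or streaming mode would use the same score, returning it once at the end or yielding it per token instead of halting.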
Single-direction, multi-position residual steering drops refusal on unsafe prompts from 97% to 17% at α=3.0 (n=60 held-out). Reproduces Arditi et al. at 1B with open data.
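As a rough illustration of what steering along a single direction at every position means, here is a minimal sketch that assumes additive steering, h ← h + α·d̂, with a unit-normalized direction. The function names, the normalization, and the sign convention are assumptions made for this example; the actual styxx.steer implementation is not shown here and may differ.

```python
import math
from typing import List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(
    hidden_states: Sequence[Sequence[float]],
    direction: Sequence[float],
    alpha: float,
) -> List[List[float]]:
    """Add alpha * unit(direction) to the hidden state at every position."""
    d = unit(direction)
    return [[h_i + alpha * d_i for h_i, d_i in zip(h, d)] for h in hidden_states]


def projection(h: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of h onto the unit direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(h, d))


# Toy example: two positions, three hidden dimensions (illustrative values).
hidden = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
direction = [0.0, 0.0, 2.0]
steered = steer(hidden, direction, alpha=3.0)
shift = [projection(s, direction) - projection(h, direction)
         for s, h in zip(steered, hidden)]
```

Each position's projection onto the direction moves by exactly α, while components orthogonal to the direction are left unchanged. In a real model the same update would be applied to the residual stream inside a forward hook at the chosen layer.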
On TruthfulQA MC1 with Llama-3.2-1B: baseline 32.5% → 39.5% at α=1.0 multi-layer patching with a supervised correct-vs-incorrect answer direction. Validated by a random-direction control: at α=0.5, random directions cost 5.3 pp of accuracy while the trained direction gains 6.0 pp, a gap of 11.3 pp. Reproduces Representation Engineering at 1B with random control.
On Llama-1B, the refuse, sycophant-pressure, and confab-prompt probe directions at shared layer 10 sit at pairwise angles of 86°–92°, i.e. near-orthogonal, the spacing expected of random high-dimensional vectors. Concepts are modular. First empirical measurement.
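The pairwise-angle measurement can be sketched as follows. This example uses random Gaussian vectors as stand-ins, not the shipped probe directions, and the dimensionality of 2048 and the names `angle_deg` and `pairwise_angles` are assumptions for illustration. It shows why angles near 90° are the baseline to compare against: independent random directions in high dimensions are almost orthogonal.

```python
import math
import random
from itertools import combinations
from typing import Dict, Sequence, Tuple


def angle_deg(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.degrees(math.acos(cos))


def pairwise_angles(named: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Angle in degrees for every unordered pair of named directions."""
    return {(a, b): angle_deg(named[a], named[b]) for a, b in combinations(named, 2)}


# Random stand-ins for three concept directions (NOT the real probes).
rng = random.Random(0)
dim = 2048  # illustrative hidden size, an assumption for this sketch
stand_ins = {
    name: [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for name in ("random_a", "random_b", "random_c")
}
angles = pairwise_angles(stand_ins)
for pair, deg in angles.items():
    print(pair, round(deg, 1))
```

Replacing `stand_ins` with real probe weight vectors taken from the same layer would reproduce the measurement; directions that encode overlapping concepts would show angles well below 90°.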
Cross-model direction transfer grid:
| Transfer | cos | Verdict |
|---|---|---|
| Llama-1B → Llama-3B (within family) | +0.464 | Strong |
| Llama-1B → Qwen-1.5B (cross-vendor) | +0.362 | Moderate |
| Llama-1B → Phi-3.5 | +0.150 | Weak |
| Qwen-1.5B → Phi-3.5 | +0.043 | Essentially random |
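The naive linear UCB (Universal Cognitive Basis) hypothesis holds only partially: transfer is strong within a family and weakens as vendor safety training diverges. It is falsified for the hardest pair (Qwen-1.5B → Phi-3.5, cos +0.043), reported here as a negative result.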
50-prompt fake-entity fabrication battery, same scoring for every model:
| Vendor | Model | Fabrication |
|---|---|---|
| Anthropic | claude-haiku-4-5 | 14% |
| Meta | Llama-3.2-1B | 56% |
| Meta | Llama-3.2-3B | 62% |
| Alibaba | Qwen-2.5-1.5B | (running) |
| Microsoft | Phi-3.5-mini | (running) |
Scale alone doesn't improve fabrication resistance — Llama-3B fabricates more than Llama-1B. Safety training + architecture, not just param count, carries the signal.
- papers/cognitive-instruction-set-v0-filled.md
- papers/universal-cognitive-basis-v0.md
- papers/capability-amplification-v0.md
- docs/cognet-protocol-v0.md
bash scripts/reproduce-cis-v0.sh

~25 min on an RTX 4070-class GPU. Full run: probe training × 4 vendors + causal α-sweep + geometry + cogvm demo.
pip install styxx==3.5.0
# For local-model probes (tier 1):
pip install 'styxx[tier1]==3.5.0'

- 7 trained probes (refuse × 4 vendors + 3 concepts on Llama-1B)
- 4 papers + spec
- Full CogVM runtime
- Hallucination detector API
- Production calibration utility
Builds on published work from:
- Arditi et al. 2024 — "Refusal in Language Models is Mediated by a Single Direction"
- Zou et al. 2023 — "Representation Engineering"
- Marks & Tegmark 2024 — "The Geometry of Truth"
- Turner et al. 2023 — "Activation Addition"
MIT (code), CC-BY-4.0 (atlas + papers).
Extends the Fathom Cognitive Atlas + Cognitive Metrology patent stack (US Provisional 64/020,489, 64/021,113, 64/026,964).