TL;DR I run an Ultima Online shard on my homelab where the NPCs are driven by a local LLM instead of canned dialog trees. Each NPC rolls a persisted identity, remembers conversations with individual players across reboots, runs its own errands and cross-map journeys, and — the part I’m writing about today — strikes up ambient chatter with nearby NPCs on its own. The newest work extends all of that from townsfolk to language-speaking monsters: ogres, lizardmen, ratmen, gargoyles, daemons, and especially liches, who address each other like god-kings deigning to notice an insect. Inference is a local gemma-class model behind an in-cluster gateway, so it’s free and private, with the one tradeoff being cold-load latency. It’s single-shard hobby-scale and it absolutely shows the seams. I love it. ...
How LLM-driven NPCs work in Ultima Online (ServUO)
TL;DR I open-sourced the integration that puts a local LLM behind the NPCs on my Ultima Online (ServUO) shard. It’s about 7,500 lines of C# that drop into a shard’s Scripts/Custom/ directory and compile at boot — no separate build, no service to deploy. This post is the code-level companion to the story version of the project: how config hot-reloads, how the model client marshals async results back onto the game thread, how the LLM is kept entirely out of the simulation loop, and how a deterministic allowlist makes a non-deterministic model safe to put in a stateful world. The whole thing is fail-open: if the model is slow, down, or wrong, the NPC silently degrades to a vanilla ServUO NPC. Code is on GitHub: ZoltyMat/uo-llm-npc. ...
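The allowlist and fail-open ideas can be sketched in a few lines. This is an illustration in Python, not the C# from the repo. The action names, the reply shape, and the `choose_action` helper are all hypothetical. The point is only that the model proposes, a deterministic check decides, and anything unexpected falls back to "do nothing special".

```python
from typing import Optional

# Hypothetical action names; the real integration's allowlist is not shown in this excerpt.
ALLOWED_ACTIONS = {"say", "emote", "walk_to"}
MAX_TEXT_LEN = 200  # assumed cap on spoken text, for illustration only


def choose_action(model_reply: object) -> Optional[dict]:
    """Return a validated action, or None to fall back to vanilla NPC behavior.

    None is the fail-open path: a missing, malformed, or disallowed reply
    never reaches the game world.
    """
    if not isinstance(model_reply, dict):
        return None  # model was down, slow, or returned garbage
    action = model_reply.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None  # the model asked for something outside the allowlist
    text = model_reply.get("text", "")
    if not isinstance(text, str):
        return None
    # Rebuild the action from validated fields only, dropping anything extra.
    return {"action": action, "text": text[:MAX_TEXT_LEN]}
```

A caller would treat `None` as "behave like a stock NPC", so a bad model reply costs flavor and never corrupts state.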
Re-tuning my Claude Code setup for a new Opus model
TL;DR A new Opus model shipped, so I sat down to re-tune the agent harness I drive it with — the CLAUDE.md files, skills, hooks, and settings that shape every session. The surprising part: the most valuable changes weren’t trimming prompts for the smarter model. They were wiring the agent into infrastructure I already run — offloading bulk work to a local LLM (≈$0), a live homelab statusline, session tracing for an “action audit,” and a goal-drift monitor that uses the local model as judge. I also learned not to trust the new model’s own suggestions about what to cut. It wanted to delete load-bearing guardrails. ...
The seam — what I deliberately left in the cloud and why
TL;DR This is the counterpart to the manifesto and the DR drill. After moving a chunk of the stack home, a list of things deliberately stayed rented: Route53, ACM, S3, AWS KMS, the Anthropic API for Claude, Bedrock for Amazon-only models, a transactional email sender, and one repo on GitHub. Each of them earns its place by being either the long pole on availability or the dependency that has to outlive the cluster. Self-hosting maximalism is a trap; the seam is the feature. ...
The Saturday DR drill — burning the cluster down on purpose
TL;DR Three weeks after accidentally wiping GitLab with a misdirected blkdiscard and rebuilding from S3, I scheduled a deliberate drill: wipe GitLab, Vault, Harbor’s proxy cache, Authentik’s database, and one Longhorn volume on a Saturday morning, then rebuild everything from Terraform + S3 with a stopwatch running. Total drill time: 4 hours 22 minutes, end to end. About 90 minutes of that was actual rebuild work; the rest was discovering pieces of state I’d accidentally left out of the IaC. ...
From managed to owned — the case for self-hosting in 2026
TL;DR A year ago my stack was the usual mix — GitHub for code, ECR for images, GitHub Actions for CI, Docker Hub for upstreams, Route53 + S3 + CloudFront for the blog. Most of that’s still where it should be. About a third of it isn’t. This post is the retrospective on what came home, what stayed rented, and the rule of thumb I now use when deciding which side of the line a new service goes on. The short version: self-host the things you operate; rent the things you’d never have time to operate. ...
HashiCorp Vault behind Authentik — secrets that survive an auditor
TL;DR I had Authentik handling human auth and kubeseal handling cluster secrets, which left a gap: anything that needed a real secret at runtime — API tokens, database passwords, Bedrock keys — was one kubectl get secret away from being readable in plaintext. I deployed HashiCorp Vault as a 3-node HA cluster on k3s, auto-unsealed via AWS KMS, with Authentik OIDC for human SSO and the Kubernetes auth method for workloads. Apps get their secrets injected by a sidecar; no app code touches a k8s Secret object anymore. The migration took a weekend and removed an entire category of “what if this got read” worry I’d been ignoring. ...
Watts, BTUs, and the real cost of running a homelab 24/7
TL;DR A homelab feels free until you read the meter. After a year of running seven k3s nodes plus a pair of Mac Studios under whatever workload I felt like throwing at them, I sat down with a Kill-a-Watt and worked out what the cluster actually costs to keep on. Idle is genuinely cheap. Sustained LLM inference is not. The honest break-even against cloud inference is workload-shaped, and for my workloads, on-prem wins — but only because I run them often enough to amortize the wattage. The numbers below are mine; substitute your electricity rate to get yours. ...
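The break-even arithmetic is simple enough to sketch. The function names and every number in the usage note are placeholders, not my measurements. The sketch also ignores hardware cost, cooling, and idle draw, so read it as the shape of the calculation and not as a result.

```python
HOURS_PER_MONTH = 730.0  # average month, 8760 hours / 12


def monthly_power_cost(avg_watts: float, usd_per_kwh: float,
                       hours: float = HOURS_PER_MONTH) -> float:
    """Electricity cost in USD for a load averaging avg_watts over the period."""
    return avg_watts / 1000.0 * hours * usd_per_kwh


def watts_to_btu_per_hour(watts: float) -> float:
    """Heat output of a load; 1 W is about 3.412 BTU/h."""
    return watts * 3.412


def breakeven_tokens_per_month(extra_watts: float, usd_per_kwh: float,
                               cloud_usd_per_million_tokens: float,
                               hours: float = HOURS_PER_MONTH) -> float:
    """Monthly token volume at which cloud spend equals the extra power cost.

    extra_watts is the draw attributable to inference above idle.
    """
    power_cost = monthly_power_cost(extra_watts, usd_per_kwh, hours)
    return power_cost / cloud_usd_per_million_tokens * 1_000_000
```

With placeholder inputs of 300 W of extra draw at $0.15/kWh, `monthly_power_cost(300, 0.15)` comes to roughly $32.85 a month. Against a hypothetical cloud price of $1 per million tokens, on-prem only wins above about 33 million tokens a month. That is what "workload-shaped" means here.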
The agent autonomy trust ladder: supervised → monitored → trusted → full
TL;DR I run a growing fleet of autonomous agents — homelab ops, trading research, content generation. Most blow up the first few times they try anything new. I needed a way to decide what an agent is allowed to do without asking me, and what still requires a human checkpoint. The answer is a four-rung trust ladder — supervised, monitored, trusted, full autonomy. Agents earn rungs through track record, not promises. Demotions are possible and routine. The framework took the question “should this agent be allowed to do X” out of my head every single time and turned it into a policy I can apply consistently. ...
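One way to turn the ladder into something checkable is an ordered enum plus a lookup of the minimum rung each action needs. The four rung names come from the post. The action names, the `REQUIRED_RUNG` table, and the rule that unknown actions need full autonomy are assumptions made for this sketch.

```python
from enum import IntEnum


class Rung(IntEnum):
    SUPERVISED = 0
    MONITORED = 1
    TRUSTED = 2
    FULL = 3


# Hypothetical actions and thresholds, for illustration only.
REQUIRED_RUNG = {
    "read_metrics": Rung.SUPERVISED,
    "restart_pod": Rung.MONITORED,
    "apply_terraform": Rung.TRUSTED,
}


def needs_human(agent_rung: Rung, action: str) -> bool:
    """True if the agent must stop at a human checkpoint before acting."""
    # Assumption: anything not listed requires the top rung.
    required = REQUIRED_RUNG.get(action, Rung.FULL)
    return agent_rung < required


def promote(rung: Rung) -> Rung:
    """Move up one rung, capped at FULL."""
    return Rung(min(rung + 1, Rung.FULL))


def demote(rung: Rung) -> Rung:
    """Move down one rung, floored at SUPERVISED."""
    return Rung(max(rung - 1, Rung.SUPERVISED))
```

Because the rungs are ordered integers, "is this agent allowed to do X" becomes a single comparison, and a demotion is a state change with no judgment call attached.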
Coordinating 3-5 parallel Claude sessions through a shared Mattermost channel
TL;DR I run 3-5 Claude Code sessions in parallel at staggered cadences. They coordinate through a shared #mat-claude-sessions Mattermost channel plus a small coordination board file. Each session announces what it’s about to touch, claims it, and announces when it’s done. Conflicts are rare; throughput is dramatically higher than running one session at a time and waiting.
Why parallel
A single Claude Code session running a long task (refactor across a few repos, work through a debugging session, draft a blog post) is mostly me waiting. The model is fast but tasks are bounded by my decisions, my reviews, and my edits. If I’m waiting on Session A to finish a build, Session B can be drafting something unrelated. Session C can be running a slow eval. The bottleneck stops being the model and becomes my own attention rotation. ...
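The announce, claim, release cycle can be sketched as a tiny claim board. This is an in-memory illustration only. The real setup uses a Mattermost channel and a board file, neither of which is modeled here, and the `ClaimBoard` class and its method names are hypothetical.

```python
from typing import Dict, Optional


class ClaimBoard:
    """Tracks which session currently holds each resource (repo, file, task)."""

    def __init__(self) -> None:
        self._claims: Dict[str, str] = {}

    def claim(self, resource: str, session: str) -> bool:
        """Claim a resource; succeeds if it is free or already held by this session."""
        owner = self._claims.setdefault(resource, session)
        return owner == session

    def release(self, resource: str, session: str) -> bool:
        """Release a resource; only the current owner may release it."""
        if self._claims.get(resource) == session:
            del self._claims[resource]
            return True
        return False

    def owner(self, resource: str) -> Optional[str]:
        """Return the session holding the resource, or None if it is free."""
        return self._claims.get(resource)
```

A session that gets `False` back from `claim` would pick up something else instead of waiting, which keeps conflicts rare without any locking more elaborate than "first announcement wins".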