Knowledgeable Intelligent Tool-using Tabletop Yoda
An offline-first, voice-enabled fabrication lab orchestrator running on Mac Studio M3 Ultra. Think "JARVIS for your workshop" - but it actually works, runs locally, and won't spy on you. 🔒
KITTY is a technical AI habitat - a maker space purpose-built for AI models like Claude, GPT-5, Llama, Qwen, and Mistral to come "live," run research, and directly control fabrication hardware. Built on the energy-efficient Mac Studio M3 Ultra, it provides a secure network interface to 3D printers, CNC machines, test rigs, and sensing equipment.
What makes KITTY different:
- 🏠 AI Residency Model: Models can spin up for a single query or remain active for deep, after-hours projects
- 🤖 Bounded Autonomy: One KITTY-owned project per week with controlled access to printers, inventory, and research
- ♻️ Sustainable Manufacturing: Prioritizes ethically sourced materials with robotic procurement workflows
- 🚀 Idea → Prototype Pipeline: Investigate materials, estimate costs, run simulations, then orchestrate fabrication
- ⚡ Energy Efficient: Mac Studio runs indefinitely with minimal power draw
📖 Full Vision & Roadmap: See NorthStar/ProjectVision.md for the complete multi-phase implementation plan.
| Component | Purpose | Technology |
|---|---|---|
| Q4 Tool Orchestrator | Fast tool calling, ReAct agent | llama.cpp (Athene V2 Agent Q4_K_M) @ port 8083 |
| Primary Reasoner | Deep reasoning with thinking mode | Ollama (GPT-OSS 120B) @ port 11434 |
| Fallback Reasoner | (DEPRECATED) Legacy fallback only | llama.cpp (Llama 3.3 70B F16) @ port 8082 |
| Vision Model | Image understanding, multimodal | llama.cpp (Gemma 3 27B Q4_K_M) @ port 8086 |
| Summary Model | Response compression | llama.cpp (Hermes 3 8B Q4_K_M) @ port 8084 |
| Coder Model | Code generation specialist | llama.cpp (Devstral 2 123B Q5_K_M) @ port 8087 |
| Cloud Providers | Shell/Collective model selection | OpenAI GPT-5.2, Claude Sonnet 4.5, Perplexity Sonar, Gemini 2.5 |
| Service | Port | Purpose |
|---|---|---|
| Brain | 8000 | Core orchestrator, ReAct agent, intelligent routing |
| Gateway | 8080 | REST API (HAProxy load-balanced, 3 replicas) |
| CAD | 8200 | 3D model generation (Zoo, Tripo, local CadQuery) |
| Fabrication | 8300 | Printer control, queue management, mesh segmentation, Bambu Labs integration |
| Voice | 8400 | Real-time STT/TTS with local Whisper + Kokoro/Piper |
| Discovery | 8500 | Network device scanning (mDNS, SSDP, Bambu/Snapmaker UDP) |
| Broker | 8777 | Command execution with allow-list safety |
| Images | 8600 | Stable Diffusion generation with RQ workers |
| Mem0 MCP | 8765 | Semantic memory with vector embeddings |
| Component | Purpose |
|---|---|
| Menu | Landing page with navigation cards to all sections |
| Voice | Real-time voice assistant with Local/Cloud toggle |
| Shell | Text chat with function calling, streaming, and cloud model selection |
| Projects | CAD project management with artifact browser |
| Fabrication Console | Printer status, queue management, mesh segmentation, job tracking |
| Settings | Bambu Labs login, preferences, API configuration |
| I/O Control | Feature toggles and provider management |
| Research | Autonomous research pipeline with real-time streaming |
| Vision Gallery | Reference image search and storage |
| Image Generator | Stable Diffusion generation interface |
| Material Inventory | Filament catalog and stock management |
| Print Intelligence | Success prediction and recommendations dashboard |
| Collective | Multi-agent deliberation for better decisions |
| Wall Terminal | Full-screen display mode |
| Service | Technology | Purpose |
|---|---|---|
| Load Balancer | HAProxy | Gateway traffic distribution, health checks |
| Database | PostgreSQL 16 | Audit logs, state, projects (clustering optional) |
| Cache | Redis 7 | Semantic cache, routing state, feature flags |
| Vector DB | Qdrant 1.11 | Memory embeddings, semantic search |
| Object Storage | MinIO | CAD artifacts, images, snapshots (S3-compatible) |
| Message Queue | RabbitMQ 3.12 | Async events, job distribution |
| MQTT Broker | Eclipse Mosquitto 2.0 | Device communication, printer telemetry |
| Search Engine | SearXNG | Private, local web search |
| Smart Home | Home Assistant | Device control, automation |
| Metrics | Prometheus + Grafana | Observability dashboards |
| Logs | Loki | Log aggregation |
| Traces | Tempo | Distributed tracing |
- Hardware: Mac Studio M3 Ultra recommended (256GB+ RAM for large models)
- OS: macOS 14+ with Xcode command line tools
- Software: Docker Desktop, Python 3.11, Node 20, Homebrew
# Clone the repository
git clone https://github.com/yourusername/KITT.git
cd KITT
# Install developer tools
pip install --upgrade pip pre-commit
pre-commit install
# Create environment file
cp .env.example .env
# Edit .env with your settings (see Configuration section below)
# Setup artifacts directory (for accessing 3MF/GLB files in Finder)
./ops/scripts/setup-artifacts-dir.sh
# Start everything
./ops/scripts/start-all.sh

After startup, open your browser to:
| Interface | URL | Description |
|---|---|---|
| Main UI | http://localhost:4173 | Menu landing page with all features |
| Voice | http://localhost:4173/?view=voice | Real-time voice assistant |
| API Docs | http://localhost:8080/docs | Swagger/OpenAPI documentation |
| Grafana | http://localhost:3000 | Metrics and dashboards |
KITTY includes a hybrid voice system with local-first processing, wake word detection, and cloud fallback:
┌──────────────────────────────────────────────────────────────────────────────┐
│ Voice Service (:8400) │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Wake Word Detection (Optional) │ │
│ │ Porcupine ("Hey Kitty") → Activates Listening │ │
│ └──────────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ ┌───────────────────────────────────────┐ │
│ │ STT (Speech) │ │ TTS (Synthesis) │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────────────────────┐ │ │
│ │ │ Local Whisper │ │ │ │ Kokoro ONNX (Apple Silicon) │ │ │
│ │ │ (base.en) │ │ │ │ am_michael (male), af (female) │ │ │
│ │ └───────┬─────────┘ │ │ └───────────────┬─────────────────┘ │ │
│ │ │ Fallback │ │ │ Fallback │ │
│ │ ┌───────▼─────────┐ │ │ ┌───────────────▼─────────────────┐ │ │
│ │ │ OpenAI API │ │ │ │ Piper TTS (Legacy) │ │ │
│ │ │ Whisper │ │ │ │ amy/ryan (22050 Hz) │ │ │
│ │ └─────────────────┘ │ │ └───────────────┬─────────────────┘ │ │
│ └───────────────────────┘ │ │ Fallback │
│ │ ┌───────────────▼─────────────────┐ │ │
│ │ │ OpenAI TTS (tts-1) │ │ │
│ │ └─────────────────────────────────┘ │ │
│ └───────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WebSocket Handler │ │
│ │  Real-time streaming • PTT or Always-Listening • Adaptive chunking      │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
Wake Word Detection (Porcupine)
- Engine: Picovoice Porcupine v4.0
- Wake word: "Hey Kitty" (custom trained model)
- Location: `~/.local/share/kitty/models/Hey-Kitty_en_mac_v4_0_0.ppn`
- Features: Low CPU usage, configurable sensitivity
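A minimal detection sketch, assuming the `pvporcupine` package (the access key and model path come from the voice configuration shown later; the audio capture loop is omitted):

```python
# Sketch only: wake-word detection with Porcupine.
import os
import pvporcupine

porcupine = pvporcupine.create(
    access_key=os.environ["PORCUPINE_ACCESS_KEY"],
    keyword_paths=[os.path.expanduser(
        "~/.local/share/kitty/models/Hey-Kitty_en_mac_v4_0_0.ppn")],
    sensitivities=[0.5],
)
# Feed 16-bit mono PCM frames of porcupine.frame_length samples at
# porcupine.sample_rate; process() returns 0 when "Hey Kitty" is
# detected, -1 otherwise:
# keyword_index = porcupine.process(frame)
```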
Speech-to-Text (Whisper.cpp)
- Model: `base.en` (English-optimized, ~150MB)
- Location: `~/.cache/whisper/ggml-base.en.bin`
- Features: VAD, real-time transcription
Text-to-Speech (Kokoro ONNX) - Primary
- Model: Kokoro v1.0 ONNX (~82MB, optimized for Apple Silicon)
- Location: `~/.local/share/kitty/models/kokoro-v1.0.onnx`
- Voices: `am_michael` (male cowboy), `af` (female), plus 20+ additional voices
- Sample rate: 24000 Hz
- Features: Adaptive text chunking, CoreML acceleration, streaming
Text-to-Speech (Piper) - Fallback
- Models: `en_US-amy-medium` (female), `en_US-ryan-medium` (male)
- Location: `/Users/Shared/Coding/models/Piper/`
- Sample rate: 22050 Hz
Voice Mapping (TTS):
- alloy, nova, shimmer → af/amy (female)
- echo, fable, onyx → am_michael/ryan (male)
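As a lookup table, that mapping is roughly (a sketch; the Piper model names come from the Piper section above):

```python
# Sketch: OpenAI-style voice names mapped to (Kokoro voice, Piper model),
# per the bullets above.
VOICE_MAP: dict[str, tuple[str, str]] = {
    # female voices → Kokoro "af" / Piper "amy"
    "alloy":   ("af", "en_US-amy-medium"),
    "nova":    ("af", "en_US-amy-medium"),
    "shimmer": ("af", "en_US-amy-medium"),
    # male voices → Kokoro "am_michael" / Piper "ryan"
    "echo":  ("am_michael", "en_US-ryan-medium"),
    "fable": ("am_michael", "en_US-ryan-medium"),
    "onyx":  ("am_michael", "en_US-ryan-medium"),
}
```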
# Voice service
VOICE_BASE_URL=http://localhost:8400
VOICE_PREFER_LOCAL=true # Use local models first
VOICE_DEFAULT_VOICE=alloy # Default TTS voice
VOICE_SAMPLE_RATE=16000 # Audio sample rate
# Local Whisper STT
WHISPER_MODEL=base.en # Model size (tiny, base, small, medium, large)
WHISPER_MODEL_PATH= # Optional custom path
# Wake Word Detection (Porcupine)
PORCUPINE_ACCESS_KEY=your-key # Get from console.picovoice.ai
WAKE_WORD_ENABLED=true # Enable wake word detection
WAKE_WORD_MODEL_PATH=~/.local/share/kitty/models/Hey-Kitty_en_mac_v4_0_0.ppn
WAKE_WORD_SENSITIVITY=0.5 # Detection sensitivity (0.0-1.0)
# Local TTS Provider Selection
LOCAL_TTS_PROVIDER=kokoro # Primary: kokoro, fallback: piper
# Kokoro TTS (Primary)
KOKORO_ENABLED=true
KOKORO_MODEL_PATH=~/.local/share/kitty/models/kokoro-v1.0.onnx
KOKORO_VOICES_PATH=~/.local/share/kitty/models/voices-v1.0.bin
KOKORO_VOICE=am_michael # am_michael (male), af (female)
# Piper TTS (Fallback)
PIPER_MODEL_DIR=/Users/Shared/Coding/models/Piper
# Cloud TTS (Final Fallback)
OPENAI_TTS_MODEL=tts-1

# Start voice service standalone
./ops/scripts/start-voice-service.sh
# Stop voice service
./ops/scripts/stop-voice-service.sh
# Check voice status
curl http://localhost:8080/api/voice/status | jq .

| Endpoint | Method | Description |
|---|---|---|
| `/api/voice/status` | GET | Provider status (local/cloud availability) |
| `/api/voice/transcribe` | POST | Transcribe audio to text |
| `/api/voice/synthesize` | POST | Convert text to speech |
| `/api/voice/ws` | WebSocket | Real-time bidirectional streaming |
| `/api/voice/chat` | POST | Full voice chat (STT → LLM → TTS) |
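For example, a quick synthesis call through the gateway (a sketch; the JSON field names and raw-audio response are assumptions, so verify against the Swagger docs at `/docs`):

```python
# Sketch: text-to-speech via the gateway REST API.
import requests

resp = requests.post(
    "http://localhost:8080/api/voice/synthesize",
    json={"text": "Printer one is ready.", "voice": "am_michael"},
    timeout=60,
)
resp.raise_for_status()
with open("reply.wav", "wb") as f:
    f.write(resp.content)  # assumes the endpoint returns audio bytes
```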
| Event | Direction | Description |
|---|---|---|
| `config` | Client → Server | Set session config (mode, voice, prefer_local) |
| `audio.chunk` | Client → Server | Base64 encoded audio chunk |
| `audio.end` | Client → Server | Signal end of speech |
| `wake_word.toggle` | Client → Server | Enable/disable wake word detection |
| `wake_word.detected` | Server → Client | Wake word triggered |
| `transcript` | Server → Client | STT result (partial or final) |
| `response.text` | Server → Client | Streaming text response |
| `response.audio` | Server → Client | TTS audio chunk (base64) |
| `function.call` | Server → Client | Tool invocation started |
| `function.result` | Server → Client | Tool execution result |
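A minimal client sketch for this protocol, assuming the `websockets` package and JSON envelopes keyed by a `type` field (the exact payload shape is an assumption):

```python
# Sketch: drive the voice WebSocket in PTT mode. Event names come from
# the table above; the JSON envelope fields are assumptions.
import asyncio
import base64
import json

import websockets

async def main() -> None:
    async with websockets.connect("ws://localhost:8080/api/voice/ws") as ws:
        await ws.send(json.dumps(
            {"type": "config", "mode": "ptt", "voice": "alloy", "prefer_local": True}))
        with open("utterance.wav", "rb") as f:
            await ws.send(json.dumps(
                {"type": "audio.chunk", "data": base64.b64encode(f.read()).decode()}))
        await ws.send(json.dumps({"type": "audio.end"}))
        async for raw in ws:  # transcript, response.text, response.audio, ...
            event = json.loads(raw)
            print(event.get("type"))

asyncio.run(main())
```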
The UI starts with a Menu page showing all available sections:
| Section | Icon | Description |
|---|---|---|
| Voice | 🎙️ | Real-time voice assistant with STT/TTS |
| Chat Shell | 💬 | Text chat with function calling |
| Fabrication Console | 🎨 | Text-to-3D model generation |
| Projects | 📁 | CAD project management |
| Dashboard | 🖨️ | Printers, cameras, and material inventory |
| Media Hub | 🖼️ | Vision gallery and image generation |
| Research Hub | 🔬 | Research, results, and scheduling |
| Collective | 👥 | Multi-agent deliberation for better decisions |
| Intelligence | 📈 | Analytics and insights dashboard |
| Wall Terminal | 🖥️ | Full-screen display mode |
| Settings | ⚙️ | Bambu Labs, preferences, API config |
cd services/ui
# Development mode (hot reload)
npm run dev -- --host 0.0.0.0 --port 4173
# Production build
npm run build
npm run preview

KITTY_UI_BASE=http://localhost:4173 # UI base URL
VITE_API_BASE=http://localhost:8080 # Gateway API URL

KITTY includes a local-first AI coding assistant powered by Devstral 2 123B, Mistral's agentic coding model.
| Component | Description |
|---|---|
| kitty-code | Textual TUI for interactive coding sessions |
| coder-agent | Backend service with Plan→Code→Test workflow |
| Web UI | Coding page at /coding with SSE streaming |
# Install kitty-code CLI
cd services/kitty-code
pip install -e .
# Run interactive session
kitty-code "Write a REST API with FastAPI"
# Or use the web UI at http://localhost:4173/coding

- Parameters: 123 billion
- Quantization: Q5_K_M (~82GB sharded GGUF)
- Context: 16,384 tokens
- Backend: llama.cpp (natively handles sharded files)
- Speed: ~5 tokens/second on Mac Studio M3 Ultra
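Since the coder runs behind llama.cpp's OpenAI-compatible server, it can also be queried directly; a sketch (the model name is a placeholder, since llama.cpp serves whatever model it loaded):

```python
# Sketch: hit the coder server's OpenAI-compatible chat endpoint.
import requests

resp = requests.post(
    "http://localhost:8087/v1/chat/completions",
    json={
        "model": "devstral",  # placeholder name
        "messages": [{"role": "user", "content": "Write a FastAPI health endpoint."}],
        "temperature": 0.2,
        "max_tokens": 512,
    },
    timeout=300,  # ~5 tok/s, so long generations take a while
)
print(resp.json()["choices"][0]["message"]["content"])
```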
# Download via huggingface-cli (~82GB)
huggingface-cli download bartowski/mistralai_Devstral-2-123B-Instruct-2512-GGUF \
--include "mistralai_Devstral-2-123B-Instruct-2512-Q5_K_M/*" \
--local-dir ~/models/devstral2/Q5_K_M

# Enable Devstral 2 via llama.cpp
LLAMACPP_CODER_ENABLED=true
LLAMACPP_CODER_MODEL=/path/to/devstral2/Q5_K_M/.../00001-of-00003.gguf
LLAMACPP_CODER_CTX=16384
LLAMACPP_CODER_PARALLEL=2
LLAMACPP_CODER_TEMPERATURE=0.2

kitty-code auto-discovers KITTY services as MCP servers:
kitty_brain → http://localhost:8000/mcp (Query routing, research)
kitty_cad → http://localhost:8200/mcp (3D model generation)
kitty_fab → http://localhost:8300/mcp (Printer control)
kitty_discovery → http://localhost:8500/mcp (Device scanning)
See services/kitty-code/README.md for full documentation.
# Start everything (llama.cpp + Docker + Voice)
./ops/scripts/start-all.sh
# Stop everything
./ops/scripts/stop-all.sh
# Start only voice service
./ops/scripts/start-voice-service.sh
# Stop only voice service
./ops/scripts/stop-voice-service.sh
# Check service status
docker compose -f infra/compose/docker-compose.yml ps

# Install CLI (one-time)
pip install -e services/cli/
# Launch interactive shell
kitty-cli shell
# Inside the shell:
> /help # Show available commands
> /voice # Toggle voice mode
> /research <query> # Autonomous research
> /cad Create a hex box # Generate CAD model
> /split /path/to/model.stl # Split oversized model for printing
> /remember Ordered more PLA # Save long-term note
> /memories PLA # Recall saved notes
> /vision gandalf rubber duck # Search reference images
> /generate futuristic drone # Generate SD image
> /collective council k=3 Compare... # Multi-agent collaboration
> /exit # Exit shell
# Quick one-off queries
kitty-cli say "What printers are online?"
kitty-cli say "Turn on bench lights"# Install launcher (one-time)
pip install -e services/launcher/
# Launch unified control center
kitty
# TUI Shortcuts:
# k - Start KITTY stack
# x - Stop stack
# c - Launch CLI
# v - Launch Voice interface
# m - Launch Model Manager
# o - Open Web Console
# i - Launch I/O dashboard
# q - Quit

Query arrives → Local model (free, instant) → Confidence check
↓
High confidence? ──→ Return answer
↓
Low confidence? ──→ Escalate to cloud (budget gated)
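In code terms, the gate looks roughly like this (a sketch; `ask_local` and `ask_cloud` are hypothetical stand-ins for the real model clients, and the threshold comes from `CONFIDENCE_THRESHOLD` in `.env`):

```python
# Sketch of the confidence-gated escalation.
CONFIDENCE_THRESHOLD = 0.80

def ask_local(query: str) -> tuple[str, float]:
    return "local answer", 0.92          # stub: local model + confidence score

def ask_cloud(query: str) -> str:
    return "cloud answer"                # stub: budget-gated escalation

def answer(query: str) -> str:
    text, confidence = ask_local(query)  # free, instant
    if confidence >= CONFIDENCE_THRESHOLD:
        return text
    return ask_cloud(query)              # escalate only when unsure
```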
KITTY uses a ReAct (Reasoning + Acting) agent that can:
- Reason about complex multi-step tasks
- Use tools via Model Context Protocol (MCP)
- Observe results and adapt strategy
- Iterate until task completion
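The core loop, reduced to a sketch (both helpers are hypothetical stand-ins for the orchestrator model and the MCP tool dispatch):

```python
# Sketch of the ReAct loop: reason -> act -> observe -> repeat.
def plan_next_step(goal: str, scratchpad: list[str]) -> dict:
    # stub: the orchestrator model would emit the next thought/action here
    return {"action": "finish", "answer": f"done: {goal}"}

def call_tool(tool: str, args: dict) -> str:
    return "observation"  # stub: MCP tool invocation

def react(goal: str, max_steps: int = 8) -> str:
    scratchpad: list[str] = []
    for _ in range(max_steps):
        step = plan_next_step(goal, scratchpad)       # reason
        if step["action"] == "finish":
            return step["answer"]
        obs = call_tool(step["tool"], step["args"])   # act
        scratchpad.append(f"{step.get('thought', '')} -> {obs}")  # observe
    return "stopped: max steps reached"
```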
- Hazard workflows: Two-step confirmation for dangerous operations
- Command allow-lists: Only pre-approved system commands execute
- Audit logging: Every tool use logged to PostgreSQL
- Budget gates: Cloud API calls require password confirmation
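The allow-list check is conceptually simple; a sketch (the command set here is illustrative, not the Broker's real configuration):

```python
# Sketch of the Broker's allow-list gate.
import subprocess

ALLOWED_COMMANDS = {"ls", "df", "uptime"}

def run_command(argv: list[str]) -> str:
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not in allow-list: {argv!r}")
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    # the real service also writes an audit record to PostgreSQL here
    return result.stdout
```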
Generate 3D models from natural language:
kitty-cli cad "Create a phone stand with 45° angle and cable management"Providers (automatic fallback):
- Zoo API (parametric STEP)
- Tripo (mesh STL/OBJ)
- Local CadQuery (offline)
- Local FreeCAD (offline)
Multi-printer coordination with intelligent scheduling:
# List queue
./scripts/queue-cli.sh list
# Submit job
./scripts/queue-cli.sh submit /path/to/model.stl "bracket_v2" pla_black_esun 3
# Watch queue
./scripts/queue-cli.sh watch

Supported Printers:
- Bambu Labs H2D (MQTT)
- Elegoo OrangeStorm Giga (Klipper)
- Snapmaker Artisan (UDP)
- Any OctoPrint/Moonraker instance
Split oversized 3D models into printable parts (supports 3MF and STL):
# Via CLI
kitty-cli shell
> /split /path/to/large_model.3mf
# Via API
curl -X POST http://localhost:8300/api/segmentation/segment \
-H "Content-Type: application/json" \
-d '{"mesh_path": "/path/to/model.3mf", "printer_id": "bamboo_h2d"}'Features:
- 3MF native: Prefers 3MF input/output for slicer compatibility (STL also supported)
- Automatic splitting: Detects oversized models and splits into printer-fit parts
- SDF hollowing: Reduce material usage with configurable wall thickness
- Alignment joints: Dowel pin holes for accurate part assembly
- 3MF assembly output: Single file with all parts, colors, and metadata
- Configurable printers: Load build volumes from `printer_config.yaml`
Segmentation Options:
| Option | Description | Default |
|---|---|---|
| `printer_id` | Target printer for build volume | auto-detect |
| `enable_hollowing` | Enable SDF-based hollowing | true |
| `wall_thickness_mm` | Wall thickness for hollowing (mm) | 2.0 |
| `joint_type` | Joint type: dowel, dovetail, pyramid, none | dowel |
| `max_parts` | Maximum parts to generate | 10 |
Interfaces:
- CLI: `/split` command in kitty-cli shell
- Voice: "Split this model for printing"
- Web UI: MeshSegmenter component in Fabrication Console
- API: `POST /api/segmentation/segment` on the Fabrication service (fuller example below)
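A sketch of an API call with the segmentation options spelled out (field names follow the options table above; values are illustrative):

```python
# Sketch: segmentation request with explicit options.
import requests

resp = requests.post(
    "http://localhost:8300/api/segmentation/segment",
    json={
        "mesh_path": "/path/to/model.3mf",
        "printer_id": "bamboo_h2d",
        "enable_hollowing": True,
        "wall_thickness_mm": 2.0,
        "joint_type": "dowel",
        "max_parts": 10,
    },
    timeout=600,  # segmenting large meshes can take minutes
)
resp.raise_for_status()
print(resp.json())
```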
┌──────────────────────────────────────────────────────────────────────────────┐
│ Mac Studio M3 Ultra Host │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Local AI Inference Layer │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Ollama (Primary Reasoner) :11434 │ │ │
│ │ │ GPT-OSS 120B (Thinking Mode) │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ llama.cpp Servers (Metal GPU) │ │ │
│ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ │Q4 Tool │ │F16 DEPR │ │Vision │ │Summary │ │Coder │ │ │ │
│ │ │ │:8083 │ │:8082 │ │:8086 │ │:8084 │ │:8087 │ │ │ │
│ │ │ │Athene V2 │ │(Fallback)│ │Gemma 27B │ │Hermes 8B │ │Devstral 2│ │ │ │
│ │ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Docker Compose Services │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐│ │
│ │ │ HAProxy :8080 (Load Balancer) ││ │
│ │ └───────────────────────────────┬─────────────────────────────────────┘│ │
│ │ │ │ │
│ │ ┌───────────────────────────────▼─────────────────────────────────────┐│ │
│ │ │ Gateway (x3 replicas) ││ │
│ │ │ REST API, Auth, Routing, Proxy ││ │
│ │ └───────────────────────────────┬─────────────────────────────────────┘│ │
│ │ │ │ │
│ │ ┌───────────────────────────────▼─────────────────────────────────────┐│ │
│ │ │ Brain :8000 ││ │
│ │ │ Orchestrator • ReAct Agent • Research Pipeline ││ │
│ │ └──┬────────┬────────┬────────┬────────┬────────┬────────┬───────────┘│ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐ │ │
│ │ │ CAD │ │ Fab │ │Voice│ │Disc │ │Brok │ │Imgs │ │Mem0 │ │ │
│ │ │:8200│ │:8300│ │:8400│ │:8500│ │:8777│ │:8600│ │:8765│ │ │
│ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────────────────┐│ │
│ │ │ Storage & Infrastructure ││ │
│ │ │ PostgreSQL │ Redis │ Qdrant │ MinIO │ RabbitMQ │ Mosquitto ││ │
│ │ └───────────────────────────────────────────────────────────────────┘│ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Web UI :4173 │ │
│ │ Menu │ Voice │ Shell │ Projects │ Fab │ Research │ Settings │ ... │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
│ │ │
┌──────────▼──────────┐ ┌──────────▼──────────┐ ┌────▼────┐
│ Home Assistant │ │ 3D Printers │ │ Cameras │
│ (MQTT + REST) │ │ Bambu │ Elegoo │ │ (Pi) │
│ Lights, Climate, │ │ Snapmaker │ Others │ │ │
│ Sensors, Locks │ │ (OctoPrint/Klipper)│ │ │
└─────────────────────┘ └─────────────────────┘ └─────────┘
# User & Safety
USER_NAME=YourName
KITTY_USER_NAME=YourName
HAZARD_CONFIRMATION_PHRASE="Confirm: proceed"
API_OVERRIDE_PASSWORD=omega
# Budget
BUDGET_PER_TASK_USD=0.50
CONFIDENCE_THRESHOLD=0.80

# Ollama (Primary Reasoner)
LOCAL_REASONER_PROVIDER=ollama
OLLAMA_HOST=http://host.docker.internal:11434
OLLAMA_MODEL=gpt-oss:120b
OLLAMA_THINK=medium
# llama.cpp Q4 (Tool Orchestrator)
LLAMACPP_Q4_HOST=http://host.docker.internal:8083
LLAMACPP_Q4_MODEL=athene-v2-agent/Athene-V2-Agent-Q4_K_M.gguf
LLAMACPP_Q4_PORT=8083
# DEPRECATED: llama.cpp F16 (Legacy Fallback - only used when LOCAL_REASONER_PROVIDER=llamacpp)
# LLAMACPP_F16_HOST=http://host.docker.internal:8082
# LLAMACPP_F16_MODEL=llama-3-70b/Llama-3.3-70B-Instruct-F16/...gguf
# LLAMACPP_F16_PORT=8082
# Vision
LLAMACPP_VISION_MODEL=gemma-3-27b-it-GGUF/gemma-3-27b-it-q4_k_m.gguf
LLAMACPP_VISION_MMPROJ=gemma3_27b_mmproj/mmproj-model-f16.gguf
LLAMACPP_VISION_PORT=8086

KITTY uses embedding-based semantic search to intelligently select relevant tools for each query, reducing context usage by ~90% when many tools are available.
How it works:
- Tool definitions are converted to text embeddings using `all-MiniLM-L6-v2` (384 dimensions)
- Embeddings are cached in Redis for cluster-wide sharing
- For each query, cosine similarity finds the most relevant tools
- Only top-k matching tools are passed to the model (instead of all 50+)
Benefits:
- Context savings: ~90% reduction (e.g., 600 tokens vs 7,500 for 50 tools)
- Better tool selection: Semantic matching beats keyword heuristics
- Cluster-ready: Redis caching shares embeddings across nodes
- Fast: ~10-15ms per search after initial model load
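A sketch of the selection step described above (assumes the `sentence-transformers` package; the real implementation also caches embeddings in Redis):

```python
# Sketch: embedding-based tool selection via cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

def select_tools(query: str, tools: dict[str, str],
                 top_k: int = 5, threshold: float = 0.3) -> list[str]:
    """Pick the tools whose descriptions best match the query."""
    names = list(tools)
    tool_vecs = model.encode([tools[n] for n in names], normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = tool_vecs @ query_vec  # cosine similarity on unit vectors
    ranked = sorted(zip(names, scores), key=lambda pair: -pair[1])
    return [name for name, score in ranked[:top_k] if score >= threshold]
```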
# Semantic tool selection (default: enabled)
USE_SEMANTIC_TOOL_SELECTION=true
EMBEDDING_MODEL=all-MiniLM-L6-v2
TOOL_SEARCH_TOP_K=5
TOOL_SEARCH_THRESHOLD=0.3

Disabling: Set `USE_SEMANTIC_TOOL_SELECTION=false` to fall back to keyword-based selection.
KITTY supports parallel multi-agent orchestration for complex, multi-step goals. When enabled, complex queries are decomposed into parallelizable tasks executed concurrently across multiple specialized agents.
Architecture:
User Goal → Decompose (Q4) → Dependency Graph → Parallel Execute → Synthesize (GPTOSS)
│
┌───────────────┼───────────────┐
▼ ▼ ▼
[Task 1] [Task 2] [Task 3]
researcher cad_designer fabricator
│ │ │
└───────────────┼───────────────┘
▼
Final Response
Specialized Agents:
| Agent | Primary Model | Purpose | Tool Allowlist |
|---|---|---|---|
| researcher | Q4 (Athene V2) | Web research, information gathering | web_search, fetch_webpage |
| reasoner | GPTOSS 120B | Deep analysis, chain-of-thought | (none - pure reasoning) |
| cad_designer | Q4 (Athene V2) | 3D model generation | generate_cad_model, image_search |
| fabricator | Q4 (Athene V2) | Print preparation, segmentation | fabrication.* |
| coder | Devstral 2 123B | Code generation, analysis | (all code tools) |
| vision_analyst | Gemma 27B | Image understanding | vision.*, camera.* |
| analyst | Q4 (Athene V2) | Memory search, data analysis | memory.* |
| summarizer | Hermes 8B | Response compression | (none - compression only) |
Slot Allocation (20 concurrent slots):
| Endpoint | Port | Model | Slots | Context |
|---|---|---|---|---|
| Q4 | 8083 | Athene V2 Q4 | 6 | 128K |
| GPTOSS | 11434 | GPT-OSS 120B | 2 | 65K |
| Vision | 8086 | Gemma 27B | 4 | 4K |
| Coder | 8087 | Devstral 2 123B | 2 | 16K |
| Summary | 8084 | Hermes 8B | 4 | 4K |
Configuration:
# Enable parallel agent orchestration
ENABLE_PARALLEL_AGENTS=false # Master enable flag (disabled by default)
PARALLEL_AGENT_ROLLOUT_PERCENT=0 # Gradual rollout (0-100%)
PARALLEL_AGENT_MAX_TASKS=6 # Max tasks per execution
PARALLEL_AGENT_MAX_CONCURRENT=8 # Max concurrent slot usage
PARALLEL_AGENT_COMPLEXITY_THRESHOLD=0.6 # Query complexity threshold (0.0-1.0)
# Coder Server (Devstral 2 123B via llama.cpp)
LLAMACPP_CODER_ENABLED=true
LLAMACPP_CODER_MODEL=/path/to/devstral2/Q5_K_M/mistralai_Devstral-2-123B-Instruct-2512-Q5_K_M-00001-of-00003.gguf
LLAMACPP_CODER_HOST=http://localhost:8087
LLAMACPP_CODER_PORT=8087
LLAMACPP_CODER_CTX=16384
LLAMACPP_CODER_PARALLEL=2

Performance Benefits:
| Scenario | Sequential | Parallel | Improvement |
|---|---|---|---|
| 3-task research | ~45s | ~15s | 3x faster |
| 5-task CAD+fab | ~90s | ~25s | 3.6x faster |
| GPU utilization | ~15% | ~60% | 4x better |
How it works:
- Complexity Detection: Queries are scored for complexity (keywords, length, multiple questions)
- Task Decomposition: Q4 model breaks goal into independent parallelizable tasks with dependencies
- Slot Acquisition: Tasks acquire slots with exponential backoff and fallback tiers
- Parallel Execution: Independent tasks run concurrently via asyncio.gather()
- Synthesis: GPTOSS 120B aggregates all task results into final response
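A dependency-aware execution sketch (the task shape and `run_task` are hypothetical; decomposition and slot management live in the Brain):

```python
# Sketch: run independent tasks concurrently, respecting dependencies.
import asyncio

async def run_task(task: dict) -> str:
    await asyncio.sleep(0.1)  # stub: the agent/model call goes here
    return f"result of {task['id']}"

async def execute(tasks: list[dict]) -> dict[str, str]:
    results: dict[str, str] = {}
    remaining = {t["id"]: t for t in tasks}
    while remaining:
        # Tasks whose dependencies are all satisfied can run in parallel.
        ready = [t for t in remaining.values()
                 if all(dep in results for dep in t["deps"])]
        if not ready:
            raise RuntimeError("dependency cycle in task graph")
        outputs = await asyncio.gather(*(run_task(t) for t in ready))
        for task, out in zip(ready, outputs):
            results[task["id"]] = out
            del remaining[task["id"]]
    return results

# Tasks 1 and 2 run concurrently; task 3 (synthesis) waits for both.
tasks = [{"id": "t1", "deps": []}, {"id": "t2", "deps": []},
         {"id": "t3", "deps": ["t1", "t2"]}]
print(asyncio.run(execute(tasks)))
```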
Enabling:
# In .env
ENABLE_PARALLEL_AGENTS=true
PARALLEL_AGENT_ROLLOUT_PERCENT=100
# Restart llama.cpp servers to pick up increased Q4 slots
./ops/scripts/llama/restart.sh

# Voice service
VOICE_BASE_URL=http://localhost:8400
VOICE_PREFER_LOCAL=true
VOICE_DEFAULT_VOICE=alloy
VOICE_SAMPLE_RATE=16000
# Local STT (Whisper)
WHISPER_MODEL=base.en
WHISPER_MODEL_PATH=
# Wake Word Detection (Porcupine)
PORCUPINE_ACCESS_KEY=your-key # Get free key from console.picovoice.ai
WAKE_WORD_ENABLED=true
WAKE_WORD_MODEL_PATH=~/.local/share/kitty/models/Hey-Kitty_en_mac_v4_0_0.ppn
WAKE_WORD_SENSITIVITY=0.5
# Local TTS Provider
LOCAL_TTS_PROVIDER=kokoro # kokoro (primary) or piper (fallback)
# Kokoro TTS (Primary)
KOKORO_ENABLED=true
KOKORO_MODEL_PATH=~/.local/share/kitty/models/kokoro-v1.0.onnx
KOKORO_VOICES_PATH=~/.local/share/kitty/models/voices-v1.0.bin
# Piper TTS (Fallback)
PIPER_MODEL_DIR=/Users/Shared/Coding/models/Piper
# Cloud TTS (Final Fallback)
OPENAI_TTS_MODEL=tts-1

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
PERPLEXITY_API_KEY=pplx-...
GOOGLE_API_KEY=AIza... # For Gemini models
ZOO_API_KEY=your-zoo-key
TRIPO_API_KEY=your-tripo-key

The Shell page and Collective Intelligence system support direct cloud model selection. When an API key is present, the corresponding cloud model becomes available in the model selector.
Supported Cloud Models (December 2025):
| UI ID | Provider | Model | Cost (per query) |
|---|---|---|---|
| `gpt5` | OpenAI | GPT-5.2 | ~$0.01-0.06 |
| `claude` | Anthropic | Claude Sonnet 4.5 | ~$0.01-0.08 |
| `perplexity` | Perplexity | Sonar (web-connected) | ~$0.001-0.005 |
| `gemini` | Google | Gemini 2.5 Flash | ~$0.0075-0.03 |
Features:
- Models auto-enable when API key is detected
- Fallback to local Q4 if cloud provider fails
- Cost tracking in response metadata
- Streaming support (non-blocking for cloud)
# Home Assistant
HOME_ASSISTANT_TOKEN=your-long-lived-token
# Bambu Labs (via Settings page or .env)
# Configure through http://localhost:4173/?view=settings
# Database
DATABASE_URL=postgresql://kitty:changeme@postgres:5432/kitty
REDIS_URL=redis://127.0.0.1:6379/0

Local STT not available:
# Check Whisper model exists
ls ~/.cache/whisper/ggml-base.en.bin
# Download if missing
pip install whispercpp
# Model downloads automatically on first use

Kokoro TTS not available:
# Check Kokoro model files exist
ls ~/.local/share/kitty/models/kokoro-v1.0.onnx
ls ~/.local/share/kitty/models/voices-v1.0.bin
ls ~/.local/share/kitty/models/voices/am_michael.bin
# Copy from HowdyTTS if missing
cp /Users/Shared/Coding/HowdyTTS/models/kokoro-v1.0.onnx ~/.local/share/kitty/models/
cp /Users/Shared/Coding/HowdyTTS/models/voices-v1.0.bin ~/.local/share/kitty/models/
mkdir -p ~/.local/share/kitty/models/voices
cp /Users/Shared/Coding/HowdyTTS/models/voices/*.bin ~/.local/share/kitty/models/voices/
# Install dependencies
cd services/voice && pip install -e ".[local]"

Piper TTS not available (fallback):
# Check Piper models exist
ls /Users/Shared/Coding/models/Piper/*.onnx
# Download from: https://github.com/rhasspy/piper/releases
# Copy en_US-amy-medium.onnx and en_US-ryan-medium.onnx

Wake word not working:
# Check Porcupine model exists
ls ~/.local/share/kitty/models/Hey-Kitty_en_mac_v4_0_0.ppn
# Verify PORCUPINE_ACCESS_KEY is set in .env
grep PORCUPINE_ACCESS_KEY .env
# Get free access key from: https://console.picovoice.ai
# Ensure pyaudio is installed
pip install pyaudio

Check voice status:
curl http://localhost:8080/api/voice/status | jq .
# Should show:
# stt.local_available: true
# tts.local_available: true
# tts.active_provider: "kokoro" (or "piper" fallback)
# capabilities.wake_word: true (if configured)

# Check Docker
docker ps
# View service logs
docker compose -f infra/compose/docker-compose.yml logs brain
tail -f .logs/llamacpp-q4.log
# Restart specific service
docker compose -f infra/compose/docker-compose.yml restart gateway

# Rebuild and restart UI
cd services/ui
npm run build
npm run preview
# Check gateway proxy
curl http://localhost:8080/api/voice/status

| Phase | Status | Description |
|---|---|---|
| Phase 1: Core Foundation | ✅ Complete | Docker, llama.cpp, FastAPI, Home Assistant |
| Phase 2: Tool-Aware Agent | ✅ Complete | ReAct agent, MCP protocol, CAD generation |
| Phase 3: Autonomous Learning | ✅ Complete | Goal identification, research pipeline |
| Phase 3.5: Research Pipeline | ✅ Complete | 5-phase autonomous research, multi-model |
| Phase 4: Fabrication Intelligence | 🚧 90% | Voice service, dashboards, ML in progress |
| Phase 5: Safety & Access | 📋 Planned | UniFi Access, zone presence |
- Voice Service: Local Whisper STT + Kokoro/Piper TTS with cloud fallback
- Wake Word Detection: Hands-free activation with Porcupine ("Hey Kitty")
- Kokoro TTS: High-quality Apple Silicon-optimized local speech synthesis
- Menu Landing Page: Card-based navigation to all sections
- Bambu Labs Integration: Login/status via Settings page
- Gateway Voice Proxy: Full voice API proxying through gateway
- Markdown Support: UI renders formatted responses, TTS speaks clean text
- Collective Intelligence: Multi-agent deliberation for better decisions
- Print Intelligence: Success prediction and recommendations dashboard
| Metric | Value |
|---|---|
| Services | 12 FastAPI microservices |
| Docker Containers | 20+ (including infrastructure) |
| Local AI Models | 6 (GPT-OSS 120B, Q4, F16, Vision, Summary, Coder) |
| Voice Models | 5 (Whisper, Kokoro ONNX, Piper amy/ryan, Porcupine wake word) |
| Cloud Providers | 4 (OpenAI, Anthropic, Perplexity, Google) |
| UI Pages | 16 (Menu, Voice, Shell, Projects, etc.) |
| Supported Printers | Bambu Labs, Elegoo (Klipper), Snapmaker, OctoPrint |
| Lines of Python | 55,000+ |
| Lines of TypeScript | 18,000+ |
We welcome contributions! See CONTRIBUTING.md for guidelines.
# Development setup
pip install -e ".[dev]"
pre-commit install
# Run tests
pytest tests/ -v
# Linting
ruff check services/ --fix
ruff format services/

MIT License - see LICENSE for details.
KITTY stands on the shoulders of giants:
- llama.cpp: Local LLM inference
- Whisper.cpp: Local speech recognition
- Kokoro: High-quality local TTS (ONNX)
- Piper: Fast local text-to-speech
- Picovoice Porcupine: Wake word detection
- FastAPI: Python web framework
- Home Assistant: Smart home integration
- Zoo: Parametric CAD API
- Qdrant: Vector database
Built with care for makers, by makers
KITTY: Because your workshop deserves an AI assistant that actually understands "turn that thing on over there"
🚀 Powered by Mac Studio M3 Ultra, llama.cpp, and a whole lot of caffeine