The infrastructure backbone for privacy-first AI interpretation.
kanoa-mlops provides the local compute layer for the kanoa library — enabling you to interpret data science outputs (plots, tables, models) using state-of-the-art vision-language models, all running on your own hardware.
- Privacy First — Your data never leaves your machine
- Multiple Backends — Choose Ollama (easy), vLLM (fast), or cloud GPU (scalable)
- Full Observability — Prometheus + Grafana + NVIDIA DCGM monitoring stack
- Seamless Integration — Extends the `kanoa` CLI with `serve` and `stop` commands
# Base install (local inference with Ollama/vLLM)
pip install kanoa-mlops
# With GCP support (for cloud deployment)
pip install kanoa-mlops[gcp]
# Everything (GCP + dev tools)
pip install kanoa-mlops[all]

# Clone and install in editable mode
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops
# Base environment (no GCP tools)
conda env create -f environment.yml
conda activate kanoa-mlops
# Or with GCP infrastructure tools (terraform, gcloud)
conda env create -f environment-gcp.yml
conda activate kanoa-mlops-gcp

For users adding local AI to a science project:
# Install the package
pip install kanoa-mlops
# Initialize docker templates in your project
kanoa mlops init --dir .
# Start Ollama
kanoa mlops serve ollama
# Done! Your project now has local AI

For contributors or those wanting the full monitoring stack:
# Clone the repo
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops
# Start Ollama (pulls model on first run)
kanoa mlops serve ollama
# Start monitoring (optional)
kanoa mlops serve monitoring

For maximum throughput on NVIDIA GPUs:
# Download model (~14GB)
huggingface-cli download allenai/Molmo-7B-D-0924
# Start vLLM server
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
# Verify
curl http://localhost:8000/health

- x86/CUDA GPUs (RTX 4090, 5080, etc.): Use vLLM for best performance
- Jetson Thor/ARM64: Use Ollama for Scout (vLLM Thor image lacks bitsandbytes quantization)
- Easy Setup: Ollama handles quantization automatically (65GB Q4 vs 203GB full)
- Production: vLLM offers more control over inference parameters
For the Olmo 3 32B Think model (requires significant GPU memory):
# Download model (~32GB) - requires Hugging Face authentication for gated models
huggingface-cli download allenai/Olmo-3-32B-Think
# Start vLLM server (optimized for Jetson Thor with 128GB Unified Memory)
make serve-olmo3-32b
# Verify
curl http://localhost:8000/health

| Backend | Best For | Hardware | Throughput | Setup |
|---|---|---|---|---|
| Ollama | Getting started, CPU/Apple Silicon | Any | ~15 tok/s | kanoa mlops serve ollama |
| vLLM | Production, maximum speed | NVIDIA GPU | ~31 tok/s | Docker Compose |
| GCP L4 | No local GPU, team sharing | Cloud | ~25 tok/s | Terraform |
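Not sure which backend is actually up? A minimal sketch (independent of the kanoa CLI) that probes the default local ports from the table above:

```python
# Probe the default local endpoints to see which backend is reachable.
# Ports follow the table above: vLLM on 8000, Ollama on 11434.
import requests

BACKENDS = {
    "vLLM": "http://localhost:8000/health",       # vLLM health check
    "Ollama": "http://localhost:11434/api/tags",  # Ollama lists installed models here
}

for name, url in BACKENDS.items():
    try:
        resp = requests.get(url, timeout=2)
        print(f"{name}: reachable (HTTP {resp.status_code})")
    except requests.RequestException:
        print(f"{name}: not running")
```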
Perfect for development, VSCode integration, and broader hardware support.
kanoa mlops serve ollama
# → Ollama running at http://localhost:11434

Supports: Gemma 3 (4B/12B), Llama 3, Mistral, and many more.
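If you want to sanity-check the server outside of kanoa, Ollama also exposes a plain REST API. A minimal sketch (the `gemma3:4b` tag is an assumption; substitute whatever `ollama list` shows):

```python
# Call Ollama's REST API directly for a one-off generation.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",  # assumed model tag; check `ollama list`
        "prompt": "Describe the trend in a scatter plot of price vs. demand.",
        "stream": False,       # return a single JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```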
Optimized inference with CUDA, batching, and an OpenAI-compatible API.
# Molmo 7B (best for vision)
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
# Gemma 3 12B (best for reasoning)
docker compose -f docker/vllm/docker-compose.gemma.yml up -d

Endpoints: http://localhost:8000/v1/chat/completions
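Because the endpoint is OpenAI-compatible, any OpenAI-style client works. A minimal sketch using `requests` against the Molmo server started above (parameters like `max_tokens` are illustrative):

```python
# Send a chat completion request to the local vLLM server.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "allenai/Molmo-7B-D-0924",
        "messages": [
            {"role": "user", "content": "Summarize the key takeaway from this table: ..."}
        ],
        "max_tokens": 256,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```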
For users without a local GPU, or for production workloads.
cd infrastructure/gcp
cp terraform.tfvars.example terraform.tfvars # Configure
terraform apply

Features: L4 GPU (~$0.70/hr), auto-shutdown, IP-restricted firewall.
Once a backend is running, kanoa automatically detects it:
from kanoa import Interpreter
# Uses local backend (Ollama or vLLM) automatically
interpreter = Interpreter(backend="local")
# Interpret your matplotlib figure
result = interpreter.interpret(fig)
print(result.text)

Or explicitly configure:
from kanoa.backends import VLLMBackend
backend = VLLMBackend(
api_base="http://localhost:8000/v1",
model="allenai/Molmo-7B-D-0924"
)

kanoa-mlops extends the kanoa CLI with infrastructure commands:
# Initialize (for pip users)
kanoa mlops init --dir . # Scaffold docker templates
# Start services
kanoa mlops serve ollama # Start Ollama
kanoa mlops serve monitoring # Start Prometheus + Grafana
kanoa mlops serve all # Start everything
# Stop services
kanoa mlops stop # Stop all services
kanoa mlops stop ollama # Stop specific service
# Status
kanoa mlops status # Show config and running services
# Restart services
kanoa mlops restart ollama # Restart Ollama

Real-time observability for your inference workloads:
kanoa mlops serve monitoring
# → Grafana: http://localhost:3000 (admin/admin)
# → Prometheus: http://localhost:9090

Dashboard Features:
| Section | Metrics |
|---|---|
| Token Odometers | Cumulative prompt/generated tokens, request counts |
| Latency | Time-to-first-token (TTFT) and time-per-output-token (TPOT) percentiles (p50, p90, p95, p99) |
| GPU Hardware | Temperature, power, utilization, memory (via NVIDIA DCGM) |
| vLLM Performance | KV cache usage, request queue, throughput |
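The same metrics behind the dashboards can be queried programmatically via the Prometheus HTTP API. A minimal sketch (the metric name `DCGM_FI_DEV_GPU_UTIL` is the standard DCGM exporter gauge; check Prometheus's graph UI for the exact series your stack exposes):

```python
# Query Prometheus directly for current GPU utilization reported by DCGM.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "DCGM_FI_DEV_GPU_UTIL"},
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    gpu = series["metric"].get("gpu", "?")
    timestamp, value = series["value"]
    print(f"GPU {gpu}: {value}% utilization")
```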
See monitoring/README.md for full documentation.
kanoa-mlops/
├── kanoa_mlops/ # CLI plugin (serve, stop commands)
│ └── plugin.py
├── docker/
│ ├── vllm/ # vLLM Docker Compose configs
│ └── ollama/ # Ollama Docker Compose config
├── monitoring/
│ ├── grafana/ # Dashboards and provisioning
│ └── prometheus/ # Scrape configs
├── infrastructure/
│ └── gcp/ # Terraform for cloud GPU
├── examples/ # Jupyter notebooks
├── scripts/ # Model download utilities
└── tests/integration/ # Backend integration tests
| Model | Backend | Throughput | Notes |
|---|---|---|---|
| Molmo 7B | vLLM | 31.1 tok/s | Best for vision tasks |
| Gemma 3 12B | vLLM | 10.3 tok/s | Strong text reasoning |
| Gemma 3 4B | Ollama | ~15 tok/s | Good balance |
Why vLLM is faster:
- Continuous batching for concurrent requests
- PagedAttention for efficient KV cache
- FP8 quantization support
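If you want to sanity-check these numbers on your own hardware, a rough tokens-per-second measurement against the OpenAI-compatible endpoint looks like this (model name from the Molmo example above; results vary with prompt length and sampling settings):

```python
# Rough throughput check: generated tokens divided by wall-clock time.
import time
import requests

payload = {
    "model": "allenai/Molmo-7B-D-0924",
    "messages": [{"role": "user", "content": "Explain KV caching in two paragraphs."}],
    "max_tokens": 256,
}

start = time.perf_counter()
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.perf_counter() - start

tokens = resp.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.1f} tok/s")
```

Note this includes time-to-first-token, so it will read slightly lower than steady-state decode throughput.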
| Model | Size | vLLM | Ollama | Notes |
|---|---|---|---|---|
| Molmo 7B | 14GB | [✓] | — | Best vision performance |
| Gemma 3 | 4B-27B | [✓] | [✓] | Excellent all-rounder |
| Olmo 3 32B Think | 32GB | [✓] | — | Advanced reasoning, code generation |
| LLaVa-Next | 7B-34B | [ ] | [✓] | Planned for vLLM |
Llama 3.1, Mistral, Qwen 2.5, and 100+ more.
| Platform | Status | Notes |
|---|---|---|
| NVIDIA RTX (Desktop/Laptop) | [✓] Verified | RTX 3080+ recommended |
| NVIDIA RTX (eGPU) | [✓] Verified | TB3/TB4 bandwidth sufficient |
| NVIDIA Jetson Thor | [✓] Verified | 128GB Unified Memory, Blackwell GPU |
| Apple Silicon | [✓] Ollama | M1/M2/M3 via Ollama |
| GCP L4 GPU | [✓] Verified | 24GB VRAM, ~$0.70/hr |
| Intel/AMD GPU | — | Not supported |
kanoa-mlops is a plugin for the kanoa CLI. The kanoa package provides the CLI framework, and kanoa-mlops registers additional commands (serve, stop, restart) via Python entry points.
kanoa (CLI) ──loads──► kanoa-mlops (plugin)
│ │
└── entry points ◄───────┘
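Under the hood this is standard Python entry-point discovery. A minimal sketch of the mechanism (the group name `"kanoa.plugins"` is an assumption; see kanoa-mlops's packaging metadata for the group the CLI actually loads):

```python
# Discover and load plugins registered under an entry-point group.
from importlib.metadata import entry_points

for ep in entry_points(group="kanoa.plugins"):  # assumed group name
    plugin = ep.load()  # imports the registered object, e.g. from kanoa_mlops.plugin
    print(f"Loaded plugin {ep.name!r} from {ep.value}")
```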
To develop both packages simultaneously, install both in editable mode:
# Clone both repos
git clone https://github.com/lhzn-io/kanoa.git
git clone https://github.com/lhzn-io/kanoa-mlops.git
# Create and activate environment
conda env create -f kanoa-mlops/environment.yml
conda activate kanoa-mlops
# Install BOTH packages in editable mode
pip install -e ./kanoa # Provides 'kanoa' CLI
pip install -e ./kanoa-mlops # Registers plugin commands
# Verify
kanoa --help # Should show: serve, stop, restart

Why both? The `kanoa` package provides the CLI entry point. The `kanoa-mlops` package registers its commands as plugins. Both must be installed for the full CLI to work.
If you switch conda environments or commands are missing:
pip install -e /path/to/kanoa -e /path/to/kanoa-mlops

- Docker and Docker Compose
- NVIDIA GPU + Drivers (for vLLM)
- Python 3.11+
WSL2/eGPU Users: See the Local GPU Setup Guide for platform-specific instructions.
- [✓] Ollama integration (Dec 2025)
- [✓] CLI plugin system (Dec 2025)
- [✓] NVIDIA DCGM monitoring (Dec 2025)
- [✓] NVIDIA Jetson Thor support (Dec 2025)
- PostgreSQL + pgvector for RAG
- Kubernetes / Helm charts
- NVIDIA Jetson Orin support
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Adding New Models? Check out the Model Contribution Guide.
Pro Tip: We find Claude Code to be an excellent DevOps buddy for this project. If you use AI tools, just remember our Human-in-the-Loop policy.
MIT License — see LICENSE for details.