This is a mono repository for my home infrastructure and Kubernetes cluster. It doubles as a production-grade platform engineering reference, running enterprise tooling on bare metal to validate real-world patterns for AI/ML platforms, data engineering and platform operations.
The cluster manages ~100 applications across 24 namespaces, covering everything from a full Kubeflow ML platform and real-time streaming pipelines to home automation and media services -- all deployed through GitOps with FluxCD.
There is a template at onedr0p/cluster-template if you want to follow along with some of the practices used here.
A complete machine learning platform built on Kubeflow, providing end-to-end ML lifecycle management from data annotation through to production model serving:
- kserve and knative for serverless model inference with autoscaling and scale-to-zero
- kgateway as an AI-native API gateway with LLM routing, MCP server integration and security policies
- kuberay-operator for distributed training and hyperparameter tuning
- spark-operator for large-scale data processing within ML pipelines
- katib for automated hyperparameter tuning and neural architecture search
- feast as a feature store for online/offline feature serving
- mlflow for experiment tracking and model registry
- label-studio for data annotation and dataset preparation
- agent-sandbox for sandboxed AI agent execution
Production data infrastructure for real-time and batch processing:
- kafka for event streaming and real-time data ingestion
- flink for stateful stream processing and real-time analytics
- trino as a distributed SQL query engine across heterogeneous data sources
- argo-workflows for DAG-based workflow orchestration
Dual-stack IPv4/IPv6 networking with BGP-based load balancing and Kubernetes Gateway API:
- cilium as the CNI with eBPF-based network policies, BGP peering and L2/L3 load balancing
- envoy gateway with Kubernetes Gateway API for north-south traffic management
- kgateway for AI/LLM-aware routing with MCP tool proxying
- multus for cross-VLAN pod networking
- external-dns for automated split-horizon DNS across Cloudflare and UniFi
Zero-trust security model with policy enforcement and centralised identity:
- pocket-id as the OIDC provider with passkey-based SSO (no passwords)
- external-secrets with 1Password Connect for secret injection
- cert-manager for automated TLS certificate lifecycle
Full-stack monitoring with long-term metric retention, distributed tracing and LLM observability:
- prometheus via kube-prometheus-stack for metrics collection and alerting
- thanos for highly available Prometheus with long-term object storage
- grafana for dashboarding across metrics, logs and traces
- opentelemetry collector with eBPF auto-instrumentation for distributed tracing
- clickhouse for high-performance trace and log storage
- langfuse for LLM observability, prompt management and evaluation
- victoria-logs and fluent-bit for log aggregation
- gatus for endpoint health monitoring and status pages
- blackbox-exporter, smartctl-exporter and unpoller for infrastructure probing
- silence-operator and kromgo for alert management and badge generation
Distributed and local storage with operator-managed databases:
- rook-ceph for distributed block, object and filesystem storage
- cloudnative-pg for production PostgreSQL with automated backups and failover
- dragonfly as a high-performance Redis-compatible in-memory store
- garage for S3-compatible distributed object storage (backups, Thanos, CNPG WAL archival)
- mariadb operator for MySQL-compatible workloads
- influxdb for time-series data and IoT metrics
- vernemq as an MQTT broker for IoT device communication
- openebs for local PV provisioning
- volsync and kopia for encrypted backup orchestration
On-demand GPU/CPU capacity via workload offloading to GKE:
- liqo for transparent multi-cluster workload offloading over WireGuard
- crossplane for declarative GKE cluster provisioning as Kubernetes CRDs
- One-way offloading from home cluster to GKE for GPU workloads (L4, A100)
- Autoscaling node pools with scale-to-zero when idle (spot instances for cost efficiency)
- No Flux/ESO on GKE -- entire remote cluster managed via Crossplane from home
Declarative cluster management with dependency-aware deployments:
- flux for Git-based state reconciliation with drift detection and self-healing
- renovate for automated dependency updates across the entire repository
- actions-runner-controller for self-hosted CI/CD runners
- keda for event-driven autoscaling
- spegel for peer-to-peer OCI image distribution
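As a rough illustration of what drift detection and self-healing mean in practice, the sketch below compares a desired state (what Git declares) with an observed state and lists the corrections a reconciler would make. It is a toy model, not Flux's implementation: the resource names and the `reconcile` function are hypothetical, and removing undeclared resources assumes pruning is enabled.

```python
def reconcile(desired, actual):
    """Run one reconciliation pass.

    Returns the state after the pass and the list of corrective actions
    that were needed to get there.
    """
    actions = []
    # Anything declared but missing or different is drift to correct.
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    # Anything running but no longer declared is removed (assumes pruning).
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return dict(desired), actions


# Hypothetical example: "b" was edited by hand and "c" was removed from Git.
desired = {"a": {"replicas": 1}, "b": {"replicas": 2}}
actual = {"a": {"replicas": 1}, "b": {"replicas": 5}, "c": {"replicas": 1}}

state, actions = reconcile(desired, actual)
print(actions)
```

Running a second pass against the returned state produces no actions, which is the steady state a GitOps controller converges to.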
📁 kubernetes
├── 📁 apps # applications across 24 namespaces
├── 📁 components # reusable Kustomize components (volsync, alerts, nfs-scaler)
└── 📁 flux # Flux system configuration
📁 talos # Talos Linux node configuration (Jinja2 templates)
📁 bootstrap # cluster bootstrapping resources

Applications deploy in dependency order based on infrastructure requirements, preventing race conditions.
graph TD
A>Kustomization: rook-ceph] -->|Creates| B[HelmRelease: rook-ceph]
A>Kustomization: rook-ceph] -->|Creates| C[HelmRelease: rook-ceph-cluster]
C>HelmRelease: rook-ceph-cluster] -->|Depends on| B>HelmRelease: rook-ceph]
D>Kustomization: atuin] -->|Creates| E(HelmRelease: atuin)
E>HelmRelease: atuin] -->|Depends on| C>HelmRelease: rook-ceph-cluster]
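The "Depends on" edges in the diagram above amount to a topological ordering. The sketch below reproduces that ordering with Python's standard `graphlib` module. It is only an illustration of the idea: Flux resolves ordering itself from the `dependsOn` fields on its resources, and the `deploy_order` helper here is hypothetical.

```python
from graphlib import TopologicalSorter

# Edges taken from the diagram: each key depends on the releases in its set.
depends_on = {
    "rook-ceph": set(),
    "rook-ceph-cluster": {"rook-ceph"},
    "atuin": {"rook-ceph-cluster"},
}


def deploy_order(graph):
    """Return the releases in an order where dependencies always come first."""
    return list(TopologicalSorter(graph).static_order())


print(deploy_order(depends_on))
# ['rook-ceph', 'rook-ceph-cluster', 'atuin']
```

Because the three releases form a single chain, there is only one valid order. With independent branches, several orders would be equally valid, and a cycle would raise `graphlib.CycleError`.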
The setup maximises self-hosted infrastructure whilst using cloud services where appropriate.
| Service | Use | Cost (AUD) |
|---|---|---|
| 1Password | Secrets with External Secrets | ~$50/yr |
| Cloudflare | Domains and S3 | ~$30/yr |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Pushover | Kubernetes alerts and application notifications | $5 one-time |
| healthchecks.io | Monitoring internet connectivity and external-facing applications | Free |
| Total | | ~$7/mo |
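The monthly total can be checked from the recurring rows. This sketch assumes the Pushover fee is left out because it is a one-off purchase; the `monthly_total` helper is only for illustration.

```python
# Recurring yearly costs in AUD, taken from the table above.
yearly_costs_aud = {"1Password": 50, "Cloudflare": 30}


def monthly_total(yearly_costs):
    """Spread the summed yearly costs evenly over twelve months."""
    return sum(yearly_costs.values()) / 12


print(round(monthly_total(yearly_costs_aud), 2))  # 6.67
```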
The cluster implements automated split-horizon DNS across multiple zones:
- Internal zone management via UniFi controller integration using webhook providers
- Public DNS automation with Cloudflare API integration
- Dynamic DNS updates for public IP tracking via cloudflare-ddns
- Traffic segmentation through gateway-based routing (envoy-internal/envoy-external)
- Zero-touch operations with automatic record lifecycle management
This pattern enables secure service exposure whilst maintaining internal network isolation.
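To make the split-horizon idea concrete, here is a minimal sketch in which the same hostname resolves differently depending on where the client sits. Everything in it is assumed for illustration: the LAN range, the hostnames, the addresses (documentation ranges), and the `resolve` function. In the cluster the internal and public zones are populated by external-dns rather than by hand.

```python
import ipaddress

# Hypothetical LAN range and records, purely for illustration.
INTERNAL_NET = ipaddress.ip_network("192.168.0.0/16")
INTERNAL_ZONE = {
    "app.example.com": "192.168.1.10",  # exposed on both sides
    "nas.example.com": "192.168.1.20",  # internal only
}
PUBLIC_ZONE = {
    "app.example.com": "203.0.113.10",
}


def resolve(hostname, client_ip):
    """Answer from the internal zone for LAN clients, else the public zone."""
    is_internal = ipaddress.ip_address(client_ip) in INTERNAL_NET
    zone = INTERNAL_ZONE if is_internal else PUBLIC_ZONE
    return zone.get(hostname)  # None when the name is not published there


print(resolve("app.example.com", "192.168.1.50"))
print(resolve("app.example.com", "198.51.100.7"))
print(resolve("nas.example.com", "198.51.100.7"))
```

The last lookup returns `None`: an internal-only service has no public record, which is how the isolation described above is kept.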
| Device | OS Disk | Data Disk | Memory | OS | Function |
|---|---|---|---|---|---|
| Dell Optiplex 7050 | Samsung PM863 960GB | Micron 7450 Pro 960GB | 32GB | Talos | Kubernetes |
| Dell Optiplex 7060 | Samsung PM863 960GB | Micron 7450 Pro 960GB | 32GB | Talos | Kubernetes |
| Dell Optiplex 7060 | Samsung PM863 960GB | Micron 7450 Pro 960GB | 32GB | Talos | Kubernetes |
| NAS (Repurposed PC) | 512GB | 1x12TB ZFS | 16GB | TrueNAS SCALE | NFS + Backup Server |
| UniFi UCG Ultra | - | - | - | - | Router |
Thanks to all the people who donate their time to the Home Operations Discord community. Be sure to check out kubesearch.dev for ideas on what you could deploy and how to deploy it.