
Kubernetes with Flux at home


solanyn/home-ops


🚀 My Home Operations Repository 🚧

... managed with Flux, Renovate, and GitHub Actions 🤖

Talos   Kubernetes   Flux   Renovate

Home-Internet   Status-Page   Alertmanager

Age-Days   Uptime-Days   Node-Count   Pod-Count   CPU-Usage   Memory-Usage   Alerts


💡 Overview

This is a mono repository for my home infrastructure and Kubernetes cluster. It doubles as a production-grade platform engineering reference, running enterprise tooling on bare metal to validate real-world patterns for AI/ML platforms, data engineering and platform operations.

The cluster manages ~100 applications across 24 namespaces, covering everything from a full Kubeflow ML platform and real-time streaming pipelines to home automation and media services -- all deployed through GitOps with FluxCD.

There is a template at onedr0p/cluster-template if you want to follow along with some of the practices used here.


🌱 Platform Capabilities

AI/ML Platform

A complete machine learning platform built on Kubeflow, providing end-to-end ML lifecycle management from data annotation through to production model serving:

  • kserve and knative for serverless model inference with autoscaling and scale-to-zero
  • kgateway as an AI-native API gateway with LLM routing, MCP server integration and security policies
  • kuberay-operator for distributed training and hyperparameter tuning
  • spark-operator for large-scale data processing within ML pipelines
  • katib for automated hyperparameter tuning and neural architecture search
  • feast as a feature store for online/offline feature serving
  • mlflow for experiment tracking and model registry
  • label-studio for data annotation and dataset preparation
  • agent-sandbox for sandboxed AI agent execution
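
As a rough illustration of the serverless inference item above, a KServe `InferenceService` that is allowed to scale to zero might look like the sketch below. The resource name, namespace, model format and storage URI are all hypothetical and not taken from this repository; only the `serving.kserve.io/v1beta1` fields themselves are standard KServe.

```yaml
# Minimal sketch, assuming KServe runs in its Knative-backed serverless mode.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris            # hypothetical model name
  namespace: ml                 # hypothetical namespace
spec:
  predictor:
    minReplicas: 0              # permit scale-to-zero when the model is idle
    maxReplicas: 3              # upper bound for autoscaling under load
    model:
      modelFormat:
        name: sklearn           # hypothetical model format
      storageUri: s3://models/sklearn-iris   # hypothetical bucket and path
```

With `minReplicas: 0` the first request after an idle period pays a cold-start cost, which is usually an acceptable trade on a small home cluster.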

Data Engineering & Streaming

Production data infrastructure for real-time and batch processing:

  • kafka for event streaming and real-time data ingestion
  • flink for stateful stream processing and real-time analytics
  • trino as a distributed SQL query engine across heterogeneous data sources
  • argo-workflows for DAG-based workflow orchestration

Networking & Service Mesh

Dual-stack IPv4/IPv6 networking with BGP-based load balancing and Kubernetes Gateway API:

  • cilium as the CNI with eBPF-based network policies, BGP peering and L2/L3 load balancing
  • envoy gateway with Kubernetes Gateway API for north-south traffic management
  • kgateway for AI/LLM-aware routing with MCP tool proxying
  • multus for cross-VLAN pod networking
  • external-dns for automated split-horizon DNS across Cloudflare and UniFi

Security & Identity

Zero-trust security model with policy enforcement and centralised identity:

Observability

Full-stack monitoring with long-term metric retention, distributed tracing and LLM observability:

Storage & Databases

Distributed and local storage with operator-managed databases:

  • rook-ceph for distributed block, object and filesystem storage
  • cloudnative-pg for production PostgreSQL with automated backups and failover
  • dragonfly as a high-performance Redis-compatible in-memory store
  • garage for S3-compatible distributed object storage (backups, Thanos, CNPG WAL archival)
  • mariadb operator for MySQL-compatible workloads
  • influxdb for time-series data and IoT metrics
  • vernemq as an MQTT broker for IoT device communication
  • openebs for local PV provisioning
  • volsync and kopia for encrypted backup orchestration
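
To make the operator-managed database item more concrete, here is a minimal sketch of a CloudNativePG `Cluster`. The name, namespace, size and storage class are assumptions for illustration; backup configuration is deliberately left out because the repository's actual settings are not shown here.

```yaml
# Minimal sketch of a three-instance PostgreSQL cluster.
# Running more than one instance is what gives the operator a replica to promote on failure.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres                # hypothetical cluster name
  namespace: database           # hypothetical namespace
spec:
  instances: 3                  # one primary plus two replicas
  storage:
    size: 20Gi                  # hypothetical volume size
    storageClass: openebs-hostpath   # assumed local PV storage class name
```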

Multi-Cluster & Cloud Burst

On-demand GPU/CPU capacity via workload offloading to GKE:

  • liqo for transparent multi-cluster workload offloading over WireGuard
  • crossplane for declarative GKE cluster provisioning as Kubernetes CRDs
  • One-way offloading from home cluster to GKE for GPU workloads (L4, A100)
  • Autoscaling node pools with scale-to-zero when idle (spot instances for cost efficiency)
  • No Flux/ESO on GKE -- entire remote cluster managed via Crossplane from home

Infrastructure Provisioning & GitOps

Declarative cluster management with dependency-aware deployments:

  • flux for Git-based state reconciliation with drift detection and self-healing
  • renovate for automated dependency updates across the entire repository
  • actions-runner-controller for self-hosted CI/CD runners
  • keda for event-driven autoscaling
  • spegel for peer-to-peer OCI image distribution
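
Event-driven autoscaling with KEDA is typically expressed as a `ScaledObject`. The sketch below scales a consumer on Kafka lag; the Deployment name, namespace, broker address, topic and threshold are hypothetical values chosen for illustration, not settings from this cluster.

```yaml
# Minimal sketch: scale a Kafka consumer Deployment between 0 and 5 replicas by consumer lag.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: events-consumer         # hypothetical
  namespace: data               # hypothetical
spec:
  scaleTargetRef:
    name: events-consumer       # Deployment to scale (hypothetical)
  minReplicaCount: 0            # scale to zero when there is no lag
  maxReplicaCount: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-bootstrap.data.svc:9092   # hypothetical broker address
        consumerGroup: events-consumer                    # hypothetical consumer group
        topic: events                                     # hypothetical topic
        lagThreshold: "50"      # target lag per replica
```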

🗂 Repository Structure

📁 kubernetes
├── 📁 apps           # applications across 24 namespaces
├── 📁 components     # reusable Kustomize components (volsync, alerts, nfs-scaler)
└── 📁 flux           # Flux system configuration
📁 talos              # Talos Linux node configuration (Jinja2 templates)
📁 bootstrap          # cluster bootstrapping resources
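
One way the reusable components in the tree above could be consumed is through plain Kustomize `components`. The sketch below shows two separate files in one listing; the file names, the resource inside the component and the relative path depth are assumptions, since the exact app layout under `kubernetes/apps` is not spelled out here.

```yaml
# File 1 (assumed path): kubernetes/components/volsync/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  - ./replicationsource.yaml    # hypothetical resource shipped by the component
---
# File 2 (assumed path): kubernetes/apps/<namespace>/<app>/app/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./helmrelease.yaml          # hypothetical app manifest
components:
  - ../../../../components/volsync   # pulls in the shared backup resources
```

Keeping backup and alerting logic in components means an app opts in with one line instead of copying manifests.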

Dependency Management

Applications deploy in dependency order based on infrastructure requirements, preventing race conditions.

graph TD
    A>Kustomization: rook-ceph] -->|Creates| B[HelmRelease: rook-ceph]
    A -->|Creates| C[HelmRelease: rook-ceph-cluster]
    C -->|Depends on| B
    D>Kustomization: atuin] -->|Creates| E[HelmRelease: atuin]
    E -->|Depends on| C
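
The "Depends on" edges in the graph map onto the `dependsOn` field of a Flux `HelmRelease`. The following is a minimal sketch for the atuin release, not the manifest actually used in this repository: the namespaces, interval and chart source are assumptions.

```yaml
# Minimal sketch: Flux will not install or upgrade atuin until rook-ceph-cluster is ready.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: atuin
  namespace: default            # assumed namespace
spec:
  interval: 1h                  # assumed reconciliation interval
  chartRef:
    kind: OCIRepository
    name: atuin                 # hypothetical chart source in the same namespace
  dependsOn:
    - name: rook-ceph-cluster
      namespace: rook-ceph      # assumed namespace of the storage release
```

Because the storage release is ready before the app is reconciled, the app's volume claims are not created against a storage class that does not exist yet.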

😢 Hybrid Cloud Strategy

The setup maximises self-hosted infrastructure whilst using cloud services where appropriate.

| Service | Use | Cost (AUD) |
|---|---|---|
| 1Password | Secrets with External Secrets | ~$50/yr |
| Cloudflare | Domains and S3 | ~$30/yr |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Pushover | Kubernetes alerts and application notifications | $5 one-time |
| healthchecks.io | Monitoring internet connectivity and external-facing applications | Free |
| | | Total: ~$7/mo |

🌎 DNS Architecture

The cluster implements automated split-horizon DNS across multiple zones:

  • Internal zone management via UniFi controller integration using webhook providers
  • Public DNS automation with Cloudflare API integration
  • Dynamic DNS updates for public IP tracking via cloudflare-ddns
  • Traffic segmentation through gateway-based routing (envoy-internal/envoy-external)
  • Zero-touch operations with automatic record lifecycle management

This pattern enables secure service exposure whilst maintaining internal network isolation.
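
As an illustration of the gateway-based segmentation described above, an application is attached to either the internal or the external gateway through its `HTTPRoute`. In this sketch the hostname, gateway namespace, listener name and backend port are assumptions; only the gateway name `envoy-internal` comes from the list above. How each external-dns instance selects routes is configured separately and is not shown.

```yaml
# Minimal sketch: expose atuin on the internal gateway only.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: atuin
  namespace: default            # assumed namespace
spec:
  hostnames:
    - atuin.example.com         # hypothetical hostname
  parentRefs:
    - name: envoy-internal      # swap for envoy-external to publish publicly
      namespace: network        # assumed gateway namespace
      sectionName: https        # assumed listener name
  rules:
    - backendRefs:
        - name: atuin           # Service name (assumed)
          port: 8888            # assumed service port
```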


⚙ Hardware

| Device | OS Disk | Data Disk | Memory | OS | Function |
|---|---|---|---|---|---|
| Dell Optiplex 7050 | Samsung PM863 960GB | Micron 7450 Pro 960GB | 32GB | Talos | Kubernetes |
| Dell Optiplex 7060 | Samsung PM863 960GB | Micron 7450 Pro 960GB | 32GB | Talos | Kubernetes |
| Dell Optiplex 7060 | Samsung PM863 960GB | Micron 7450 Pro 960GB | 32GB | Talos | Kubernetes |
| NAS (Repurposed PC) | 512GB | 1x12TB ZFS | 16GB | TrueNAS SCALE | NFS + Backup Server |
| UniFi UCG Ultra | - | - | - | - | Router |

🙏 Gratitude and Thanks

Thanks to all the people who donate their time to the Home Operations Discord community. Be sure to check out kubesearch.dev for ideas on what to deploy and how to deploy it.
