Telemetry stack for my personal GKE cluster: Prometheus, Grafana and other bits to get useful data out of it.
I've deliberately used CoreOS' Prometheus Operator because I recognise how useful it is, but stopped short of deploying the full CoreOS kube-prometheus stack because, great concept though it is, I want to learn how this stuff hangs together.
This uses the CoreOS Prometheus Operator, tweaked slightly (namespace, resources, labels). I had to add --config-reloader-cpu=20m to fit it on my tiny cluster! Sadly there is no equivalent flag for the prometheus-config-reloader in the current Operator, but setting it for the config-reloader alone was enough to get me up and running.
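For reference, this is roughly where that flag ends up on the operator container; everything except the --config-reloader-cpu=20m argument is illustrative.

```yaml
# Fragment of the operator Deployment's container spec (kustomize patch style);
# only the extra --config-reloader-cpu arg is the point, the rest is illustrative.
- name: prometheus-operator
  args:
    - --kubelet-service=kube-system/kubelet   # illustrative existing arg
    - --config-reloader-cpu=20m               # shrink the config-reloader sidecar's CPU request
```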
The operator CRDs are installed from a different repo, as the ServiceMonitor resource needs to exist for several other pipelines to work successfully.
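For illustration, this is the sort of ServiceMonitor those pipelines create once the CRD exists; the app name, namespace, labels and port here are made up rather than taken from this repo.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # illustrative
  namespace: monitoring        # illustrative
  labels:
    release: prometheus        # must match the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app         # matches the app's Service labels
  endpoints:
    - port: metrics            # named port on the app's Service
      interval: 30s
```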
Prometheus itself is defined in ./prometheus/. I skimped on a dedicated StorageClass to save myself a few quid.
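A rough sketch of a Prometheus resource along these lines, not the exact one in ./prometheus/; the sizes are illustrative, and the volume claim just falls back to the cluster's default StorageClass since there isn't a dedicated one.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring          # illustrative
spec:
  replicas: 1
  serviceMonitorSelector: {}     # pick up all ServiceMonitors in scope
  resources:
    requests:
      memory: 400Mi              # illustrative, sized for a tiny cluster
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 10Gi        # no storageClassName: uses the cluster default
```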
When this is up and running, you should be able to `kubectl port-forward svc/prometheus-operated 9090:9090`, hit http://localhost:9090/ and see one of the Promethei.
Similarly, you should be able to `kubectl port-forward svc/alertmanager-operated 9093:9093`, hit http://localhost:9093/ and see the AlertManager.
Basic Auth has been replaced with Google's Identity-Aware Proxy across my shared Gateway. If you're looking for that config, look at commits from before October 2023.
Setup has been done manually for now; I may revisit making it declarative later (a rough sketch follows the list below). The general gist is:
- Enable IAP via the Console
- Find the Backend Service for this workload in the IAP panel and toggle IAP to On
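If I do make this declarative, the GKE Gateway route would look something like a GCPBackendPolicy. This is a sketch only, assuming the networking.gke.io/v1 API; the names, namespace and Secret are illustrative.

```yaml
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: prometheus-iap               # illustrative name
  namespace: monitoring              # illustrative namespace
spec:
  default:
    iap:
      enabled: true
      clientID: <oauth-client-id>    # from the IAP OAuth client, illustrative placeholder
      oauth2ClientSecret:
        name: iap-oauth-client       # illustrative Secret holding the OAuth client secret
  targetRef:
    group: ""
    kind: Service
    name: prometheus                 # the Service sitting behind the shared Gateway
```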
The always-firing alert is helpful for testing, e.g. by taking a copy of it and changing its receiver. I've found that editing the AlertManager config to add a new route with a much shorter repeat interval, pointing at a testing receiver, works best for me:
```yaml
- receiver: testing
  group_by: [group]
  repeat_interval: 1m
  group_interval: 1m
  matchers:
    - receiver="testing"
```

...then configuring an appropriate receiver matching that name to test with.
There is also a ./fire-test-alert.sh script which is occasionally useful; it needs a port-forward to AlertManager to work.
Grafana is installed via an operator; see ./grafana-operator/generate-manifest.sh. The raw manifest is generated locally via helm template, then CI takes care of the deployment via kustomize. The operator is deployed in namespaced mode, meaning Grafana resources (including dashboards) are only looked for in the grafana namespace.
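For illustration, this is the shape of a dashboard resource the operator would pick up; it assumes the grafana.integreatly.org/v1beta1 CRDs, and the names, labels and JSON are made up.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: example-dashboard
  namespace: grafana            # namespaced mode: only this namespace is watched
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana       # must match labels on the Grafana CR
  json: |
    {
      "title": "Example",
      "panels": []
    }
```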
- Grafana Operator
- Controller to generate ServiceMonitors for apps
- remove servicemonitors from other repos
- ensure secret part of project migration
- AlertManager access via ingress with auth
- A default alert handler for no routes
- Tests for this!
- Dashboard for it being called
- Alerts for:
  - Pods not scheduled
  - Crash Loop Backoff
  - 404
  - 5xx
  - Velero backup failed
  - [-] Certs about to expire (no longer using cert-manager)
  - High latency alert
  - OOMKill
  - PV Filling
- Slack integration
- Slack integration improvements to formatting - own app?
- Dashboard links for all alerts
- Grafana login through Google Account
- Remove the old terraformed stuff
- Grafana plugins experimentation
- Stackdriver