Description
Problem
Customers need to monitor for unexpected workspace disconnects to set up proactive alerts for service degradation. Current Prometheus metrics don't distinguish between:
- Expected disconnects: User intentionally closes session/stops workspace
- Unexpected disconnects: Network issues, agent crashes, infrastructure problems
The existing coderd_agents_connections metric only shows agent-to-coderd connection status, not client-to-agent disconnects that users actually experience.
Current Limitations
- coderd_agents_connections{status="disconnected"} includes both graceful and ungraceful disconnects
- coderd_agents_connections{status="timeout"} only covers connection establishment timeouts
- No metrics track client-to-agent session disconnects (SSH, VS Code, etc.)
Proposed Solution
Add a new Prometheus metric that leverages the coordinator's existing "graceful disconnect" concept:
coderd_agent_client_disconnects_total{type="graceful|ungraceful", agent_name, username, workspace_name}
This would enable customers to:
- Alert specifically on ungraceful disconnects: rate(coderd_agent_client_disconnects_total{type="ungraceful"}[5m]) > threshold
- Monitor service health without noise from normal user behavior
- Distinguish infrastructure issues from user-initiated actions
Why This Matters
Customers deploying Coder at scale need reliable alerting for actual service degradation. Current metrics generate false positives from normal workspace stops, making it difficult to detect real issues that impact user productivity.
Reference: Customer ticket #3917