Expose healthcheck data in Prometheus metrics #10678

Closed
Tracked by #8971
johnstcn opened this issue Nov 14, 2023 · 5 comments
@johnstcn
Member

As an operator, I would like to be able to view the [health data from Coder](#8971) in Prometheus.

Something like the below could be useful:

coderd_health_access_url { healthy: [1|0], reachable: [1|0], status_code: [status_code], response_len: [response_len] }
coderd_health_database { healthy: [1|0], reachable: [1|0], latency_ms: [latency_ms], threshold_ms: [threshold_ms] }
coderd_health_derp { region_id: [region_id], healthy: [1|0], round_trip_ping_ms: [round_trip_ping_ms], uses_websocket: [1|0], stun_enabled: [1|0], ... }
coderd_health_websocket { healthy: [1|0], response_len: [response_len], code: [code] }

This will allow me to answer questions such as:

  • At what periods does Coder notice the worst database latency?
  • Does Coder's access URL become unreachable at specific times?
  • Do any DERP regions report errors at specific times?
  • Do websocket requests fail at specific times?
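
For illustration, the proposal above could be rendered in the Prometheus text exposition format roughly as follows. This is only a sketch: the shape of the report dictionary, the helper names `render_gauge` and `render_health`, and the split of numeric fields into separate `_healthy` / `_latency_ms` / `_round_trip_ping_ms` gauges are assumptions made for the example, not Coder's implementation. Numeric readings such as latency are emitted as sample values rather than labels, because a label value that changes on every check would create a new time series each time.

```python
from typing import Mapping, Optional


def _escape(value: str) -> str:
    # Escape backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, value: float,
                 labels: Optional[Mapping[str, str]] = None) -> str:
    """Format one sample line in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    # Booleans become 1.0 / 0.0; other numbers are passed through as floats.
    return f"{name}{label_str} {float(value)}"


def render_health(report: dict) -> str:
    """Render a (hypothetical) health report dict as gauge samples."""
    lines = []
    db = report["database"]
    lines.append(render_gauge("coderd_health_database_healthy", db["healthy"]))
    lines.append(render_gauge("coderd_health_database_latency_ms", db["latency_ms"]))
    for region in report["derp"]["regions"]:
        # region_id is a stable identifier, so it is safe to use as a label.
        labels = {"region_id": str(region["region_id"])}
        lines.append(render_gauge("coderd_health_derp_healthy",
                                  region["healthy"], labels))
        lines.append(render_gauge("coderd_health_derp_round_trip_ping_ms",
                                  region["round_trip_ping_ms"], labels))
    return "\n".join(lines) + "\n"
```

With gauges shaped like this, a question such as "when is database latency worst?" becomes a query over `coderd_health_database_latency_ms` across time.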
@f0ssel
Contributor

f0ssel commented Nov 28, 2023

Okay so I've taken a look at this and found a few interesting points:

  1. Right now the Prometheus metrics would be invalid until someone hits the healthcheck endpoint, and would only be refreshed when someone hits it again. If we want the data to be ready at scrape time, we need a background process running the healthcheck instead of running it on demand from the request handler.
  2. If we make it a background job, we need to generate a valid API key that isn't tied to a user, since the websocket report currently hijacks the requesting user's API key. I'm not sure how to do this currently, but we'd need some sort of system-level actor.
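
To make the first point concrete, here is a minimal sketch of the "background process" idea: a thread re-runs the check on an interval and caches the result, so a metrics handler can read the latest report without triggering a check itself. The `HealthCache` class, its method names, and the `check` callable are hypothetical and are not taken from the Coder codebase or from PR #10921. The sketch also does not address the second point, since the check is assumed to obtain whatever credentials it needs on its own.

```python
import threading
import time
from typing import Callable, Optional


class HealthCache:
    """Periodically runs a health check and caches the latest report."""

    def __init__(self, check: Callable[[], dict], interval_s: float) -> None:
        self._check = check
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._latest: Optional[dict] = None
        self._refreshed_at: Optional[float] = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def latest(self) -> Optional[dict]:
        """Return the most recent report, or None before the first check."""
        with self._lock:
            return self._latest

    def age_s(self) -> Optional[float]:
        """Seconds since the last successful refresh, or None if never run."""
        with self._lock:
            if self._refreshed_at is None:
                return None
            return time.time() - self._refreshed_at

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                report = self._check()
            except Exception:
                # Keep serving the previous report if a check fails.
                report = None
            if report is not None:
                with self._lock:
                    self._latest = report
                    self._refreshed_at = time.time()
            # Wait for the interval, but wake immediately on stop().
            self._stop.wait(self._interval_s)
```

Exposing `age_s()` lets a scrape distinguish a fresh report from a stale one, which is the trade-off this design introduces compared with running the check on demand.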

Given the refactor and the API key issue, I want to ask whether this still seems like a positive ROI. I see the value in the metrics, but given this was supposed to be a "quick and easy" one, I want to make sure we want to make the changes required to get this data.

CC @johnstcn @sreya

@f0ssel
Contributor

f0ssel commented Nov 28, 2023

Work in progress PR here: #10921

@f0ssel
Contributor

f0ssel commented Nov 29, 2023

So @sreya and I synced on this. Given the complexity it adds to the codebase versus the value returned, we don't think it's a good idea to do this right now. The background job work changes the original feature's functionality too much to justify, and we are worried it will cause more bugs down the road.

@johnstcn given that reasoning, would you be okay if we closed this and we can reopen the PR if we see a bigger need in the future?

@johnstcn
Member Author

@f0ssel sure, this was just something I thought could be useful, but nobody has asked for it yet to the best of my knowledge.

@johnstcn closed this as not planned Nov 29, 2023

@strike
Contributor

strike commented Aug 9, 2024

At the very least, we need a mechanism (or something similar) that can report the status of various components in Coder.
We developed our own Prometheus exporter for this purpose.
