

start_http_server: start HTTPServer in main thread before handing off to daemon thread #102


Conversation

@rud commented Sep 27, 2016

This means that if you call start_http_server with an already-used port/addr combination, you can catch the OSError and try a different port. With this, it becomes possible to probe a range of ports to find one that is available.

The idea for this comes from https://github.com/korfuri/django-prometheus/blob/2b6eac500cc9bea402a45f04ca7b63189889785a/django_prometheus/exports.py#L77-L88, which makes it easy to let each uwsgi worker listen on its own port by automatically trying a whole range of ports and picking one that works.
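
For illustration, a minimal sketch of the retry pattern this change enables; the helper name and port range below are invented for the example and are not part of the PR:

```python
from prometheus_client import start_http_server

def start_on_first_free_port(start_port=8000, end_port=8010, addr=''):
    """Probe a range of ports and return the first one that could be bound."""
    for port in range(start_port, end_port):
        try:
            # With this PR the socket is bound in the calling thread, so a
            # port-in-use error surfaces here as OSError instead of being
            # swallowed inside the daemon thread.
            start_http_server(port, addr)
            return port
        except OSError:
            continue  # port already taken, try the next one
    raise RuntimeError('no free port found in range %d-%d' % (start_port, end_port - 1))
```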

@rud (Author) commented Sep 27, 2016

A thing worth discussing: it would be entirely possible to catch the OSError and just return a boolean to indicate whether starting the listener succeeded. Given that the previous behaviour was to silently not serve metrics for the current process (rather than halting with an exception, as this change introduces), a smaller change would be to add such a catch here, and existing setups with duplicate metrics ports would still fail silently.

Thoughts?

@brian-brazil (Contributor) commented:

There have also been requests to be able to stop the HTTP server, so it'd be best to consider that together with this change.

@rud (Author) commented Sep 27, 2016

Hi @brian-brazil,

As I see it, one option would be to return the handle of the daemon thread from start_http_server, or store it somewhere for future reference. Would that be a handy enough API?

Alternatively, the return value could be the thread handle on successful startup and None on failure. What do you think?
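
A hypothetical sketch of that return-value shape, built on Python 3's http.server and the exported MetricsHandler rather than the library's real internals (the function name is invented):

```python
import threading
from http.server import HTTPServer

from prometheus_client import MetricsHandler

def start_http_server_or_none(port, addr=''):
    """Return the daemon thread serving metrics, or None if the bind failed."""
    try:
        httpd = HTTPServer((addr, port), MetricsHandler)  # binds in the caller's thread
    except OSError:
        return None  # port/addr combination already in use
    thread = threading.Thread(target=httpd.serve_forever)
    thread.daemon = True
    thread.start()
    return thread  # a fuller API might also expose httpd so it can be shut down later
```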

@rud (Author) commented Sep 27, 2016

I see, you are probably referring to #76, but as I read that issue, the user ended up finding a different way to start/stop their listener?

@rud (Author) commented Sep 27, 2016

@brian-brazil FWIW, I've written code in my project that reuses prometheus_client.MetricsHandler to set up my own HTTP listener just the way I want it (with automatic probing of a number of ports until a free one is found).
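
Roughly what such a standalone listener might look like (a sketch, not the actual code from rud's project; the function name and port range are illustrative):

```python
import threading
from http.server import HTTPServer

from prometheus_client import MetricsHandler

def serve_metrics_on_free_port(ports=range(9100, 9110), addr=''):
    """Bind the first free port in the range and serve metrics from a daemon thread."""
    for port in ports:
        try:
            httpd = HTTPServer((addr, port), MetricsHandler)
        except OSError:
            continue  # already in use, keep probing
        thread = threading.Thread(target=httpd.serve_forever)
        thread.daemon = True
        thread.start()
        return httpd, port  # httpd.shutdown() can stop the listener later
    raise RuntimeError('no free port available for the metrics listener')
```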

Thank you for making prometheus_client.MetricsHandler reusable externally; I can only imagine how many variations people need for running their services.

Feel free to close this pull request if the design is not something you can use going forward.

@brian-brazil (Contributor) commented:

The PR as-is is a good idea; the question is more around how exceptions are handled. The Pythonic way is to just let them be thrown.

> (with automatic probing of a number of ports until a free one is found)

That smells a bit; your config management should be telling you what port number to use.

> Thank you for making prometheus_client.MetricsHandler reusable externally; I can only imagine how many variations people need for running their services.

That's the idea.

@rud (Author) commented Sep 27, 2016

I concur that the port probing is potentially problematic, but given that a uwsgi master process spawns a number of child processes at various times, it seems to make sense that they each pick a port in a known range where they make metrics available. I'm using the Telegraf daemon to connect to each of these ports and collect the available metrics. There is a simplicity to this that I like, and since, to the best of my knowledge, each uwsgi worker process should not care too much about its individual identity/number in the flock, it also makes sense that it cannot have a distinct static Prometheus listener port.

@rud (Author) commented Sep 27, 2016

So, any specific changes you'd like to see to this code at this time?

@brian-brazil (Contributor) commented:

Sounds like you should be reading #66; that's not a safe way to do multi-process.

@rud (Author) commented Sep 28, 2016

Thank you for your suggestion.

In #66 I see a way of having workers report their metrics up the process tree to the master process, but it is currently a work in progress. I do not see any discussion of safety properties that preclude individual workers from exposing their own metrics directly, but it may very well be too implicit for me to see. Is it related to the point in the startup process where workers are spawned, i.e. that some internal structure may or may not be correctly set up yet?

If this is veering off-topic I do apologise.

@brian-brazil (Contributor) commented:

The issue is that there might be state in a dead worker that you want to preserve past its demise, and that correctly collapsing per-process counters requires data from dead workers too.

@rud (Author) commented Sep 28, 2016

Agreed, that would indeed mean loss of recent state in this design. And since the uwsgi master process will all too happily kill -9 workers that are past their prime, data loss is guaranteed in that case if the collector does not sweep in soon enough.

To me that comes down to the trade-off between sending metrics somewhere external immediately while handling an incoming request (guaranteeing survivability of metrics, but adding latency), or buffering measurements within each worker process for a configurably short time and gathering the values in bulk every X seconds. Any in-memory queueing means the potential for data loss, but at much greater throughput. Adding additional communication complexity to the master/worker hierarchy, as #66 does, seems like it might be a good trade-off, but things do become more difficult to reason about.

I'll close this pull request now, as I think I have a grasp of the trade-offs I'm making, and I think you are right that the solution provided here is not generally applicable.

@rud closed this Sep 28, 2016
@rud deleted the feature/start-listener-in-main-thread-to-allow-retry branch January 14, 2024 20:07