Add STOPPING state #44
Merged
Heroku's docs say that when a dyno is shut down, its processes are sent `SIGTERM`, followed 30 seconds later by `SIGKILL` if they are still running.
`django-db-queue` currently handles the `SIGTERM` signal by setting an internal flag on the worker class, `alive`, to `False`. When the currently running job finishes, the worker then stops looping and exits gracefully.

However, if a worker is processing a job that takes longer than 30 seconds, then the process never gets a chance to exit gracefully before `SIGKILL` is received. The worker will be forcefully killed and the job will remain in the `PROCESSING` state forever.
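For context, the existing shutdown handling boils down to something like this (a simplified sketch, not the real worker code - only the `alive` flag mirrors the library; the class layout and method names are illustrative):

```python
import signal


class Worker:
    """Simplified sketch of the current shutdown behaviour."""

    def __init__(self, queue_name):
        self.queue_name = queue_name
        self.alive = True
        # SIGTERM just flips a flag; nothing is recorded on the job itself.
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        self.alive = False

    def run(self):
        while self.alive:
            # The flag is only checked between jobs, so if this call takes
            # longer than 30 seconds after SIGTERM, SIGKILL arrives mid-job
            # and the job is left in PROCESSING forever.
            self._process_next_job()

    def _process_next_job(self):
        ...  # placeholder for claiming and running the next job
```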
Under normal circumstances this doesn't cause too much of an issue (other than being a bit weird). These `PROCESSING` jobs just sit there, never being cleaned up (because `delete_old_jobs` doesn't delete them) but also never being processed.

However, recently we've started doing some queries on the state of the jobs table before creating a new job - eg "is there already a job of this type in the queue? If so, don't bother creating another one". To do this, we've naively checked for jobs in the states `NEW` or `PROCESSING`. The problem is, if one of these zombie `PROCESSING` jobs exists, the code thinks that one is always already in the queue - so the queue grinds to a halt! We've worked around this with more complex queries (like "is there a job in state `NEW` or a job in state `PROCESSING` that's newer than two hours old") but they're a bit of a hack.
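The duplicate check and the workaround look roughly like this (a sketch using the Django ORM; the job name is made up, and the `created` timestamp field is an assumption about the `Job` model):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job

# Naive check: a zombie PROCESSING job makes this True forever, so a new
# "send_report" job (illustrative name) never gets created.
already_queued = Job.objects.filter(
    name="send_report",
    state__in=["NEW", "PROCESSING"],
).exists()

# Current workaround: treat any PROCESSING job older than two hours as a zombie.
two_hours_ago = timezone.now() - timedelta(hours=2)
already_queued = Job.objects.filter(name="send_report").filter(
    Q(state="NEW") | Q(state="PROCESSING", created__gte=two_hours_ago)
).exists()
```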
This PR adds a new possible state to the `state` field on the `Job` model: `STOPPING`. The job is put into this state as soon as the `SIGTERM` signal is received. Assuming the job then finishes within the 30-second window, it goes into `COMPLETE` (or `FAILED`) as normal. However, if the job doesn't finish in time, it will stay in this `STOPPING` state forever.
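Building on the sketch above, the change conceptually amounts to one extra step in the `SIGTERM` handler (illustrative, not the actual diff; `current_job` is an assumed attribute tracking the job being processed, if any):

```python
def _handle_sigterm(self, signum, frame):
    self.alive = False
    # New in this PR (conceptually): record that the running job has been
    # asked to stop, so it no longer looks like a healthy PROCESSING job.
    if self.current_job is not None:
        self.current_job.state = "STOPPING"
        self.current_job.save(update_fields=["state"])
```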
That means we can now distinguish between "this job is actually running" (state `PROCESSING`), "this job is running but has been asked to exit" (state `STOPPING` and less than 30 seconds old), and "this job was asked to stop but never actually stopped and is therefore zombified" (state `STOPPING` and more than 30 seconds old).
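That distinction maps onto straightforward queries (a sketch; the `modified` field, the 30-second cutoff matching Heroku's grace period, and the reuse of the job name from above are assumptions):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job

# `modified` is assumed to be an auto-updating timestamp, i.e. roughly
# "when the state last changed".
cutoff = timezone.now() - timedelta(seconds=30)

# Zombies: asked to stop more than 30 seconds ago, so the dyno is long gone.
zombies = Job.objects.filter(state="STOPPING", modified__lt=cutoff)

# "Is a job of this type effectively still running?" - either PROCESSING, or
# STOPPING but still inside the 30-second grace window.
possibly_running = Job.objects.filter(name="send_report").filter(
    Q(state="PROCESSING") | Q(state="STOPPING", modified__gte=cutoff)
).exists()
```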
I did some state machines to try to illustrate this.
Before:
(note that all nodes are double-circled because they are all possible "end states")
After:
Drawbacks of this approach:

- A job in state `STOPPING` may of course still be running, so if the calling code is only looking for jobs in state `PROCESSING`, that doesn't guarantee there isn't another job running already. So if it's critical that only one job of a certain type runs at a time, we have to go back to the workaround of looking for `STOPPING` jobs as well - but we can be much more constrained and only look for `STOPPING` jobs newer than 30 seconds old.
- If the worker is sent a `SIGKILL` without receiving a `SIGTERM` first, it never gets a chance to put the job into `STOPPING`, so the job could still get stuck in `PROCESSING`. This is unlikely, though.

I think if we decide to merge this, it'll need to be version 3.0, as it's technically a change to the public API (ie people depending on the current behaviour may find their code broken).