Description
This is essentially a continuation of #1354, in which the issue was mitigated but not truly solved.
Specifically, when you have multiple instances working on multiple queues, the problem described in the previous issue can still happen if you are simply unlucky with the timing.
Instead of simply bumping the amount of time a lock is held for, would a better solution be to introduce actual 'ownership' of the lock? Right now the scheduler simply writes a pid into the cache value, but it could instead write e.g. a random, unique token generated at startup/construction time. Then, when heartbeating the locks, it could sanity-check that it actually still owns the lock and degrade gracefully if that assumption breaks down (rough sketch below).
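To make the idea concrete, here is a minimal sketch of what token-based ownership could look like, assuming a Redis-style cache. The names (`OWNER_TOKEN`, `acquire_lock`, `heartbeat_lock`, the `lock:<queue>` key format) are hypothetical and not taken from the existing scheduler code:

```python
# Sketch only: token-based lock ownership with an atomic "renew if still owned"
# heartbeat, assuming a Redis-backed cache. Names/keys here are illustrative.
import uuid
import redis

cache = redis.Redis()

# Generated once at scheduler startup/construction; uniquely identifies this
# instance, unlike a pid, which can collide across hosts or restarts.
OWNER_TOKEN = uuid.uuid4().hex

LOCK_TTL = 60  # seconds

# Extend the lock's TTL only if the stored value is still our token
# (compare-and-expire, executed atomically server-side).
RENEW_SCRIPT = cache.register_script("""
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('expire', KEYS[1], ARGV[2])
else
    return 0
end
""")


def acquire_lock(queue_name: str) -> bool:
    """Try to take the lock for a queue, writing our token instead of a pid."""
    return bool(cache.set(f"lock:{queue_name}", OWNER_TOKEN, nx=True, ex=LOCK_TTL))


def heartbeat_lock(queue_name: str) -> bool:
    """Renew the lock only if this instance still owns it.

    Returns False when ownership has been lost, so the caller can degrade
    gracefully (e.g. stop claiming tasks from this queue) instead of silently
    overwriting another instance's lock.
    """
    return bool(RENEW_SCRIPT(keys=[f"lock:{queue_name}"], args=[OWNER_TOKEN, LOCK_TTL]))
```

The key point is that the heartbeat becomes a compare-and-extend rather than an unconditional refresh, so an instance that has lost the lock finds out immediately rather than clobbering whichever instance now holds it.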
I've not thought it through in all that much detail, but I do think the current implementation is not solid enough for high-scale production workflows and I'd like to work through something with you to tighten it up.