multiprocessing.Process.is_alive() can incorrectly return True after join() #130895
Comments
This bug is caused by race conditions in the poll implementations (which are called by join/wait) where if multiple threads try to reap the dead process only one "wins" and gets the exit code, while the others get an error.

In the forkserver implementation the losing thread(s) set the code to an error, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.

In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the code if it wins or waits for a potentially racing thread to do so otherwise.

If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race, and return None instead of the exit code, even though the process is dead. However, as the process could be alive at the time the call is made but die immediately afterwards, this situation should already be handled by correctly written code.

To verify the behaviour a test is added which reliably triggers failures for all three implementations. A work-around for this bug in a test added for pythongh-128041 is also reverted.
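The split between blocking and non-blocking variants can be sketched roughly as follows. This is an illustration of the locking pattern described above, not the code from the actual fix: `FakeChild`, `SplitPoller`, `poll_blocking`, `poll_nonblocking` and `join_from_many_threads` are all hypothetical names, and `FakeChild` is an in-memory stand-in that only imitates the "reap once, then raise" behaviour of waiting on a real process.

```python
import threading


class FakeChild:
    """Hypothetical stand-in for a child process that can be reaped only once."""

    def __init__(self):
        self._exited = threading.Event()
        self._guard = threading.Lock()
        self._reaped = False
        self._status = None

    def exit(self, status):
        self._status = status
        self._exited.set()

    def wait(self, block):
        # Imitates a wait call: None if still running (non-blocking only),
        # the status for the first reaper, ChildProcessError for later ones.
        if block:
            self._exited.wait()
        elif not self._exited.is_set():
            return None
        with self._guard:
            if self._reaped:
                raise ChildProcessError("child already reaped")
            self._reaped = True
            return self._status


class SplitPoller:
    """Assumed shape of the fix: separate blocking and non-blocking polls."""

    def __init__(self, child):
        self._child = child
        self._cond = threading.Condition()
        self.returncode = None

    def poll_nonblocking(self):
        # Safe to hold the lock throughout because the wait cannot block.
        with self._cond:
            if self.returncode is None:
                try:
                    status = self._child.wait(block=False)
                except ChildProcessError:
                    # A blocking caller reaped the child but has not
                    # published the code yet: report "still alive".
                    return None
                if status is not None:
                    self.returncode = status
                    self._cond.notify_all()
            return self.returncode

    def poll_blocking(self):
        with self._cond:
            if self.returncode is not None:
                return self.returncode
        # The lock is released here so non-blocking callers are not stalled.
        try:
            status = self._child.wait(block=True)
            won = True
        except ChildProcessError:
            won = False
        with self._cond:
            if won:
                self.returncode = status
                self._cond.notify_all()
            else:
                # Lost the race: wait for the winner to publish the code.
                self._cond.wait_for(lambda: self.returncode is not None)
            return self.returncode


def join_from_many_threads(thread_count=4, status=7):
    child = FakeChild()
    poller = SplitPoller(child)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(poller.poll_blocking()))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    assert poller.poll_nonblocking() is None  # child has not exited yet
    child.exit(status)
    for t in threads:
        t.join()
    return results, poller.poll_nonblocking()
```

In this sketch every blocking caller returns the real exit status, whether it won or lost the reap, which is the property the description asks of join. A non-blocking caller can still see None while a blocking winner is between reaping and publishing, matching the caveat above.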
I've managed to reproduce the bug on Linux for all three implementations, although note that forkserver fails in a different way from the others (it potentially returns an incorrect error code rather than the process seeming alive). Fixing it is a little tricky for the fork/spawn implementations as we don't want blocking calls to block non-blocking calls, so we can't just naively take a lock and hold it across the operating system call. Hopefully my fix handles everything correctly: it is still possible for a non-blocking call racing with a blocking one to "incorrectly" return indicating the process is alive, but as the process could die at any time, calling code should already be prepared to handle this. Note that the implementations rely on

Reliable (on my machine) reproduction of the bug for all implementations on Linux

import multiprocessing as mp
import os
import sys
import threading

THREAD_COUNT = 2
MAX_ITERATIONS = 10

original_excepthook = threading.excepthook


def on_except(args):
    # Report the failure, then kill the whole program so that a failed
    # assertion in any waiter thread produces a non-zero exit status.
    original_excepthook(args)
    os._exit(1)


threading.excepthook = on_except


def runner(barrier):
    # Child process: exit as soon as every waiter thread is ready.
    barrier.wait()


def waiter(barrier, proc):
    # Several threads join the same process at the same time.
    barrier.wait()
    proc.join()
    assert not proc.is_alive()
    assert proc.exitcode == 0


if __name__ == '__main__':
    # Optional argument: start method (fork, spawn or forkserver).
    if len(sys.argv) > 1:
        mp.set_start_method(sys.argv[1])
    for _ in range(MAX_ITERATIONS):
        barrier = mp.Barrier(THREAD_COUNT + 1)
        proc = mp.Process(target=runner, args=(barrier,))
        threads = [threading.Thread(target=waiter, args=(barrier, proc))
                   for _ in range(THREAD_COUNT)]
        for t in threads:
            t.start()
        proc.start()
        for t in threads:
            t.join()
Bug report
Bug description:
This came up in #130849 (comment)
The problem is that popen_fork.Popen (and popen_spawn.Popen and popen_forkserver.Popen) are not thread-safe:
cpython/Lib/multiprocessing/popen_fork.py
Lines 25 to 35 in 02de9cb
The first successful call to os.waitpid() may reap the pid so that subsequent calls raise an OSError. I've only seen this on macOS (not Linux). The thread that reaped the pid may not yet have set self.returncode, since that happens a few statements later, so poll() in another thread can return None if self.returncode is still unset.
And then is_alive() can return True:
cpython/Lib/multiprocessing/process.py
Lines 153 to 170 in 02de9cb
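The interleaving can be made deterministic in a small model. The sketch below is not the CPython source: `RacyPopen`, its `after_reap` hook and `demonstrate_race` are hypothetical names, and the fake `_waitpid` only imitates the "first call reaps, later calls raise" behaviour of os.waitpid() on an exited child. The hook pauses the winning thread between reaping and setting returncode so that the losing call runs inside the window.

```python
import threading


class RacyPopen:
    """Hypothetical, simplified model of an unsynchronised poll()."""

    def __init__(self, exit_status=0):
        self.returncode = None
        self._exit_status = exit_status
        self._reaped = False
        self.after_reap = lambda: None  # hook used only to force the race

    def _waitpid(self):
        # Stand-in for os.waitpid() on an already exited child: the first
        # caller gets the status, later callers get an OSError subclass.
        # (The demo sequences the calls, so no lock is needed here.)
        if self._reaped:
            raise ChildProcessError("No child processes")
        self._reaped = True
        return self._exit_status

    def poll(self):
        if self.returncode is None:
            try:
                status = self._waitpid()
            except OSError:
                return None  # the losing thread gives up here
            self.after_reap()  # window in which returncode is still None
            self.returncode = status
        return self.returncode

    def is_alive(self):
        return self.poll() is None


def demonstrate_race():
    popen = RacyPopen()
    reaped, resume = threading.Event(), threading.Event()

    def pause_winner():
        reaped.set()
        resume.wait()

    popen.after_reap = pause_winner
    winner = threading.Thread(target=popen.poll)
    winner.start()
    reaped.wait()  # the winner has reaped but not yet set returncode

    alive_during_window = popen.is_alive()  # the losing call

    resume.set()
    winner.join()
    alive_afterwards = popen.is_alive()
    return alive_during_window, alive_afterwards
```

Calling `demonstrate_race()` shows the dead child being reported as alive while the winner is paused, and as not alive once the winner has stored the exit status.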
Note that some classes like concurrent.futures.ProcessPoolExecutor use threads internally, so the user may not even know that threads are involved.
Repro:
repro.py
NOTE: popen_fork.Popen (and subclasses) are distinct from subprocess.Popen
CPython versions tested on:
CPython main branch
Operating systems tested on:
macOS