multiprocessing.Process.is_alive() can incorrectly return True after join() #130895

Open
colesbury opened this issue Mar 5, 2025 · 1 comment
Labels
stdlib Python modules in the Lib dir topic-multiprocessing type-bug An unexpected behavior, bug, or error

Comments

@colesbury
Contributor

colesbury commented Mar 5, 2025

Bug report

Bug description:

This came up in #130849 (comment)

The problem is that popen_fork.Popen (and popen_spawn.Popen and popen_forkserver.Popen) are not thread-safe:

def poll(self, flag=os.WNOHANG):
    if self.returncode is None:
        try:
            pid, sts = os.waitpid(self.pid, flag)
        except OSError:
            # Child process not yet created. See #1731717
            # e.errno == errno.ECHILD == 10
            return None
        if pid == self.pid:
            self.returncode = os.waitstatus_to_exitcode(sts)
    return self.returncode

The first successful call to os.waitpid() reaps the pid, so subsequent calls raise OSError. (I've only seen this on macOS, not Linux.) However, self.returncode may not have been set yet at that point -- that happens a few statements later -- so poll() can return None if:

  1. The process has finished
  2. Another thread called poll(), but hasn't yet set self.returncode
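For illustration, the window can be closed by holding a lock across both the reap and the returncode store. This is a toy sketch, not the actual CPython fix: `LockedPopen` and `_fake_waitpid` are invented names, with `_fake_waitpid` mimicking os.waitpid's reap-once behavior so the example is self-contained.

```python
import os
import threading

class LockedPopen:
    """Toy stand-in for popen_fork.Popen with the poll() race closed.

    _fake_waitpid models os.waitpid: the first call "reaps" the child
    and returns its wait status; every later call raises
    ChildProcessError, just as waitpid does for an already-reaped pid.
    """

    def __init__(self):
        self.returncode = None
        self._reaped = False
        self._lock = threading.Lock()

    def _fake_waitpid(self):
        if self._reaped:
            raise ChildProcessError
        self._reaped = True
        return 0  # wait status for a clean exit

    def poll(self):
        # The lock covers both the reap and the returncode store, so a
        # thread that loses the race blocks until the winner has
        # published self.returncode, then returns it -- never None for
        # a dead, already-reaped child.
        with self._lock:
            if self.returncode is None:
                try:
                    sts = self._fake_waitpid()
                except ChildProcessError:
                    return None
                self.returncode = os.waitstatus_to_exitcode(sts)
            return self.returncode

popen = LockedPopen()
results = []
threads = [threading.Thread(target=lambda: results.append(popen.poll()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place every thread observes exit code 0; the None-after-reap window is gone. (As the linked fix discusses, the real implementations cannot simply hold a lock like this, because poll() is also called in blocking mode.)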

And then is_alive() can return True:

def is_alive(self):
    '''
    Return whether process is alive
    '''
    self._check_closed()
    if self is _current_process:
        return True
    assert self._parent_pid == os.getpid(), 'can only test a child process'
    if self._popen is None:
        return False
    returncode = self._popen.poll()
    if returncode is None:
        return True
    else:
        _children.discard(self)
        return False

Note that some classes like concurrent.futures.ProcessPoolExecutor use threads internally, so the user may not even know that threads are involved.

Repro:

repro.py
import os
import multiprocessing as mp
import threading
import time
import sys

original_excepthook = threading.excepthook

def on_except(args):
    original_excepthook(args)
    os._exit(1)

threading.excepthook = on_except

def p1():
    pass

def thread1(p):
    while p.is_alive():
        time.sleep(0.00001)

def test():
    for i in range(1000):
        print(i)
        p = mp.Process(target=p1)
        p.start()

        t = threading.Thread(target=thread1, args=(p,))
        t.start()

        p.join()
        assert not p.is_alive()

        t.join()

def main():
    threads = [threading.Thread(target=test) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()

NOTE:

  • This is unrelated to free threading
  • popen_fork.Popen (and subclasses) are distinct from subprocess.Popen

CPython versions tested on:

CPython main branch

Operating systems tested on:

macOS

Linked PRs

@colesbury colesbury added topic-multiprocessing type-bug An unexpected behavior, bug, or error labels Mar 5, 2025
@picnixz picnixz added the stdlib Python modules in the Lib dir label Mar 7, 2025
duaneg added a commit to duaneg/cpython that referenced this issue Mar 19, 2025
This bug is caused by race conditions in the poll implementations (which are
called by join/wait) where if multiple threads try to reap the dead process
only one "wins" and gets the exit code, while the others get an error.

In the forkserver implementation the losing thread(s) set the code to an error,
possibly overwriting the correct code set by the winning thread. This is
relatively easy to fix: we can just take a lock before waiting for the process,
since at that point we know the call should not block.

In the fork and spawn implementations the losers of the race return before the
exit code is set, meaning the process may still report itself as alive after
join returns. Fixing this is trickier as we have to support a mixture of
blocking and non-blocking calls to poll, and we cannot have the latter waiting
to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The
non-blocking variant does its work with the lock held: since it won't block
this should be safe. The blocking variant releases the lock before making the
blocking operating system call. It then retakes the lock and either sets the
code if it wins or waits for a potentially racing thread to do so otherwise.

If a non-blocking call is racing with the unlocked part of a blocking call it
may still "lose" the race, and return None instead of the exit code, even
though the process is dead. However, as the process could be alive at the time
the call is made but die immediately afterwards, this situation should already
be handled by correctly written code.

To verify the behaviour a test is added which reliably triggers failures for
all three implementations. A work-around for this bug in a test added for
pythongh-128041 is also reverted.
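The blocking half of the split described above can be sketched as follows. All names here are hypothetical: `FakeChild` simulates a child whose exit status can be reaped exactly once (like a real pid under waitpid), so the sketch is self-contained and runnable without spawning real processes.

```python
import threading

class FakeChild:
    """Simulates a child process whose exit status can be reaped once."""

    def __init__(self, exitcode=0):
        self._exitcode = exitcode
        self._exited = threading.Event()
        self._reap_once = threading.Lock()

    def exit(self):
        self._exited.set()

    def blocking_reap(self):
        # Like a blocking os.waitpid: waits for the child to die, but
        # only the first caller gets the status; later callers lose the
        # race and get None.
        self._exited.wait()
        if self._reap_once.acquire(blocking=False):
            return self._exitcode
        return None

class Popen:
    def __init__(self, child):
        self._child = child
        self.returncode = None
        self._cond = threading.Condition()

    def wait(self):
        # Blocking variant: do the OS wait with no lock held, then
        # retake the lock and either publish the code (if we won the
        # race) or wait for the winning thread to publish it.
        code = self._child.blocking_reap()
        with self._cond:
            if code is not None:
                self.returncode = code
                self._cond.notify_all()
            while self.returncode is None:
                self._cond.wait()
            return self.returncode

child = FakeChild(exitcode=0)
popen = Popen(child)
codes = []
threads = [threading.Thread(target=lambda: codes.append(popen.wait()))
           for _ in range(8)]
for t in threads:
    t.start()
child.exit()
for t in threads:
    t.join()
```

Every thread returns the real exit code: the race loser no longer returns before self.returncode is set, it sleeps on the condition until the winner publishes it.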
@duaneg
Contributor

duaneg commented Mar 19, 2025

I've managed to reproduce the bug on Linux for all three implementations, although note that forkserver fails in a different way from the others: it can return an incorrect exit code rather than reporting the process as still alive.

Fixing it is a little tricky for the fork/spawn implementations: we don't want blocking calls to block non-blocking calls, so we can't just naively take a lock and hold it across the operating system call. Hopefully my fix handles everything correctly. It is still possible for a non-blocking call racing with a blocking one to "incorrectly" report the process as alive; however, since the process could die at any moment anyway, calling code should already be prepared to handle this.

Note that the implementations rely on multiprocessing.connection.wait to determine whether a call to poll would block. I have checked this and believe it should be thread-safe: the call will never block after wait says it is ready no matter what other threads do in the interim, although it may return an error instead of the exit code for threads losing the race. However if that is mistaken we could potentially have "non-blocking" calls hanging indefinitely: if possible someone with more experience with the multiprocessing code should double-check that.
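The readiness guarantee mentioned above can be exercised directly: multiprocessing.connection.wait() on a process's sentinel returns once the process has exited, after which join() will not block. A minimal Unix-only sketch (it uses the fork context explicitly so the code can run at module level without a `__main__` guard):

```python
import multiprocessing as mp
from multiprocessing.connection import wait

def child():
    pass

ctx = mp.get_context("fork")  # Unix-only; spawn/forkserver need a __main__ guard
p = ctx.Process(target=child)
p.start()

# wait() returns the subset of its arguments that are ready; once the
# sentinel is ready the process has exited and join() will not block.
ready = wait([p.sentinel], timeout=30)
assert p.sentinel in ready

p.join()
assert p.exitcode == 0
```

The thread-safety question is whether, after wait() reports the sentinel ready in one thread, the subsequent reap in another thread can still block; per the comment above it should not, but an error (rather than the exit code) is possible for threads that lose the reaping race.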

Reliable (on my machine) reproduction of the bug for all implementations on Linux
import multiprocessing as mp
import os
import sys
import threading

THREAD_COUNT=2
MAX_ITERATIONS=10

original_excepthook = threading.excepthook

def on_except(args):
    original_excepthook(args)
    os._exit(1)

threading.excepthook = on_except

def runner(barrier):
    barrier.wait()

def waiter(barrier, proc):
    barrier.wait()
    proc.join()
    assert not proc.is_alive()
    assert proc.exitcode == 0

if __name__ == '__main__':
    if len(sys.argv) > 1:
        mp.set_start_method(sys.argv[1])

    for _ in range(MAX_ITERATIONS):
        barrier = mp.Barrier(THREAD_COUNT+1)
        proc = mp.Process(target=runner, args=(barrier,))

        threads = [threading.Thread(target=waiter, args=(barrier, proc)) for _ in range(THREAD_COUNT)]
        for t in threads:
            t.start()

        proc.start()
        for t in threads:
            t.join()
