ENH: add locking around initializing the argparse cache #26430
Conversation
I just got feedback that using the private pyatomic API isn't a good idea and that I should rewrite this to use C11 atomics directly. I'll also need to see whether that causes issues on Windows, since MSVC only added C11 atomics in Visual Studio 2022 and several of the Windows builds use Visual Studio 2019.
If we want to upgrade to 2022, starting the process now would be a good idea. @h-vetinari, are there constraints on doing this?
If you need additional atomics beyond this, then I'd consider figuring out how to use VS 2022. Otherwise, I'd rearchitect this a bit as follows: in each _NpyArgParserCache you keep an atomic_int initialized flag, and you also have a single global lock that guards cache initialization. So in pseudo-code it would look something like:

#ifdef __STDC_NO_ATOMICS__
#define atomic_int volatile int
#else
#include <stdatomic.h>
#endif
static PyThread_type_lock global_cache_lock;
typedef struct {
    atomic_int initialized;
    ...
} _NpyArgParserCache;

// cache initialization
if (!cache->initialized) {
    PyThread_acquire_lock(global_cache_lock, 1);
    if (!cache->initialized) {
        initialize_keywords(cache);
        cache->initialized = 1;
    }
    PyThread_release_lock(global_cache_lock);
}

// module initialization
global_cache_lock = PyThread_allocate_lock();

EDIT: fixed bug in pseudo-code
VS2019 is now EOL, so you'd have every right to move on. I'm guessing MSFT will also start removing it from public CI offerings in the not-too-distant future (if they follow a similar timeline as for VS2019), at which point it becomes a moot question anyway.

What it means is that all people wanting to compile against compiled numpy artefacts (in the past that was mostly [...]) would need to follow suit. And in this case, it would probably also force conda-forge to move on, as numpy has ~1000 compiled dependencies, and at that point it becomes pretty much unavoidable.

I can bring this up in the core call, but my personal opinion is that you should not make suboptimal engineering choices just to support EOL toolchains (obviously a question of degree, but certainly if the win is substantial). As VS2022 is ABI-compatible with VS2019 and a drop-in replacement, I really don't see much of an issue arising from the toolchain version virality (I didn't hear of any relevant case in the transition from VS2017->VS2019 either, though I asked around for feedback).

CC @rgommers
Not necessarily, the only large user of the [...]

There is a path and a mostly working prototype to remove [...]
If we have significant needs here for supporting the free-threaded build, then I guess we have to bite the bullet.
@ngoldbaum, since we are not using atomics now, please keep in mind that we will need build config changes before this PR goes in.
I tried using a single global mutex initialized during module setup, but I found that spawned threads saw uninitialized mutexes. Maybe this issue is specific to [...]
force-pushed from f265014 to 9293a4d
force-pushed from 9293a4d to 9a3b8cc
It turns out the issue I was seeing was caused by initializing the lock in the [...]
FWIW, we discussed this in conda-forge/core, and I invited people to comment here if they care - AFAICT there was no opposition or concern about numpy using VS2022. We might even be able to stay on vs2019, because under the hood we're using vs2022 already, but targeting the vc142 toolset. It will depend on how that shakes out, but in any case, don't worry about conda-forge w.r.t. this.
@rgommers I never tried running CI on this PR without the build changes, and I know you were interested in taking a look. Should I try to see if it's actually necessary? I know we do test via QEMU on ARM with gcc, so at least a few of the emulated tests should fail without the libatomic detection if it is necessary.
That may be interesting to do once indeed, yes. Now that I look at this again: the reason for the [...]
@@ -12,6 +12,16 @@

#include "arrayfunction_override.h"

static PyThread_type_lock global_mutex;

int init_argparse_mutex() {
Should this be int init_argparse_mutex(void)?
/* Null terminated list of keyword argument name strings */
PyObject *kw_strings[_NPY_MAX_KWARGS+1];
} _NpyArgParserCache;

NPY_NO_EXPORT int init_argparse_mutex();
init_argparse_mutex(void);
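For background on why the (void) form is being suggested (a general C note, not something stated in the PR itself): before C23, an empty parameter list declares a function with an unspecified number of parameters, so the compiler cannot diagnose calls that pass arguments, whereas (void) declares a true zero-argument prototype. A minimal illustration with hypothetical names:

/* Pre-C23: declares a function with an *unspecified* parameter list;
   a call like init_without_void(42) compiles without complaint. */
int init_without_void();

/* Declares a function taking *no* arguments; init_with_void(42)
   is a compile-time error. */
int init_with_void(void);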
@@ -214,6 +214,43 @@ else
lapack_dep = declare_dependency(dependencies: [lapack, blas_dep])
endif

# Determine whether it is necessary to link libatomic. This could be the case
# e.g. on 32-bit platforms when atomic operations are used on 64-bit types.
RISC-V with GCC is weird in that it supports 64-bit atomic ops without -latomic, but not 8-bit or 16-bit ops. So in CPython we had to adjust the test to check for both 64-bit and 8-bit atomic operations:
https://github.com/python/cpython/blob/406ffb5293a8c9ca315bf63de1ee36a9b33f9aaf/configure.ac#L7417-L7422
If you are only using atomic_int then it might not matter.
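To make that kind of check concrete: the usual approach is to try to compile and link a tiny program that exercises both a 64-bit and an 8-bit atomic without -latomic, and add -latomic only if the link fails. A rough sketch of such a test program (an illustration, not the exact CPython or NumPy check):

#include <stdatomic.h>
#include <stdint.h>

int main(void)
{
    /* Exercise both widths: some targets (e.g. RISC-V with GCC) handle
       one natively but need libatomic for the other. */
    _Atomic uint64_t value64 = 0;
    _Atomic uint8_t value8 = 0;
    atomic_fetch_add(&value64, 1);
    atomic_fetch_add(&value8, 1);
    return (int)(atomic_load(&value64) + atomic_load(&value8));
}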
@@ -4,6 +4,13 @@
#include <Python.h>
#include "numpy/ndarraytypes.h"

#ifdef __STDC_NO_ATOMICS__
#define atomic_int volatile int
Caveat: this is "good enough" for x86 MSVC and also Windows' "ARM64EC" platform target, but not other ARM64 targets on Windows. If we need to support other Windows ARM64 targets, we'll need a bit more work. Hopefully, MSVC will support <stdatomic.h> soon, but I wouldn't hold my breath for it.
https://learn.microsoft.com/en-us/cpp/build/reference/volatile-volatile-keyword-interpretation
Thanks for clarifying! We don't test or build wheels for Windows ARM64 right now, but I think we might need to support _M_ARM64 (see e.g. #22530). According to the MSVC docs, ARM64EC sets _M_X64.
I see pyatomic_msc.h has:
static inline uint64_t
_Py_atomic_load_uint64(const uint64_t *obj)
{
#if defined(_M_X64) || defined(_M_IX86)
    return *(volatile uint64_t *)obj;
#elif defined(_M_ARM64)
    return (uint64_t)__ldar64((unsigned __int64 volatile *)obj);
#else
# error "no implementation of _Py_atomic_load_uint64"
#endif
}
And storing is implemented using e.g. _InterlockedCompareExchange64 from intrin.h.
That all seems reasonable to include in NumPy to support this.
Even with the experimental atomics support in MSVC 2022, they still set __STDC_NO_ATOMICS__, so we'd need to go out of our way to detect and use it.
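For illustration, the store side of such a shim could be built on the same intrinsics; the sketch below uses a hypothetical npy_atomic_store_uint64 name and is not a verbatim copy of pyatomic_msc.h:

#include <stdint.h>
#include <intrin.h>

/* Hypothetical helper: a 64-bit atomic store on MSVC without <stdatomic.h>. */
static inline void
npy_atomic_store_uint64(uint64_t *obj, uint64_t value)
{
#if defined(_M_X64) || defined(_M_IX86)
    /* Retry a compare-exchange until we replace the value we observed;
       _InterlockedCompareExchange64 is available on both x86 and x64. */
    __int64 old = *(volatile __int64 *)obj;
    for (;;) {
        __int64 prev = _InterlockedCompareExchange64(
            (volatile __int64 *)obj, (__int64)value, old);
        if (prev == old) {
            break;  /* store took effect */
        }
        old = prev;  /* another thread wrote in between; retry */
    }
#elif defined(_M_ARM64)
    /* __stlr64 issues a store-release instruction on ARM64. */
    __stlr64((unsigned __int64 volatile *)obj, (unsigned __int64)value);
#else
# error "no implementation of npy_atomic_store_uint64"
#endif
}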
Following up in #26780
The argparse cache relies on a struct that gets statically defined in each function using npy_parse_arguments. The struct caches the interned string names of keyword arguments. Up until now, the GIL prevented data races while initializing the cache. Nothing prevents the same Python function from being simultaneously called in multiple threads in the free-threaded build, so we need something a little more sophisticated there.

It turns out that the symbols in the up-until-now private pyatomic.h header are exported by Python.h starting in Python 3.13. I've used some of the atomic operations to implement an internal-only single-initialization API, npy_call_once, and then used that API in the argument parser. There are some other caches like this, so I've factored things so that I can re-use npy_call_once elsewhere in the codebase. Because the pyatomic symbols are only exposed starting in Python 3.13, I've decided to only make use of npy_call_once in the free-threaded build; everything is guarded by preprocessor checks for Py_GIL_DISABLED, and this should be a no-op change for the regular build.

I re-implemented this using a single atomic flag and a PyThread_type_lock mutex, per Sam's suggestion.
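As a sketch of what that final shape could look like (reusing the global_mutex, _NpyArgParserCache, and initialize_keywords names from the discussion above, and assuming initialize_keywords reports failure with a negative return value; this illustrates the pattern rather than reproducing the exact code in the PR):

#include <Python.h>   /* PyThread_type_lock, WAIT_LOCK, Py_GIL_DISABLED */

/* Single process-wide lock, allocated once in init_argparse_mutex()
   during module initialization. */
static PyThread_type_lock global_mutex;

static int
ensure_cache_initialized(_NpyArgParserCache *cache)
{
#ifdef Py_GIL_DISABLED
    /* Free-threaded build: double-checked locking.  The first check is a
       cheap fast path; the second check under the lock ensures only one
       thread fills in the cache. */
    if (!cache->initialized) {
        PyThread_acquire_lock(global_mutex, WAIT_LOCK);
        if (!cache->initialized) {
            if (initialize_keywords(cache) < 0) {
                PyThread_release_lock(global_mutex);
                return -1;
            }
            /* Set the flag last, after every other cache field is written. */
            cache->initialized = 1;
        }
        PyThread_release_lock(global_mutex);
    }
#else
    /* Regular (GIL) build: no concurrent callers, so a plain
       check-and-initialize is enough; this path is unchanged. */
    if (!cache->initialized) {
        if (initialize_keywords(cache) < 0) {
            return -1;
        }
        cache->initialized = 1;
    }
#endif
    return 0;
}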