NOGIL: Make loop data cache and dispatch cache thread-safe in nogil build #26348
Conversation
There is really no need to disable these, I think? This is fully static data once created, so the only place where you should need a lock is if there is a cache miss and you are creating it!? The custom hash map is a bit trickier; IIRC it gave a ~15% speed improvement at the time for small-array ufunc calls. I am not really attached to it, though, since it adds a bit of annoying complexity. OTOH, I am starting to wonder: how hard would it be to vendor a simple lock for these types of cases (i.e. a read lock that only blocks writes, and a write lock that blocks everything, where we expect the write lock to very rarely be taken)!? I feel a bit like this may be just the beginning, since at some point you probably have to think about locking the Array object (or …).
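A hedged sketch of the read/write-lock idea described above, using POSIX pthread_rwlock purely for illustration (a vendored lock would need its own portable implementation; `lookup_entry` and `create_entry` are hypothetical stand-ins for the cache's lookup and fill operations, not NumPy API):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical cache operations; stand-ins, not NumPy API. */
extern void *lookup_entry(const void *key);
extern void *create_entry(const void *key);

static pthread_rwlock_t cache_lock = PTHREAD_RWLOCK_INITIALIZER;

void *
cache_get(const void *key)
{
    /* Read lock: many readers proceed concurrently and only block
     * while the (rare) write lock is held. */
    pthread_rwlock_rdlock(&cache_lock);
    void *entry = lookup_entry(key);
    pthread_rwlock_unlock(&cache_lock);

    if (entry == NULL) {
        /* Write lock: blocks everything, but only on a cache miss. */
        pthread_rwlock_wrlock(&cache_lock);
        /* Re-check: another thread may have filled the entry between
         * dropping the read lock and taking the write lock. */
        entry = lookup_entry(key);
        if (entry == NULL) {
            entry = create_entry(key);
        }
        pthread_rwlock_unlock(&cache_lock);
    }
    return entry;
}
```

The re-check under the write lock is what keeps two threads that miss on the same key from both creating the entry.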
I didn't try to fully understand what was going on, but the failure mode was …
I'm hoping that … Without doing something for these two caches, a lot of operations using numpy in parallel higher in the PyData stack don't work either, so merging this even with a slowdown for single-threaded workflows should also unblock work on fixing issues in other libraries. In particular, this PR at least fixes a scikit-learn parallel forest training test and a scipy concurrent minimization test.
@rgommers just pointed out to me that I wasn't clear initially: both changes only affect single-threaded performance in the nogil build. All the changed code is guarded by ifdefs.
Yeah, I hadn't noticed that initially and then realized it when you commented; the changes should be fine with that in any case (but I should maybe look over the dispatching part briefly).
OK, a bit hacky to have a dict path in a custom object here, but I think it is fine, especially since all of these are really stop-gaps until a better mutex is available.
I do think the test code needs a small fix (I think this was a manual test function to inspect the casts). No need to bother to actually test it on my account, though.
```c
}
Py_INCREF(result);
// caller handles error and missing key cases
PyArrayIdentityHash_GetItemRef(tb, keys, &result);
```
I guess this isn't actually used; I think I used it for a more manual test, maybe? But I think GetItemRef can probably set an error? So I think you might need something like: if result == NULL, then PyErr_Clear(); result = Py_NewRef(Py_None);?
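A minimal sketch of the check being suggested, assuming PyArrayIdentityHash_GetItemRef leaves *result as NULL on a missing key and may have set an error that should be swallowed here:

```c
PyObject *result = NULL;
PyArrayIdentityHash_GetItemRef(tb, keys, &result);
if (result == NULL) {
    /* GetItemRef may have set an error on a miss; map it to None */
    PyErr_Clear();
    result = Py_NewRef(Py_None);
}
```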
numpy/_core/src/umath/dispatching.c
```c
PyTuple_GET_ITEM(info, 1), &PyArrayMethod_Type)) {
    /* Found the ArrayMethod and NOT a promoter: return it */
    // TODO properly manage refcounts, this is safe because
    // PyArrayMethodObject instances are never decref'd otherwise
```
This is OK, just to note that in theory NumPy could replace them, but in practice it cannot happen right now. I.e. the TODO may actually be a very, very subtle bug.
I do not think we should assume this unless we are absolutely sure we will never expose this cache to any ufuncs outside of numpy and will never write ufuncs that get deallocated ourselves (my https://github.com/mhvk/chain_ufunc certainly assumes short-lived ufuncs are a useful concept).
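For illustration, a hypothetical shape of what resolving the TODO could look like (not part of this PR): take a strong reference so the returned method stays alive even if the cache entry is later replaced.

```c
/* Hypothetical: avoid relying on cached ArrayMethods never being
 * deallocated by returning a strong reference instead. */
PyObject *method = PyTuple_GET_ITEM(info, 1);  /* borrowed from the info tuple */
Py_INCREF(method);
return method;
```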
```c
#ifndef Py_GIL_DISABLED
```
I think we should have a comment here about the difference, and about how ideally we'd have a solution that works for both builds yet does not reduce performance when the GIL is enabled.
Note that ideally we would indeed not have our own hash tables; @seberg and I discussed other options at length when this code was added, but couldn't find an alternative that gave the same improvement.
Right, about the comment: maybe it makes sense to add something like:
/* Once cheap locking is exposed in Python, this should be used instead, see gh-... */
> Ufuncs never get deallocated
>
> I do not think we should assume this

Yeah, we shouldn't. But I think it was more of an observation here rather than something very important for the change. I think the change actually prepares for making things more resilient to an ArrayMethod getting deleted (ufunc lifetime shouldn't matter much, as a ufunc cannot be deallocated while in execution).
Although, I guess the observation is still true in NumPy, so deallocating ufuncs is possibly not tested (well), unfortunately.
(force-pushed from 68df7e4 to 86e39a0)
Thanks for looking this over! I realized I didn't have a good intuition for the performance impact of the dict solution vs a mutex, so I spent some time on this yesterday and discovered that using PyMutex is viable. Benchmark script and results are here: https://gist.github.com/ngoldbaum/a858dbe28b90c1b06583af6fe6a45640

Take these performance measurements with a grain of salt. I'm running this on my laptop, which has CPU throttling and other settings enabled that will impact the numbers. For example, I'm pretty sure all of the reported differences in the PyMutex results are noise, and the actual single-threaded impact of PyMutex is below the measurement threshold, because I can't reproduce any of the changes when I run each benchmark individually. I've written the version with …
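For reference, a hedged sketch of what the PyMutex variant benchmarked in the gist could look like. PyMutex is a one-byte lock where zero-initialized means unlocked; it became part of CPython's public API in 3.13 and was internal-only before that. The struct and function names here are illustrative, not the PR's:

```c
#include <Python.h>

typedef struct {
    PyMutex mutex;   /* one byte; zero-initialized == unlocked,
                        no allocation or free needed */
    /* ... existing hash-table fields ... */
} identity_hash_table;   /* illustrative name, not the PR's struct */

static void
with_table_locked(identity_hash_table *tb)
{
    PyMutex_Lock(&tb->mutex);
    /* ... read or mutate the table ... */
    PyMutex_Unlock(&tb->mutex);
}
```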
```c
PyArrayIdentityHash_SetItem(tb, item+1, item[0], 1);
PyObject **tb_item = find_item(tb, item + 1);
tb_item[0] = item[0];
memcpy(tb_item+1, item+1, tb->key_len * sizeof(PyObject *));
```
This avoids a recursive call to _resize_if_necessary and then acquiring the (non-reentrant) lock twice. It might also be a teeny bit faster.
```c
#define FREE_LOCK(tb) \
    if (tb->mutex != NULL) { \
        PyThread_free_lock(tb->mutex); \
    }
```
These last two don't need to be macros, but writing it this way lets me avoid adding more Py_GIL_DISABLED checks below.
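To make that pattern concrete, here is a sketch of how the lock macros plausibly pair up (only FREE_LOCK above is quoted from the diff; the bodies here are illustrative): on a GIL-enabled build every macro expands to nothing, so call sites stay free of #ifdefs.

```c
#ifdef Py_GIL_DISABLED
#define LOCK_TABLE(tb)   PyThread_acquire_lock((tb)->mutex, WAIT_LOCK)
#define UNLOCK_TABLE(tb) PyThread_release_lock((tb)->mutex)
#else
/* No-ops: the GIL already serializes access to the table. */
#define LOCK_TABLE(tb)
#define UNLOCK_TABLE(tb)
#endif
```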
This is rather elegant! And it looks all good, but in case someone else asks for changes, I'd recommend adding a comment like /* Does nothing if GIL enabled */ after INITIALIZE_TABLE(tb) and LOCK_TABLE(tb). Though I think it is really quite clear, so if there are no other comments, that's not worth rerunning CI for.
Looks good to me as well, thanks for checking the mutex approach, Nathan.
This disables the loop data cache and adds locking to the custom hash table storing the dispatch cache. As a result, this fixes segfaults that can happen when these caches are corrupted by calling a ufunc simultaneously in multiple threads, as in the test I added.
This impacts single-threaded performance, but nothing catastrophic; see the benchmark comment for up-to-date numbers.