MNT: Reorganize non-constant global statics into structs #26607
aca2e25
to
95a592b
Compare
On my system I see that
When I test on this PR I don't see any significant timing difference in importing numpy, as expected given the above analysis.
I ran the full benchmark suite. It looks like there are some performance changes, but it's kind of obnoxious to pick out reproducible changes from the big list. So far I've found:
float32 partition
see the full list here: https://gist.github.com/ngoldbaum/7e8ee9a129a96a32536f228e4214018b
I tried comparing |
if (npy_ma_str_current_allocator == NULL) {
// this is module-level global heap allocation, it is currently
// never freed
npy_ma_str = PyMem_Calloc(sizeof(npy_ma_str_struct), 1);
Should this (and the others like it) check that it is not called twice?
Looks like a move in the right direction to me. I wonder if the performance changes if you statically allocate the structs rather than calling |
Good point! It does seem to help, at least on Linux. There are still some heap allocations though, I still need to look closer at whether they can be made static. |
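The two suggestions above (static allocation instead of a heap call, and guarding against a second initialization) can be sketched together in plain C. This is only an illustration under stated assumptions: `demo_static_data`, its fields, and `demo_init_static_data` are hypothetical names, and plain C strings stand in for the cached Python objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a struct of cached module-level values. */
typedef struct {
    const char *current_allocator;  /* illustrative field names only */
    const char *dlpack;
    int initialized;                /* set once the struct has been filled */
} demo_static_data;

/* Static storage duration: zero-initialized at program start,
 * so no heap allocation is needed and nothing has to be freed. */
static demo_static_data demo_data;

/* Fill the struct once. Returns 0 on success, -1 if called a second time. */
int
demo_init_static_data(void)
{
    if (demo_data.initialized) {
        return -1;  /* refuse to overwrite already-initialized state */
    }
    /* Before the first initialization every field must still be NULL. */
    assert(demo_data.current_allocator == NULL);
    demo_data.current_allocator = "current_allocator";
    demo_data.dlpack = "__dlpack__";
    demo_data.initialized = 1;
    return 0;
}
```

Returning an error on the second call is one possible policy; asserting instead would also make a double initialization visible in debug builds.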
b5464df
to
f3643a2
Compare
So the latest version of this PR drops moving things into the Earlier I said I compared
Here f3643a2 is the most recent commit in this PR and 9e40ee24 is the commit I generated with a no-op I'm going to start a full benchmark run like this to determine which benchmarks have random results and I will ignore those for determining whether or not this PR has a performance impact. |
I see what the bug is there.
diff --git a/benchmarks/benchmarks/bench_linalg.py b/benchmarks/benchmarks/bench_linalg.py
index 3077357237..f3eb819c18 100644
--- a/benchmarks/benchmarks/bench_linalg.py
+++ b/benchmarks/benchmarks/bench_linalg.py
@@ -72,7 +72,7 @@ def time_tensordot_a_b_axes_1_0_0_1(self):
 class Linalg(Benchmark):
-    params = set(TYPES1) - set(['float16'])
+    params = sorted(list(set(TYPES1) - set(['float16'])))
     param_names = ['dtype']
     def setup(self, typename):
Nice! That makes sense. This is just one benchmark I could run quickly to prove there's an issue, I'll take a look at the full list of randomly changing benchmarks to see if there are similar problems. |
Left some comments. I am still a bit saddened by having these giant init functions and wonder if we shouldn't do a local init pattern instead for some things.
I may look into some maintenance here in general, I think we can invent some cuter patterns (even if it adds a bit of complexity in the helpers).
I also think we should just drop the _ma_ for "multiarray", it is a leftover of when multiarray and ufunc were two modules. I.e. we could even shorten it to npy_static.<...>.
union {
    npy_uint8 bytes[8];
    npy_uint64 uint64;
} unpack_lookup_big[256];
I wonder if it makes sense to split out the non-objects because at some point, I assume that modules may need to decref all of these (or implement a tp_traverse, but that is the same thing).
Also, this table for example is truly static even with subinterpreters. The only issue is initialization.
Also, this table for example is truly static even with subinterpreters. The only issue is initialization.
Yup, everything in this struct is static after module initialization. I guess if we ever supported subinterpreters someone could make this be initialized once for all subinterpreters or just do it every time. I don't think it makes a ton of difference...
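To make the "only issue is initialization" point concrete, here is a minimal sketch of filling such a table once and only reading it afterwards. The standard fixed-width types stand in for `npy_uint8` and `npy_uint64`, all `demo_` names are hypothetical, and the bit order (most significant bit first, one bit per byte) is an assumption suggested by the `_big` suffix rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the table in the diff, using standard types. */
typedef union {
    uint8_t bytes[8];
    uint64_t uint64;
} demo_unpack_entry;

static demo_unpack_entry demo_unpack_lookup_big[256];

/* Fill entry j with the eight bits of j, most significant bit first,
 * one bit per byte. Meant to run once during initialization. */
void
demo_init_unpack_lookup(void)
{
    for (int j = 0; j < 256; j++) {
        for (int k = 0; k < 8; k++) {
            /* bit k of j (counting from the least significant bit)
             * lands in byte 7 - k of the entry */
            demo_unpack_lookup_big[j].bytes[7 - k] = (uint8_t)((j >> k) & 1);
        }
    }
}

/* Read-only access after initialization. */
const uint8_t *
demo_unpack_bits(uint8_t value)
{
    assert(demo_unpack_lookup_big[1].bytes[7] == 1);  /* table must be filled */
    return demo_unpack_lookup_big[value].bytes;
}
```

Because the contents depend only on the index, filling the table again (for example once per subinterpreter) would write the same values, which is why the choice between "once" and "every time" matters little.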
-    npy_ma_str___dlpack__ = PyUnicode_InternFromString("__dlpack__");
-    if (npy_ma_str___dlpack__ == NULL) {
+    npy_ma_str.__dlpack__ = PyUnicode_InternFromString("__dlpack__");
+    if (npy_ma_str.__dlpack__ == NULL) {
I have some thoughts on how to shrink this code (even if a bit tricky 20 lines, I feel it might be nice). I'll make a PR later.
I also think we should move this into its own file.
I also think we should move this into its own file.
Sorry, just to clarify - what should be moved into its own file? Filling all the static structs? Or just the string interning?
For the former, fair enough, that occurred to me, and it would separate multiarraymodule.h from the global statics.
I would say all of the static ones in one file maybe? Seemed worthwhile to try to use this opportunity to clean things up a bit nicer (but I am happy to help with it!)
e8861b3
to
8d7d57c
Compare
I just split out the static data structs into their own file. I left the thread-unsafe state struct in multiarraymodule.h since I don't think it makes sense for it to live in the new header, and I like how it makes it clear when code accesses thread-unsafe state and where that state is defined. To justify that choice: I completely missed some global state on the first pass of this PR and added it to the thread-unsafe state struct in this pass.
I looked at initializing e.g. the ArrayMethod objects used in the casts in a module init, and I felt like it wasn't any clearer to initialize locally in an initialization function inside the same file as where the global is used than to initialize it along with all the other cached globals, and it sort of feels like a nice pattern to centralize this stuff. All IMO of course.
I've been looking at benchmarking closely the past few days and I think all the benchmark results I shared earlier for this PR are noise. Many of the things that show up are due to bugs I fixed in #26637, #26638, and #26639. The rest I suspect are due to jitter on the laptop I was using to run the benchmarks as well as asv caching results. So far I haven't been able to find a single performance change reported by asv that is reproducible outside of the asv environment or that persists if I purge the asv results database or make a no-op empty commit to test with.
I'm starting an asv run with higher-than-default settings for
8d7d57c
to
eb55252
Compare
eb55252
to
2bf1f1f
Compare
Had a look through with some nitpicky comments. Thanks for reorganizing into a new file, and the macros for strings and imports are also at least much more compact!
Overall, I think this is good to go for me, unfortunately it might create merge conflicts pretty quickly, but I guess we'll just have to deal with them (for backporting)?
(There are the nitpicky comments and a merge conflict, but I can also just make a pass and merge if you think it's ready and it is clear that we are merging now.)
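As a rough idea of how a macro can make per-string initialization more compact, here is a self-contained sketch in plain C. It is not the macro from this PR: `DEMO_INTERN_STRING`, `demo_interned_str`, and `demo_intern` are hypothetical names, and `demo_intern` is only a stand-in for a call such as `PyUnicode_InternFromString` that returns NULL on failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct of interned strings; C strings stand in for PyObject *. */
typedef struct {
    const char *array;
    const char *dlpack;
    const char *dtype;
} demo_interned_str_struct;

static demo_interned_str_struct demo_interned_str;

/* Stand-in for an interning call: returns NULL to signal failure. */
static const char *
demo_intern(const char *s)
{
    return (s != NULL && s[0] != '\0') ? s : NULL;
}

/* One line per string instead of a four-line assign-and-check block.
 * The assert also catches a second initialization in debug builds. */
#define DEMO_INTERN_STRING(member, string)                  \
    do {                                                    \
        assert(demo_interned_str.member == NULL);           \
        demo_interned_str.member = demo_intern(string);     \
        if (demo_interned_str.member == NULL) {             \
            return -1;                                      \
        }                                                   \
    } while (0)

int
demo_intern_strings(void)
{
    DEMO_INTERN_STRING(array, "__array__");
    DEMO_INTERN_STRING(dlpack, "__dlpack__");
    DEMO_INTERN_STRING(dtype, "dtype");
    return 0;
}

const char *
demo_get_dlpack(void)
{
    return demo_interned_str.dlpack;
}
```

The early `return -1` inside the macro keeps the error path out of the caller, at the cost of a hidden control-flow exit, which is the usual trade-off with this style.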
 * struct {
 *     atomic_int initialized;
 *     PyObject *value;
 * }
Just out of curiosity, so assuming that the value is never NULL after initialization, grabbing the lock and double checking for value == NULL is not valid (i.e. there is no safe pattern to do it)?
Because if that worked, the above seems like unnecessary bloat.
You need to use at least one atomic load, otherwise it's possible that the compiler might reorder things like:
// original code
if (flag) {
    do_some_work();
    do_more_work();
}
// after optimization
if (flag) {
    do_some_work();
}
if (flag) {
    do_more_work();
}
because data races are UB. In this case it would just lead to a memory leak. For the argparse cache I think it could cause two threads to simultaneously run initialize_keywords, corrupting the cache.
Yeah, I suspected that much, was just wondering if the atomic_int is strictly necessary (or there is a way to do an atomic_load(value), I guess).
Ah no it's not necessary, I can do it with an atomic load instead.
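The pattern discussed in this thread, an atomic load of the cached pointer followed by a second check under a lock, might look roughly like the following in C11. This is a sketch under assumptions: it uses a pthread mutex and a plain struct instead of whatever lock and `PyObject *` the real code uses, and every `demo_` name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    int payload;
} demo_value;

static demo_value demo_storage;
/* NULL means "not initialized yet"; never NULL again after initialization. */
static _Atomic(demo_value *) demo_cached = NULL;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_init_calls = 0;  /* only touched while holding demo_lock */

/* Stand-in for the expensive one-time setup; called with demo_lock held. */
static demo_value *
demo_expensive_init(void)
{
    demo_init_calls++;
    demo_storage.payload = 42;
    return &demo_storage;
}

demo_value *
demo_get_cached(void)
{
    /* Fast path: the acquire load pairs with the release store below, so a
     * non-NULL pointer guarantees the pointed-to data is fully written. */
    demo_value *v = atomic_load_explicit(&demo_cached, memory_order_acquire);
    if (v != NULL) {
        return v;
    }
    pthread_mutex_lock(&demo_lock);
    /* Second check: another thread may have finished initialization while
     * this one was waiting for the lock. The mutex orders this load. */
    v = atomic_load_explicit(&demo_cached, memory_order_relaxed);
    if (v == NULL) {
        v = demo_expensive_init();
        atomic_store_explicit(&demo_cached, v, memory_order_release);
    }
    pthread_mutex_unlock(&demo_lock);
    assert(v != NULL);
    return v;
}

/* Number of times the initializer actually ran. */
int
demo_init_call_count(void)
{
    pthread_mutex_lock(&demo_lock);
    int n = demo_init_calls;
    pthread_mutex_unlock(&demo_lock);
    return n;
}
```

Using the pointer itself as the "initialized" signal is what removes the need for a separate `atomic_int` flag, provided the value can never legitimately be NULL.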
39daec6
to
41c7f43
Compare
It seems gcc generates code that seg faults if you try to write to a field of a const static struct via a non-const pointer, so unfortunately I don't think I can easily make the structs const. I could have two sets of structs (const structs that are only read from and are filled in at the end of initialization, and non-const structs filled in during initialization), but that didn't seem worth the additional complexity to me.
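For comparison, a different and simpler technique than the two-struct idea mentioned above is to keep a single writable struct private to one file and hand readers a pointer-to-const. The sketch below is not what this PR does; the names and fields are hypothetical, and it changes how callers access the data (through a function instead of a global).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct of C data that is fixed after initialization. */
typedef struct {
    int table_size;
    int default_alignment;
} demo_cdata;

/* The object itself is not const, so writing to it during initialization
 * is well defined. Only this file can name it directly. */
static demo_cdata demo_cdata_storage;

/* Fill the struct; intended to run once during initialization. */
int
demo_init_cdata(void)
{
    demo_cdata_storage.table_size = 256;
    demo_cdata_storage.default_alignment = 16;
    return 0;
}

/* Readers get a pointer-to-const, so an accidental write through it is a
 * compile-time error rather than undefined behavior at run time. */
const demo_cdata *
demo_get_cdata(void)
{
    assert(demo_cdata_storage.table_size != 0);  /* must be initialized */
    return &demo_cdata_storage;
}
```

The protection here is purely at compile time: the storage stays in writable memory, so it does not give the hardware-enforced read-only guarantee that a truly const static object would.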
I think this is ready to merge now. Since this will definitely conflict with the Also if there are worries about generating lots of conflicts to the 2.0 maintenance branch I'd be happy to hold off on merging this until we're a little closer to the 2.1 maintenance branch being created. I can continue to update this as I make more things thread-safe. |
Co-authored-by: Sebastian Berg <[email protected]>
41c7f43
to
3ae66b1
Compare
We discussed this at the community meeting and agreed to merge it now, so I'm pulling this in. Thanks all for the reviews! |
This reorganizes most of the mutable static globals in numpy into four structs exposed to the internal API via multiarraymodule.h:
- npy_interned_str for interned strings
- npy_static_pydata for immutable PyObjects that are initialized during module initialization
- npy_static_cdata for immutable C data that is initialized during module initialization
- npy_ma_thread_unsafe_state for state stored in a thread-unsafe manner
With the goal of refactoring the items in npy_ma_thread_unsafe_state to be thread-safe in followup PRs. See also the tracking issue for items that still need to be fixed.