POC/RFC: linalg: low-level nD support #22838
Conversation
Force-pushed from e826156 to 049195f
Thank you for this. I get stomach aches from looking at Arpack these days. I'll get back to this once I'm over my nausea. I can't say I can follow the OOP parts though. Seems like we have 10x boilerplate instead of 4x boilerplate now 😝 Let me finish that arpack thing up, then we can go back to our skirmish about C vs. ++. Two things we better do with
Thanks Ilhan. Would be great to detail your specific concerns about this, once you're safely back from your travels in the wondrous ARPACK-land of course :-). Meanwhile, the last commit:
I don't think I'm using any OOP, and the ++ features are at a minimum. In fact, it's pretty much all C (manual memory management). And I believe the boilerplate here is enough to very easily implement all LU-related functionality (det, lu, lu_{factor,solve}). Can do it either here or in a follow-up. One thing that still bugs me is error handling, gh-22476.
While at it, refactor a bit to make code more C-like:
- move npymath related helpers to their own header
- simplify lapack_trampolines
- refactor the inv_loop
ilayn
left a comment
I think I have to write a similar implementation to explain myself.
I feel it is not fair to criticize your work without putting in any effort myself, just typing terse sentences that I know will lead to a back and forth. So please take them lightly until I write something very arcane, so you would have a fair chance to dunk on the shortcomings of what I do 😃
Regardless of what we do in the end, this is really more performant and, in my opinion, cleaner than doing a bunch of get_lapack_funcs and losing lots of performance hidden away. And even in your version we only have 200ish lines, and the rest is going to be recycled for other funcs. So overall, I think this is worth the effort.
py3.extension_module('_batched_linalg',
If you don't do this as a static library (and a shared one for the BLAS/LAPACK dependency), other C code can't use it. That's why it is just a header file in common utils right now, for sqrtm and eventually for others.
Let's do it in a "jpeg loading" way: here's an extension module, self-contained. Once it has some more usage, we figure the common parts and make them shared (TBD if via a static library or a common header)
#include "_npymath.h"
using namespace _numpymath;
Why is this namespace _numpymath required? Can't we template using the regular NumPy dependencies, without our own extra boilerplate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole _npymath.h header exists mainly to cover for what numpy/npy_math.h lacks, and because I work here with npy_cdouble directly, without converting to C/C++ complex. This way there is no need for reinterpret_cast or anything.
Or is the question why it is in a namespace at all? Well, using namespace in a header is indeed not good practice, so I should probably remove the namespacing. Or, better, put everything into a namespace.
scipy/linalg/src/_batched_linalg.h
Outdated
// parroted from sqrtm
// XXX can probably be replaced by ?copy from BLAS
No, unfortunately; this is basically to transpose a matrix in smaller chunks that fit into L1 cache, to reduce cache misses.
At least in theory. I think we should unroll these to 8 for reals and 4 for complexes. I need a proper godbolt session to get a feeling for what compilers choose to do. I'll get to it eventually.
dcopy is BLAS' memcpy. They do different things.
My impression was that the incx, incy arguments are essentially strides. But if not, great, I'll remove the comment.
 * Copy each slice into a buffer, make the buffer F-ordered
 * for LAPACK.
 */
struct iter_data_t
I don't get your comment about not using OOP. This is pretty much a class with a single method. Instead of loop optimizations, you are now converting it into a Python-generator-like object, yielding a slice here, if I read the C++ correctly.
Err, no, it's exactly equivalent to the sqrtm looping. Even the names are the same :-).
The usage is (https://github.com/scipy/scipy/pull/22838/files#diff-4c915d9c644759eaa646244756e388af6a0d49f33d6546864b5cff09ce056431R264):
iter_data_t iter_data(array_object);  // grab shape/strides, compute the offset pointer
for (npy_intp idx = 0; idx < iter_data.outer_size; idx++) {
    iter_data.copy_slice(idx, buffer);  // copy slice `idx` into `buffer`
    ...
}
The only difference to the sqrtm version is that here I transpose into F order right away, instead of copy first, swap_cf immediately after.
If having get_buffer as a method of a struct bothers you, no problem, I'll rewrite it as a free function copy_slice(iter_data, idx, buffer). Will that cover your concerns?
scipy/linalg/src/_batched_linalg.h
Outdated
/*
 * Invert a 2D slice:
 * - Input slice is in `getrf_data.a`.
 * - The result is in `getri_data.a`.
This is a class instance here. I really don't see the complication helping us; a class with a single method is pretty much a function. Still missing the call to gecon, though.
/*
 * Hold the GESV related variables, handle allocation/deallocation.
 */
template<typename T>
I understand the temptation, but this is clearly a C++ class. We need to justify the need for a class first.
If it is always a single instance, then this is just unnecessary code, complicating things and holding a bunch of variables together. I can't reuse anything inside this class; it is just wasted memory. Say I already have a larger integer array and I want to use it with a gesv call. Now you are taking away that possibility, or making it 10x harder to achieve, because then I have to overwrite this with dependency injection or whatever the terminology is.
If we had 100 instances of this entity, like a computer game with 100 characters, then yes, by all means let's class'ify it; but for single instances, class usage is an abstraction that carries a burden of proof. Independent of C++, we also have such code in Python, equally difficult to keep track of. The most recent I'm working on is _arpack.py, with three big classes and single instances. It is really not a great experience to troubleshoot, I can tell you.
We just need three consecutive LAPACK calls for getrf/gecon/getri. Look at all the preparation code that goes into it. And I can't import it into C code if we have to.
Well, it's just a bag holder of LAPACK arguments, so instead of
getrf(&n, &n, a, &lda, ....)
you initialize the lda (once) and do getrf(getrf_data). To me, this has two advantages:
- cut down on having to finger-count the LAPACK arguments (trivial for getrf, fun when there are 11 or so);
- when implementing, say, LU, I don't need to copy-paste the LAPACK call and double-check the variables. It's already done in the constructor.
The memory argument I get, but frankly, I don't see how it's relevant here. First, how on earth are you going to end up with 100 simultaneously active LAPACK calls? This is a local variable, nothing more. Note that array allocations are separate.
Second, even if you end up with, say, 100 threads, each of which is running a LAPACK computation---so that you do have 100 getrf_data variables---you have also done 100 mallocs for the work arrays, and have 100 matrices to factorize. Surely you have bigger issues with memory/CPU switching than 100 copies of five int variables.
So it is all down to stylistic preferences, ISTM. From where I stand, it is not a deal-breaker, so how about either:
- rename the constructor into an init_getrf function, or
- ditch the structs and just call ?getrf(...) manually.
Would either of these address your concerns?
ev-br
left a comment
Agreed, let's try to converge on a version with the minimum viable amount of boilerplate :-), so that we can proceed chipping away at get_lapack_funcs.
There are a couple of questions below, to try to clarify what kind of reworks would work for you. We could hop on a higher-bandwidth conversation, too, if that's helpful.
For me, the main thing is that currently the main loop, inv_loop, is under 100 LOC, including memory allocations and comments.
To keep it in that ballpark, we do need some sort of templating, be it template<typename T> or Tempita or even distutils templates [1] if push comes to shove.
[1] not my first choice, mind you :-).
- a 2D singular matrix: raises
- a batched matrix with all slices being singular: raises
- a batched matrix with some slices singular and some not: fill the singular ones with nans
Okay, following up on #22838 (comment), 96c1831 ditches the
Talking to np.errstate is an excellent idea, by the way, if it is possible; it would also solve our dichotomy about #22476. I'll dig a bit into it.
It would be great to coordinate with #28782 so there aren't multiple ways of handling errors in stacks of matrices.
superseded by #22924, closing

Reference issue
towards #21466
supersedes and closes #21935
What does this implement/fix?
Add infrastructure for low-level batching in scipy.linalg.
This is an alternative to gh-21935, which copy-pasted the gufunc infrastructure from numpy. This PR, instead, does manual looping over the batch dimensions, with the iterator from sqrtm, cc #22406 (comment).
Similar to gh-21935, here I convert inv. Which by itself is not a very interesting function; it's just simple enough to be useful as a guinea pig for the infrastructure. The PR looks large, but PRs for additional functions will be much smaller.
Remaining TBDs and action items:
- overwrite_a not done yet; probably worth tackling in a follow-up.
Additional information
Quick-and-dirty performance measurements: basically, we are on par with numpy.linalg, which is to say 5-10x faster than the current scipy main for deep stacks of matrices with small core dimensions.