Conversation

@ev-br (Member) commented Apr 14, 2025

Reference issue

towards #21466
supersedes and closes #21935

What does this implement/fix?

Add infrastructure for low-level batching in scipy.linalg.

This is an alternative to gh-21935, which copy-pasted the gufunc infrastructure from numpy. This PR instead does manual looping over the batch dimensions, using the iterator from sqrtm; cc #22406 (comment).

Similar to gh-21935, here I convert inv, which by itself is not a very interesting function; it's just simple enough to be useful as a guinea pig for the infrastructure. The PR looks large, but PRs for additional functions will be much smaller.
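
(Conceptually, the batch loop amounts to the following Python sketch; the actual C++ iterator walks arbitrary strides instead of requiring the reshape, and calls LAPACK directly on each slice:)

import numpy as np

def batched_inv_sketch(a):
    n = a.shape[-1]
    flat = np.ascontiguousarray(a).reshape(-1, n, n)  # collapse the batch dimensions
    out = np.empty_like(flat)
    for i in range(flat.shape[0]):
        out[i] = np.linalg.inv(flat[i])               # stand-in for the per-slice LAPACK call
    return out.reshape(a.shape)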

Remaining TBDs and action items:

Additional information

Quick-and-dirty performance measurements: basically, we are on par with numpy.linalg, which is to say 5-10x faster than the current scipy main for deep stacks of matrices of small core dimension.

In [1]: from scipy.linalg import inv

In [2]: from scipy.linalg._basic import inv0

In [3]: import numpy as np

In [4]: n = 10

In [5]: a = np.ones((n//2, n//5, n//10, n, n), dtype=float) + 8*np.eye(n)

In [6]: %timeit np.linalg.inv(a)
26.4 μs ± 409 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit inv(a)
23.4 μs ± 187 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit inv0(a)
211 μs ± 834 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@ev-br ev-br added the enhancement and scipy.linalg labels Apr 14, 2025
@ev-br ev-br requested review from ilayn and larsoner as code owners April 14, 2025 20:08
@github-actions github-actions bot added the C/C++, Meson, and RFC labels Apr 14, 2025
@ev-br ev-br force-pushed the batched_inv2 branch 2 times, most recently from e826156 to 049195f on April 14, 2025 21:40
@lucascolley lucascolley changed the title POC/RFC: low-level nD support in scipy.linalg POC/RFC: linalg: low-level nD support Apr 15, 2025
@ilayn (Member) commented Apr 15, 2025

Thank you for this. I get stomach aches from looking at Arpack these days; I'll get back to this once I'm over my nausea. I can't say I can follow the OOP parts, though. Seems like we have 10x boilerplate instead of 4x boilerplate now 😝 Let me finish that arpack thing up, then we can go back to our skirmish about C vs. ++.

Two things we'd better do with inv are recycling the malloc'd buffer on every spin and injecting a gecon call between getrf and getri.

@ev-br (Member Author) commented Apr 17, 2025

Thanks Ilhan. It would be great to hear your specific concerns about this, once you're safely back from your travels in the wondrous ARPACK-land, of course :-).

Meanwhile, the last commit

  • adds overwrite_a=True in a backwards-compatible way;
  • refactors the code a bit to make it more C-like and hopefully easier to follow.

I don't think I'm using any OOP, and the ++ features are at a minimum. In fact, it's pretty much all C (manual memory management, goto done, etc.) with a few basic templates to cut down on the 4x duplication of algorithmic parts.

And I believe the boilerplate here is enough to very easily implement all LU-related functionality (det, lu, lu_{factor,solve}). Can do it either here or in a follow-up.
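
(For instance, det falls out of the same factorization; a quick sketch via the existing Python-level wrappers:)

import numpy as np
from scipy.linalg import lu_factor

a = np.random.default_rng(0).standard_normal((4, 4))
lu, piv = lu_factor(a)                                   # the same ?getrf call under the hood
sign = (-1.0) ** np.count_nonzero(piv != np.arange(4))   # parity of the row interchanges
det = sign * np.prod(np.diag(lu))                        # det(A) = det(P) * prod(diag(U))
assert np.isclose(det, np.linalg.det(a))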

One thing that still bugs me is error handling, gh-22476. numpy.linalg does something where its gufuncs talk to np.errstate. This is hidden from the user in numpy (https://github.com/numpy/numpy/blob/v2.2.0/numpy/linalg/_linalg.py#L607), so maybe the right thing for scipy.linalg is to use a sensible default (raise in 2D, fill with nans for ndim > 2) and make it talk to np.errstate, so that users can use the context manager. I'll take a look at how hard it is to make that a thing.
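
(For concreteness, a hypothetical sketch of that policy; _handle_singular and its arguments are made up for illustration and are not part of this PR:)

import numpy as np

def _handle_singular(result, singular_mask):
    # raise for 2-D input, NaN-fill singular slices for batched input,
    # and let np.errstate override the default
    if not singular_mask.any():
        return result
    if result.ndim == 2:
        raise np.linalg.LinAlgError("singular matrix")
    if np.geterr()["invalid"] == "raise":     # set via `with np.errstate(invalid="raise")`
        raise np.linalg.LinAlgError("singular matrix in batch")
    result[singular_mask] = np.nan            # the proposed default for ndim > 2
    return result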

@ilayn (Member) left a comment

I think I have to write a similar implementation to explain myself.

I feel it is not fair to criticize your work without putting in any effort myself, just typing terse sentences that I know will lead to a back-and-forth. So please take them lightly until I write something very arcane, so you have a fair chance to dunk on the shortcomings of what I do 😃

Regardless of what we do in the end, this is really more performant and, in my opinion, cleaner than doing a bunch of get_lapack_funcs calls and losing lots of performance in hidden places. And even in your version we only have 200-ish lines, and the rest is going to be recycled for other funcs. So overall, I think this is worth the effort.

)


py3.extension_module('_batched_linalg',
@ilayn (Member):

If you don't do this as a static library (and a shared one for the BLAS/LAPACK dependency), other C code can't use it. That's why it is just a header file in common utils right now, for sqrtm and eventually for others.

@ev-br (Member Author):

Let's do it in a "jpeg loading" way: here's an extension module, self-contained. Once it gets some more usage, we can figure out the common parts and make them shared (TBD whether via a static library or a common header).

#include "_npymath.h"


using namespace _numpymath;
@ilayn (Member):

Why is this _numpymath namespace required? Can't we template using the regular NumPy dependencies, without our own extra boilerplate?

@ev-br (Member Author):

The whole _npymath.h header exists mainly to cover for what numpy/npy_math.h lacks, and because I work here with npy_cdouble directly, without converting to C/C++ complex. This way there is no need for reinterpret_cast or anything.

Or is the question why it is in a namespace at all? Well, using namespace in a header is indeed not good practice, so I should probably remove the namespacing. Or, better, put everything into a namespace.



// parroted from sqrtm
// XXX can probably be replaced by ?copy from BLAS
@ilayn (Member):

No, unfortunately; this is basically to transpose a matrix in smaller chunks that fit into L1 cache, to reduce cache misses.

At least in theory. I think we should unroll these to 8 for reals and 4 for complexes. I need a proper godbolt session to get a feeling for what compilers are choosing to do. I'll get to it eventually.

dcopy is BLAS' memcpy. They do different things.
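
(For reference, a minimal Python sketch of the blocked transpose being described; the block size bs=8 is a hypothetical choice, and in the C++ the tiles are sized to fit L1 cache:)

import numpy as np

def swap_cf_sketch(a, bs=8):
    # copy a C-ordered matrix into an F-ordered one tile by tile,
    # so each bs-by-bs tile is read and written while it is still in cache
    n, m = a.shape
    out = np.empty((n, m), order="F")
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            out[i:i+bs, j:j+bs] = a[i:i+bs, j:j+bs]
    return out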

@ev-br (Member Author):

My impression was that the incx, incy arguments are essentially strides. But if not, great, I'll remove the comment.

* Copy each slice into a buffer, make the buffer F-ordered
* for LAPACK.
*/
struct iter_data_t
@ilayn (Member):

I don't get your comment about not using OOP. This is pretty much a class with a single method. Instead of loop optimizations, you are now converting it into something like a Python generator that yields a slice here, if I can read the C++ correctly.

@ev-br (Member Author) commented Apr 18, 2025:

Err, no, it's exactly equivalent to the sqrtm looping. Even the names are the same :-).
The usage is (https://github.com/scipy/scipy/pull/22838/files#diff-4c915d9c644759eaa646244756e388af6a0d49f33d6546864b5cff09ce056431R264):

iter_data_t iter_data(array_object);    // grab shape/strides, compute the offset pointer

for (npy_intp idx = 0; idx < iter_data.outer_size; idx++) {
    iter_data.copy_slice(idx, buffer);  // copy slice `idx` into `buffer`
    ...
}

The only difference from the sqrtm version is that here I transpose into F order right away, instead of copying first and calling swap_cf immediately after.

If having get_buffer as a method of a struct bothers you, no problem, I'll rewrite it as a free function copy_slice(iter_data, idx, buffer). Will that cover your concerns?

/*
* Invert a 2D slice:
* - Input slice is in `getrf_data.a`.
* - The result is in `getri_data.a`.
@ilayn (Member):

This is a class instance here. I really don't see the complication helping us; a class with a single method is pretty much a function. It's still missing the call to gecon, though.

/*
* Hold the GESV related variables, handle allocation/deallocation.
*/
template<typename T>
@ilayn (Member):

I understand the temptation, but this is clearly a C++ class. We first need to justify the need for a class among ourselves.

If it is always a single instance, then this is just unnecessary code that complicates things by holding a bunch of variables together. I can't reuse anything inside this class; it is just wasted memory. Say I already have a larger integer array and I want to use it in a gesv call. Now you are taking that possibility away, or making it 10x harder to achieve, because then I have to override this with dependency injection or whatever the terminology is.

If we had 100 instances of this entity, like a computer game with 100 characters, then yes, by all means let's class'ify it; but for single instances, class usage is an abstraction that carries the burden of proof. Independent of C++, we also have such code in Python, and it is equally difficult to keep track of. The most recent example I'm working on is _arpack.py, with three big classes and single instances. It is really not a great experience to troubleshoot, I can tell you.

We just need three consecutive LAPACK calls: getrf/gecon/getri. Look at all the preparation code that goes into it. And I can't import it into C code if we need to.
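
(For reference, the three-call sequence in question, sketched with the existing Python-level LAPACK wrappers; the C++ loop would do the equivalent per slice:)

import numpy as np
from scipy.linalg import lapack

a = np.array([[4.0, 2.0], [1.0, 3.0]], order="F")
anorm = np.linalg.norm(a, 1)              # 1-norm of the original A, needed by gecon
lu, piv, info = lapack.dgetrf(a)          # LU factorization
rcond, info = lapack.dgecon(lu, anorm)    # condition estimate from the factors
inv_a, info = lapack.dgetri(lu, piv)      # inverse from the LU factors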

@ev-br (Member Author):

Well, it's just a bag holder for the LAPACK arguments, so instead of

getrf(&n, &n, a,  &lda, ....)

you initialize lda (once) and do getrf(getrf_data). To me, this has two advantages:

  • it cuts down on having to finger-count the LAPACK arguments (trivial for getrf; fun when there are 11 or so);
  • when implementing, say, LU, I don't need to copy-paste the LAPACK call and double-check the variables; it's already done in the constructor.

The memory argument I get, but frankly, I don't see how it's relevant here. First, how on earth are you going to end up with 100 simultaneously active LAPACK calls? This is a local variable, nothing more. Note that array allocations are separate.
Second, even if you end up with, say, 100 threads, each running a LAPACK computation, so that you do have 100 getrf_data variables, you have also done 100 mallocs for the work arrays and have 100 matrices to factorize. Surely you have bigger issues with memory/CPU switching than 100 copies of five int variables.

So it all comes down to stylistic preferences, ISTM. From where I stand it is not a deal-breaker, so how about we either:

  • rename the constructor into an init_getrf function, or
  • ditch the structs and just call ?getrf(...) manually.

Would either of these address your concerns?

@ev-br (Member Author) left a comment

Agreed, let's try to converge on a version with the minimum viable amount of boilerplate :-), so that we can proceed with chipping away at get_lapack_funcs.

There are a couple of questions below, to try to clarify what kind of rework would work for you. We could hop on a higher-bandwidth conversation, too, if that's helpful.

For me, the main thing is that the main loop, inv_loop, is currently under 100 LOC, including memory allocations and comments.

To keep it in that ballpark, we do need some sort of templating, be it template<typename T>, Tempita, or even distutils templates [1] if push comes to shove.

[1] not my first choice, mind you :-).

ev-br added 4 commits April 18, 2025 15:15
- a 2D singular matrix: raises
- a batched matrix with all slices singular: raises
- a batched matrix with some slices singular and some not: fill the singular ones with nans
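
(A usage sketch of the semantics above, assuming the behavior these commits describe:)

import numpy as np
from scipy.linalg import inv

batch = np.stack([np.eye(2), np.zeros((2, 2))])  # one invertible slice, one singular
out = inv(batch)                                 # mixed batch: no exception
assert np.isnan(out[1]).all()                    # only the singular slice is NaN-filled
# inv(np.zeros((2, 2)))                          # a 2-D singular input still raises
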
@ev-br (Member Author) commented Apr 18, 2025

Okay, following up on #22838 (comment), 96c1831 ditches the getrf_data structs and inlines the LAPACK calls.

@ilayn (Member) commented Apr 19, 2025

Just sent you a link to the board currently showing this, so we can organize things a bit better.

[screenshot of the project board]

@ilayn (Member) commented Apr 19, 2025

Talking to np.errstate is an excellent idea, by the way, if it is possible; it would also solve our dichotomy about #22476. I'll dig into it a bit.

@j-bowhay (Member) commented Apr 20, 2025

It would be great to coordinate with #28782, so there aren't multiple ways of handling errors in stacks of matrices.

@ev-br (Member Author) commented Jun 10, 2025

Superseded by #22924; closing.

@ev-br ev-br closed this Jun 10, 2025