Conversation

@ev-br (Member) commented Apr 14, 2025

Reference issue

towards #21466
supersedes and closes #21935

What does this implement/fix?

Add infrastructure for low-level batching in scipy.linalg.

This is an alternative to gh-21935, which copy-pasted the gufunc infrastructure from numpy. This PR instead does manual looping over the batch dimensions, using the iterator from sqrtm; cc #22406 (comment).

Similar to gh-21935, here I convert inv, which by itself is not a very interesting function; it's just simple enough to be useful as a guinea pig for the infrastructure. The PR looks large, but PRs for additional functions will be much smaller.
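
(Conceptually, the batch loop amounts to the following Python sketch; the actual C++ iterator walks arbitrary strides instead of requiring the reshape, and calls LAPACK directly on each slice:)

import numpy as np

def batched_inv_sketch(a):
    n = a.shape[-1]
    flat = np.ascontiguousarray(a).reshape(-1, n, n)  # collapse the batch dimensions
    out = np.empty_like(flat)
    for i in range(flat.shape[0]):
        out[i] = np.linalg.inv(flat[i])               # stand-in for the per-slice LAPACK call
    return out.reshape(a.shape)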

Remaining TBDs and action items:

Additional information

Quick-and-dirty performance measurements: basically, we are on par with numpy.linalg, which is to say 5-10x faster than the current scipy main for deep stacks of matrices of small core dimension.

In [1]: from scipy.linalg import inv

In [2]: from scipy.linalg._basic import inv0

In [3]: import numpy as np

In [4]: n = 10

In [5]: a = np.ones((n//2, n//5, n//10, n, n), dtype=float) + 8*np.eye(n)

In [6]: %timeit np.linalg.inv(a)
26.4 μs ± 409 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit inv(a)
23.4 μs ± 187 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit inv0(a)
211 μs ± 834 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@ev-br ev-br added the enhancement and scipy.linalg labels Apr 14, 2025
@ev-br ev-br requested review from ilayn and larsoner as code owners April 14, 2025 20:08
@github-actions github-actions bot added the C/C++, Meson, and RFC labels Apr 14, 2025
@ev-br ev-br force-pushed the batched_inv2 branch 2 times, most recently from e826156 to 049195f on April 14, 2025 21:40
@lucascolley lucascolley changed the title POC/RFC: low-level nD support in scipy.linalg POC/RFC: linalg: low-level nD support Apr 15, 2025
@ilayn (Member) commented Apr 15, 2025

Thank you for this. I get stomach aches from looking at Arpack these days; I'll get back to this once I'm over my nausea. I can't say I can follow the OOP parts, though. Seems like we have 10x boilerplate instead of 4x boilerplate now 😝 Let me finish that arpack thing up, then we can go back to our skirmish about C vs. ++.

Two things we'd better do with inv are recycling the malloc'd buffer on every spin and injecting a gecon call between getrf and getri.

@ev-br (Member Author) commented Apr 17, 2025

Thanks Ilhan. It would be great to hear your specific concerns about this, once you're safely back from your travels in the wondrous ARPACK-land, of course :-).

Meanwhile, the last commit

  • adds overwrite_a=True in a backwards-compatible way;
  • refactors the code a bit to make it more C-like and hopefully easier to follow.

I don't think I'm using any OOP, and the ++ features are at a minimum. In fact, it's pretty much all C (manual memory management, goto done, etc.) with a few basic templates to cut down on the 4x duplication of algorithmic parts.

And I believe the boilerplate here is enough to very easily implement all LU-related functionality (det, lu, lu_{factor,solve}). Can do it either here or in a follow-up.
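
(For instance, det falls out of the same factorization; a quick sketch via the existing Python-level wrappers:)

import numpy as np
from scipy.linalg import lu_factor

a = np.random.default_rng(0).standard_normal((4, 4))
lu, piv = lu_factor(a)                                   # the same ?getrf call under the hood
sign = (-1.0) ** np.count_nonzero(piv != np.arange(4))   # parity of the row interchanges
det = sign * np.prod(np.diag(lu))                        # det(A) = det(P) * prod(diag(U))
assert np.isclose(det, np.linalg.det(a))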

One thing that still bugs me is error handling, gh-22476. numpy.linalg does something where its gufuncs talk to np.errstate. This is hidden from the user in numpy (https://github.com/numpy/numpy/blob/v2.2.0/numpy/linalg/_linalg.py#L607), so maybe the right thing for scipy.linalg is to use a sensible default (raise in 2D, fill with nans for ndim > 2) and make it talk to np.errstate, so that users can use the context manager. I'll take a look at how hard it is to make that a thing.
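
(For concreteness, a hypothetical sketch of that policy; _handle_singular and its arguments are made up for illustration and are not part of this PR:)

import numpy as np

def _handle_singular(result, singular_mask):
    # raise for 2-D input, NaN-fill singular slices for batched input,
    # and let np.errstate override the default
    if not singular_mask.any():
        return result
    if result.ndim == 2:
        raise np.linalg.LinAlgError("singular matrix")
    if np.geterr()["invalid"] == "raise":     # set via `with np.errstate(invalid="raise")`
        raise np.linalg.LinAlgError("singular matrix in batch")
    result[singular_mask] = np.nan            # the proposed default for ndim > 2
    return result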

@ilayn (Member) left a comment

I think I have to write a similar implementation to explain myself.

I feel it is not fair to criticize your work without putting in any effort myself, just typing terse sentences that I know will lead to a back-and-forth. So please take them lightly until I write something very arcane, so you have a fair chance to dunk on the shortcomings of what I do 😃

Regardless of what we do in the end, this is really more performant and, in my opinion, cleaner than doing a bunch of get_lapack_funcs calls and losing lots of performance in hidden places. And even in your version we only have 200-ish lines, and the rest is going to be recycled for other funcs. So overall, I think this is worth the effort.

)


py3.extension_module('_batched_linalg',
@ilayn (Member):

If you don't do this as a static library (and a shared one for the BLAS/LAPACK dependency), other C code can't use it. That's why it is just a header file in common utils right now, for sqrtm and eventually for others.

@ev-br (Member Author):

Let's do it in a "jpeg loading" way: here's an extension module, self-contained. Once it gets some more usage, we can figure out the common parts and make them shared (TBD whether via a static library or a common header).

#include "_npymath.h"


using namespace _numpymath;
@ilayn (Member):

Why is this _numpymath namespace required? Can't we template using the regular NumPy dependencies, without our own extra boilerplate?

@ev-br (Member Author):

The whole _npymath.h header exists mainly to cover for what numpy/npy_math.h lacks, and because I work here with npy_cdouble directly, without converting to C/C++ complex. This way there is no need for reinterpret_cast or anything.

Or is the question why it is in a namespace at all? Well, using namespace in a header is indeed not good practice, so I should probably remove the namespacing. Or, better, put everything into a namespace.



// parroted from sqrtm
// XXX can probably be replaced by ?copy from BLAS
@ilayn (Member):

No, unfortunately; this is basically to transpose a matrix in smaller chunks that fit into L1 cache, to reduce cache misses.

At least in theory. I think we should unroll these to 8 for reals and 4 for complexes. I need a proper godbolt session to get a feeling for what compilers are choosing to do. I'll get to it eventually.

dcopy is BLAS' memcpy. They do different things.
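
(For reference, a minimal Python sketch of the blocked transpose being described; the block size bs=8 is a hypothetical choice, and in the C++ the tiles are sized to fit L1 cache:)

import numpy as np

def swap_cf_sketch(a, bs=8):
    # copy a C-ordered matrix into an F-ordered one tile by tile,
    # so each bs-by-bs tile is read and written while it is still in cache
    n, m = a.shape
    out = np.empty((n, m), order="F")
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            out[i:i+bs, j:j+bs] = a[i:i+bs, j:j+bs]
    return out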

@ev-br (Member Author):

My impression was that the incx, incy arguments are essentially strides. But if not, great, I'll remove the comment.

* Copy each slice into a buffer, make the buffer F-ordered
* for LAPACK.
*/
struct iter_data_t
@ilayn (Member):

I don't get your comment about not using OOP. This is pretty much a class with a single method. Instead of loop optimizations, you are now converting it into something like a Python generator that yields a slice here, if I can read the C++ correctly.

@ev-br (Member Author) commented Apr 18, 2025:

Err, no, it's exactly equivalent to the sqrtm looping. Even the names are the same :-).
The usage is (https://github.com/scipy/scipy/pull/22838/files#diff-4c915d9c644759eaa646244756e388af6a0d49f33d6546864b5cff09ce056431R264):

iter_data_t iter_data(array_object);    // grab shape/strides, compute the offset pointer

for (npy_intp idx = 0; idx < iter_data.outer_size; idx++) {
    iter_data.copy_slice(idx, buffer);  // copy slice `idx` into `buffer`
    ...
}

The only difference from the sqrtm version is that here I transpose into F order right away, instead of copying first and calling swap_cf immediately after.

If having get_buffer as a method of a struct bothers you, no problem, I'll rewrite it as a free function copy_slice(iter_data, idx, buffer). Will that cover your concerns?

/*
* Invert a 2D slice:
* - Input slice is in `getrf_data.a`.
* - The result is in `getri_data.a`.
@ilayn (Member):

This is a class instance here. I really don't see the complication helping us; a class with a single method is pretty much a function. It's still missing the call to gecon, though.

/*
* Hold the GESV related variables, handle allocation/deallocation.
*/
template<typename T>
@ilayn (Member):

I understand the temptation, but this is clearly a C++ class. We first need to justify the need for a class among ourselves.

If it is always a single instance, then this is just unnecessary code that complicates things by holding a bunch of variables together. I can't reuse anything inside this class; it is just wasted memory. Say I already have a larger integer array and I want to use it in a gesv call. Now you are taking that possibility away, or making it 10x harder to achieve, because then I have to override this with dependency injection or whatever the terminology is.

If we had 100 instances of this entity, like a computer game with 100 characters, then yes, by all means let's class'ify it; but for single instances, class usage is an abstraction that carries the burden of proof. Independent of C++, we also have such code in Python, and it is equally difficult to keep track of. The most recent example I'm working on is _arpack.py, with three big classes and single instances. It is really not a great experience to troubleshoot, I can tell you.

We just need three consecutive LAPACK calls: getrf/gecon/getri. Look at all the preparation code that goes into it. And I can't import it into C code if we need to.
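
(For reference, the three-call sequence in question, sketched with the existing Python-level LAPACK wrappers; the C++ loop would do the equivalent per slice:)

import numpy as np
from scipy.linalg import lapack

a = np.array([[4.0, 2.0], [1.0, 3.0]], order="F")
anorm = np.linalg.norm(a, 1)              # 1-norm of the original A, needed by gecon
lu, piv, info = lapack.dgetrf(a)          # LU factorization
rcond, info = lapack.dgecon(lu, anorm)    # condition estimate from the factors
inv_a, info = lapack.dgetri(lu, piv)      # inverse from the LU factors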

@ev-br (Member Author):

Well, it's just a bag holder for the LAPACK arguments, so instead of

getrf(&n, &n, a,  &lda, ....)

you initialize lda (once) and do getrf(getrf_data). To me, this has two advantages:

  • it cuts down on having to finger-count the LAPACK arguments (trivial for getrf; fun when there are 11 or so);
  • when implementing, say, LU, I don't need to copy-paste the LAPACK call and double-check the variables; it's already done in the constructor.

The memory argument I get, but frankly, I don't see how it's relevant here. First, how on earth are you going to end up with 100 simultaneously active LAPACK calls? This is a local variable, nothing more. Note that array allocations are separate.
Second, even if you end up with, say, 100 threads, each running a LAPACK computation, so that you do have 100 getrf_data variables, you have also done 100 mallocs for the work arrays and have 100 matrices to factorize. Surely you have bigger issues with memory/CPU switching than 100 copies of five int variables.

So it all comes down to stylistic preferences, ISTM. From where I stand it is not a deal-breaker, so how about we either:

  • rename the constructor into an init_getrf function, or
  • ditch the structs and just call ?getrf(...) manually.

Would either of these address your concerns?

@ev-br (Member Author) left a comment

Agreed, let's try to converge on a version with the minimum viable amount of boilerplate :-), so that we can proceed with chipping away at get_lapack_funcs.

There are a couple of questions below, to try to clarify what kind of rework would work for you. We could hop on a higher-bandwidth conversation, too, if that's helpful.

For me, the main thing is that the main loop, inv_loop, is currently under 100 LOC, including memory allocations and comments.

To keep it in that ballpark, we do need some sort of templating, be it template<typename T>, Tempita, or even distutils templates [1] if push comes to shove.

[1] not my first choice, mind you :-).

ev-br added 4 commits April 18, 2025 15:15
- a 2D singular matrix: raises
- a batched matrix with all slices singular: raises
- a batched matrix with some slices singular and some not: fill the singular ones with nans
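
(A usage sketch of the semantics above, assuming the behavior these commits describe:)

import numpy as np
from scipy.linalg import inv

batch = np.stack([np.eye(2), np.zeros((2, 2))])  # one invertible slice, one singular
out = inv(batch)                                 # mixed batch: no exception
assert np.isnan(out[1]).all()                    # only the singular slice is NaN-filled
# inv(np.zeros((2, 2)))                          # a 2-D singular input still raises
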
@ev-br (Member Author) commented Apr 18, 2025

Okay, following up on #22838 (comment), 96c1831 ditches the getrf_data structs and inlines the LAPACK calls.

@ilayn (Member) commented Apr 19, 2025

Just sent you a link to the board currently showing this, so we can organize things a bit better.

[screenshot of the project board]

@ilayn (Member) commented Apr 19, 2025

Talking to np.errstate is an excellent idea, by the way, if it is possible; it would also solve our dichotomy about #22476. I'll dig into it a bit.

@j-bowhay (Member) commented Apr 20, 2025

It would be great to coordinate with #28782, so there aren't multiple ways of handling errors in stacks of matrices.

@ev-br (Member Author) commented Jun 10, 2025

Superseded by #22924; closing.

@ev-br ev-br closed this Jun 10, 2025