BUG: Matrix multiplication in windows produces large numerical errors inconsistently in numpy >= 2 #27036
Comments
It does seem to be a problem with OpenBLAS and threading on Windows. NumPy 1.26.4 uses
I wonder how our CI passed
I don't think this is related to the kernels chosen. The test still fails when I set
What does 0.3.27.44 correspond to, git-wise? If post-0.3.27, any chance this could be related to the NaN handling changes in SCAL that I merged on June 29?
For what it's worth, sunpy has this test because 4 years ago we encountered intermittent failures with OpenBLAS and multithreading on macOS (see sunpy/sunpy#4290 (comment)). Those failures went away eventually with a subsequent release of OpenBLAS.
Importantly, there were (and maybe still are) bad interactions when a threaded program used OpenBLAS that was configured to use threads as well. That is why setting OpenBLAS to use a single thread "fixed" issues.
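For reference, pinning OpenBLAS to a single thread can be done either with an environment variable or with the third-party `threadpoolctl` package. The snippet below is a minimal sketch of both approaches; the array shapes are illustrative, not the sunpy test:

```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # must be set before numpy is first imported

import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

a = np.random.default_rng(0).random((500_000, 2))
b = np.random.default_rng(1).random((2, 2))

# Alternatively, limit the BLAS thread pool only around the suspect call.
with threadpool_limits(limits=1, user_api="blas"):
    c = a @ b
```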
Windows gets a lot less attention, and there were a few PRs merged between 0.3.23 (whatever build your -293 is) and 0.3.27-whatever that (supposedly) improved multithreading on that platform. Notably OpenMathLib/OpenBLAS#4359 - not trying to blame anyone, just that this would be the most serious change (merged in December, so debuting in 0.3.26).
Is it possible to test this against an earlier (pre-0.3.26) OpenBLAS version to isolate whether this could be related to the Windows threading changes? @martin-frbg do we have existing validation tests that would/should show this issue? Does this show up for any NTHREADS > 1? Do we have a C repro for this yet? If not, I will attempt to translate the Python to C and see if I can reproduce/debug it.
I have just managed to set up a virtual environment for reproducing this (with a pip-installed numpy - I did not get conda to install anything other than mkl or netlib even if it claimed to have switched to openblas). It definitely happens with as few as 4 cores (though not on every run), but I have not yet gotten it to fail with just two. (Unfortunately I haven't figured out how to downgrade the openblas version; pip tells me there are no packages that match the version requirement.)
3 threads are definitely enough to reproduce it (on the fourth try); I still haven't seen a failure with 2 in about 20 runs so far.
The openblas implementation is in
Note I am close to uploading a newer scipy-openblas64 based on latest OpenBLAS HEAD. You can download a pre-release from https://anaconda.org/scientific-python-nightly-wheels/scipy-openblas64/files. You want scipy_openblas64-0.3.27.341.0-py3-none-win_amd64.whl.
@mattip thank you very much. No failures seen so far with 4 threads and the 0.3.24.95.1 version of the library copied over the original libscipy_openblas64_-fb1711452d4d8cee9f276fd1449ee5c7.dll in numpy.libs.
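For anyone else trying the same DLL swap, something along these lines works. This is a hedged sketch: the rebuilt-DLL path is hypothetical, and the hashed file name in `numpy.libs` differs between installations:

```python
import glob
import os
import shutil
import numpy

# numpy.libs usually sits next to the numpy package inside site-packages.
libs_dir = os.path.join(os.path.dirname(numpy.__file__), os.pardir, "numpy.libs")
rebuilt = r"C:\path\to\rebuilt\libscipy_openblas64_.dll"   # hypothetical path

for target in glob.glob(os.path.join(libs_dir, "libscipy_openblas64_-*.dll")):
    shutil.copyfile(rebuilt, target)   # overwrite the bundled OpenBLAS DLL
```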
@Siddharth-Latthe-07 is that a chatgpt-generated comment? Your comment does not add to the body of knowledge around the issue. We already know it is related to OpenBLAS, and that it is related to some thread-related problem inside OpenBLAS that only manifests itself on Windows, so your analysis of the problem is not helpful. Converting the wheels we ship to another BLAS/LAPACK implementation is not something we are considering at this time.
I've seen this happen with dtype = float32 as well. @martin-frbg, in validating that this doesn't happen with earlier versions, have you by chance been able to grab a log file with server tracing enabled?
How can I help provide a version with that enabled?
@mattip if OpenBLAS is built with SMP_DEBUG defined then the thread server will send debug logging to stderr. That may help hint at what is going on.
@mseminatore have not gotten around to using my own build in that context yet, sorry. Will try tonight.
@martin-frbg in looking over the Windows server code I note the merge of OpenMathLib/OpenBLAS#4577 in April, which says it changes how threads are allocated and introduces new functions. Not trying to shift the focus elsewhere, but it would be helpful to rule out that merge as a potential source of these issues. @mattip is there any sense for when this issue started? IOW, is the repro test case something that is run regularly or only when an issue is identified? As Martin pointed out, the most significant code change to Windows threading was my merge (4359) on Dec. 5th 2023, so that is a prime candidate and I will look over the code, but I am wondering if that long a latency in discovering the issue is expected.
#4577 was merged only after the 0.3.27 release, so probably/hopefully not in that build yet.
(conda issue is conda-forge/openblas-feedstock#160 (comment))
Thank you @martin-frbg that is helpful to know! I will continue to review the code and attempt to debug the repro in pursuit of a working theory and a potential fix. Please let me know should you decide that you would prefer reverting the code, given that this is not the first issue we've encountered and that this is having downstream impact. Regardless of whether we keep or revert the Windows code, I hope we can expand test coverage to allow us to catch similar issues earlier in the future.
I'll see if I can come up with anything tonight. We are already close to a month past my tentative release date for 0.3.28 due to various stability problems (including with my health), so I guess it would not be a problem to add another week in case it becomes obvious where this goes wrong.
Not sure if this output from a build with SMP_DEBUG is going to help (stopped the program after it reported the first couple of mismatches).
@mattip do you happen to know the magnitude of the errors? Is it just outside of epsilon or large? Trying to roughly characterize this as either lost work or error accumulation to guide the investigation.
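One way to answer that question from the Python side is a small helper like the one below. It is a hypothetical sketch, not code from this thread, and the tolerance is an arbitrary choice:

```python
import numpy as np

def report_mismatch(result, expected, rtol=1e-9):
    """Print how many entries disagree and how big the disagreement is."""
    diff = np.abs(result - expected)
    bad = diff > rtol * np.maximum(np.abs(expected), 1.0)   # crude mixed tolerance
    if bad.any():
        rel = diff[bad] / np.maximum(np.abs(expected[bad]), np.finfo(result.dtype).tiny)
        print(f"{bad.sum()} bad entries, max abs err {diff.max():.3e}, "
              f"max rel err {rel.max():.3e}")
    else:
        print("no mismatches")
```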
No, I am still seeing problems even with this.
If I shrink the size of
Sorry, I do not understand - does that mean your 0.3.27.44 happened to be OpenBLAS-0.3.27 exactly, or that its exact commit hash past 0.3.27 can no longer be determined? (de465ffd would appear to be shortly before 4577 landed, so I think that other thread-handling change should not have any bearing if the sunpy folks used "0.3.27.44")
Good catch, but that would be extra weird as mseminatore's work was entirely threading infrastructure not concerned with data sizes, and I do not recall any PR in the relevant timeframe that could have replaced a "blasint" with a standard int (rendering your regular INTERFACE64=1 build option useless). I had started bisecting, but ran into weird stability issues with the VM and/or compiler, so I need to redo everything...
Yes
The last issue was a bug in queue management where work submitted re-entrantly was lost. Given the way the join logic works, I am reviewing the log file that Martin captured to see if there are any hints there.
OK, looking over the log file, I am seeing the following pattern for work submission:
This suggests that each call to … I would expect the matrix to be converted into a number of sub-tiles/sub-tasks in a work_queue (such that the sub-tiles of A, B, and C all fit in the L2$), but I am not familiar with the OpenBLAS L3 work model. My math says that an MxN matrix of order (2, 500000) of float64 would be 8 MB. Therefore A, B, and C would be 8 MB, 32 B (2 * 2 * 8), and 8 MB, thus 16 MB in total, larger than any L2$ I've seen. As an example, my own BLAS code decomposes this GEMM call into 1954 sub-tasks in the work_queue. But I am dealing with a family medical situation, it's late, and I'm tired, so I may be thinking about it wrongly.
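The arithmetic quoted above checks out when written out explicitly; the shape assignment below is one plausible reading of the GEMM in question, not taken from the sunpy test:

```python
m, n, k = 500_000, 2, 2      # C(m, n) = A(m, k) @ B(k, n); assumed shapes
itemsize = 8                 # float64
a_bytes = m * k * itemsize   # 8_000_000 bytes -> ~8 MB
b_bytes = k * n * itemsize   # 32 bytes
c_bytes = m * n * itemsize   # 8_000_000 bytes -> ~8 MB
print((a_bytes + b_bytes + c_bytes) / 1e6, "MB total")   # ~16 MB, far larger than any L2 cache
```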
Saw this too, maybe it is the extreme imbalance of matrix dimensions (and OpenBLAS failing to split on N).
Sorry, so far I have only managed to trash my build environment - the ancient m2w64-gcc (5) from conda (that I am sure worked on another system on the weekend) suddenly throws internal compiler errors when building OpenBLAS.
Now I am getting myself confused. Rerunning the test with a 0.3.26 OpenBLAS fails, both when I use the DLL from the 0.3.26.4 wheel and also when I rebuild OpenBLAS v0.3.26. I am using this to rebuild OpenBLAS with the compilers from the rtools package. This installs gcc/gfortran 10.3.
Then I can copy the DLL over the one used by numpy. Unfortunately, the prefixing does not work correctly pre-0.3.26, so I will continue bisecting backwards from 0.3.26 using NumPy 1.26.4 (which does not have any
I have 0.3.26 building for NumPy 1.26.4 (using only
Interesting, that would mean I am in the clear with my GEMM thread count change from PR 4585. Thanks for the chocolatey pointer; looks like I'm up and running with that again now.
I have a good commit with v0.3.25. Now I can bisect.
Hmm. Good is at
Thanks - my bisect is four steps from completion; I am also building a current HEAD with 4359 manually reverted now, "just in case".
Indeed, HEAD with the changes from 4359 reverted appears to work fine - whether that means the problem is in 4359 or only uncovered by it remains to be seen. (Bisect still needs two steps but appears to be gravitating towards the same result.)
Thank you @mattip and @martin-frbg for the bisect work. I am sorry that I can't more actively participate (for personal reasons) but will continue inspecting the code to see if I can identify a cause and fix.
I think it is equally possible that the actual problem lies somewhere in driver/level3/level3_thread.c (though it has its own CriticalSection), but I lack the Windows developer experience to assess this.
I haven't looked very closely at that code, but I will take a look.
Any further thoughts about OpenMathLib/OpenBLAS#4835 and whether NumPy should build its OpenBLAS with that PR as a patch to move the NumPy 2.1 release forward?
Not sure if that counts as thoughts, but my current intention is to merge 4835 in order to get the 0.3.28 release out in the next couple of days (probably not today though).
I backported the changes in OpenBLAS's 4835 into the 0.3.27.44.4 version of scipy-openblas, and used those wheels in #27140. I also added the test to prevent this from regressing again. Closing. Please reopen or open a new issue if I missed something.
Describe the issue:
We discovered a simple test failing on Windows only: sunpy/sunpy#7754
I have investigated the symptoms of this in some detail but have not tried to find the cause. In short, it seems like matrix multiplications with largish numbers fail inconsistently on Windows, and when they do fail they will fail a few times in a row before returning to normal. I would suspect OpenBLAS, but the failure never occurs with numpy < 2.
Reproduce the code example:
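The original code block did not survive here; the following is a hedged reconstruction based on details mentioned elsewhere in this thread (an (N, 2) float64 product with a 2x2 matrix, N around 500,000, intermittent failures). The shapes, values, and transform matrix are illustrative assumptions, not the sunpy test itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
vectors = rng.uniform(-1000.0, 1000.0, size=(n, 2))
transform = np.array([[-10.0, 1.0],
                      [1.0, 0.0]])            # arbitrary 2x2 matrix for illustration

# Reference computed element-wise, so it never goes through the threaded BLAS path.
expected = vectors[:, [0]] * transform[0] + vectors[:, [1]] * transform[1]

for attempt in range(50):                     # the failure is intermittent, so repeat
    result = vectors @ transform              # dispatched to OpenBLAS dgemm
    if not np.allclose(result, expected):
        bad = np.unique(np.nonzero(~np.isclose(result, expected))[0])
        print(f"attempt {attempt}: {bad.size} bad rows, first at index {bad[0]}")
        break
else:
    print("no mismatch seen in 50 attempts")
```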
Error message:
This bug is absolutely wild by the way:
With numpy 2.0.1:
With numpy 2.0.0:
With 1.26.4 it passes.
So it is a numpy 2 issue and not a 2.0.1 issue... it is also random:
Inserting a return and a break on the first time the matmul fails:
In a given set of multiplications the errors always have the same values.
As you can see, they also have the weird symmetry where the first column is -10× the second column, which I guess makes some twisted sense because that's what the calculation is.
The errors are also always in the same region, i.e. with indices > 400k or so.

They also always fail for multiple indices in a row; e.g. you might have 20 or 50 or whatever failed calculations in a row, but then it starts working again.
e.g.
or
You can see in one of the plots that there has been the same failure for two separate sequences of indices.
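To make the "sequences of indices" pattern explicit, a helper like the following (hypothetical, not from the report) can group the failing rows into consecutive runs:

```python
import numpy as np

def failing_runs(result, expected):
    """Return (first, last) index pairs for each consecutive run of failing rows."""
    bad_rows = np.unique(np.nonzero(~np.isclose(result, expected))[0])
    if bad_rows.size == 0:
        return []
    breaks = np.where(np.diff(bad_rows) > 1)[0] + 1   # split where the index gap exceeds 1
    return [(int(run[0]), int(run[-1])) for run in np.split(bad_rows, breaks)]
```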
Python and NumPy Versions:
Windows 10; I cannot recreate this at all on Linux.
This also failed on our CI on a windows-latest runner.
https://github.com/sunpy/sunpy/actions/runs/10072095800/job/27843572893#step:10:5160
Runtime Environment:
Context for the issue:
Any matrix multiplication result could go wrong at any time on Windows.