
Multiplication of a matrix with its transpose causes segfault for "large" matrices #19685


Open
dsaoijhlgfdngfsd opened this issue Aug 17, 2021 · 9 comments

Comments

@dsaoijhlgfdngfsd

Multiplication of a matrix with its transpose causes segfault for "large" matrices. Multiplication works for "small" matrices and also works if transpose is copied (np.copy()). Specific examples of large/small matrices below.

I do have a sufficient amount of memory (>350 GB) on the machine, and I observe the same behavior on an EC2 instance (both Ubuntu and Amazon Linux) as well as in a Docker container.

Reproducing code example:

This fails:

import numpy as np
matrix = np.random.rand(80_000, 3072)
out = matrix.dot(matrix.T)

This also fails:

import numpy as np
matrix = np.random.rand(80_000, 3072)
out = matrix @ matrix.T

All these work:

import numpy as np
matrix = np.random.rand(75_000, 3072)
out = matrix @ matrix.T

import numpy as np
matrix = np.random.rand(75_000, 3072)
out = matrix.dot(matrix.T)

import numpy as np
matrix = np.random.rand(80_000, 3072)
out = matrix @ np.copy(matrix.T)

Error message:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc6f108cd49 in dgemm_oncopy_SKYLAKEX ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
[Current thread is 1 (Thread 0x7fc6f2a92f00 (LWP 676))]
(gdb) where
#0  0x00007fc6f108cd49 in dgemm_oncopy_SKYLAKEX ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
#1  0x00007fc6f01a0734 in inner_thread ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
#2  0x00007fc6f02c06f5 in exec_blas ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
#3  0x00007fc6f01a10f3 in dsyrk_thread_LT ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
#4  0x00007fc6f00b08ab in cblas_dsyrk ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
#5  0x00007fc6f21e45cd in DOUBLE_matmul_matrixmatrix.constprop.6 ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
#6  0x00007fc6f21e86fb in DOUBLE_matmul ()
   from /opt/bitnami/python/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
#7  0x00007fc6f21f7b49 in PyUFunc_GeneralizedFunctionInternal ()

NumPy/Python version information:

1.21.2 3.8.10 (default, Jun  2 2021, 10:49:15)
[GCC 9.4.0]
@dsaoijhlgfdngfsd
Author

dsaoijhlgfdngfsd commented Aug 18, 2021

I don't really understand your question. CPU caches are of course orders of magnitude smaller. However, I fail to understand your logic here, since multiplication between two different matrices does not segfault, and in that case the input size is twice as large while the output size is the same.

Or am I missing something?

@mattip
Member

mattip commented Aug 18, 2021

The last one is puzzling. We do have an optimization to detect a @ a.T (same memory pointer, switched shapes) and use the *SYRK routines instead of the *GEMM routines, but I see from your stack trace that you are hitting a *GEMM routine anyway. That does not explain why it segfaults, but I do wonder why it does not go via the *SYRK routines.
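(Editor's note: the detection described above keys off the two operands sharing the same underlying buffer with swapped strides. A minimal NumPy-only sketch of that condition, and of why np.copy(matrix.T) sidesteps it:)

```python
import numpy as np

a = np.random.rand(4, 3)
t = a.T  # a view: same data buffer, swapped shape/strides

# Both conditions hold for a plain transpose view,
# which is what lets NumPy pick the *SYRK path.
assert np.shares_memory(a, t)
assert t.base is a

# np.copy materializes a fresh buffer, so the detection fails
# and the multiplication goes through *GEMM instead.
b = np.copy(a.T)
assert not np.shares_memory(a, b)
```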

@dsaoijhlgfdngfsd
Author

If you look at #4 in the stack trace, there is:

#4  0x00007fc6f00b08ab in cblas_dsyrk ()

So I think it does go through the *SYRK routine.

If you have any suggestions/tips how to debug that further I can certainly give it a try.

@mattip
Member

mattip commented Aug 18, 2021

Thanks. It seems the segfaults are only in the SYRK codepath, which the last variant avoids since it copies the data. I am not familiar with the OpenBLAS internals. Maybe @martin-frbg has a thought? Did you try using fewer threads with OPENBLAS_NUM_THREADS=1 (which may be significantly slower)?

@dsaoijhlgfdngfsd
Author

dsaoijhlgfdngfsd commented Aug 18, 2021

Oh yeah, I did and clearly forgot to mention that. Yes, OPENBLAS_NUM_THREADS=1 fixes the issue.
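(Editor's note: a minimal sketch of that workaround. The environment variable must be set before NumPy is imported, since OpenBLAS reads it at load time; the shape here is shrunk for illustration.)

```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must precede the numpy import

import numpy as np

matrix = np.random.rand(2_000, 128)  # small shape for illustration
out = matrix @ matrix.T

assert out.shape == (2_000, 2_000)
assert np.allclose(out, out.T)  # the product is symmetric
```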

@martin-frbg

@mattip My bet is that BUFFERSIZE is too small for the multithreaded GEMM routines. Remember, you build with a custom value to reduce the memory footprint of "everyday small matrix cases" (#18141), but that is bound to go wrong for "large" matrices.

@mattip
Member

mattip commented Aug 18, 2021

Ahh thanks, makes sense. Is there a way we can trap this without segfaulting and warn the user to use fewer threads?

@MarcT0K

MarcT0K commented Jan 26, 2024

Hi, my comment is simply to highlight that this issue is still present. Moreover, the temporary workarounds also still work.

It is not a critical issue, but I wanted to leave a comment in case someone wants to work on it in the near future.

@martin-frbg

I guess it would be possible to calculate columns times rows times bytes per variable type and check whether that fits into the compile-time BUFFERSIZE of the bundled OpenBLAS (which, IIRC, is built with a smaller-than-default buffer allocation to reduce the memory footprint). OTOH, it would probably be tedious to do this on every call, and one couldn't be sure the user is actually using the pip/conda-bundled library. On the OpenBLAS side, I'm not sure either that it would be easy to detect an impending overflow, let alone do anything about it, without losing performance in "normal" operation.
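(Editor's note: a back-of-the-envelope sketch of the size arithmetic behind that check. BUFFER_LIMIT is purely an illustrative placeholder; the real compile-time BUFFERSIZE of the bundled OpenBLAS is not queryable at runtime, and the actual per-thread buffer accounting is more involved than raw matrix size.)

```python
import numpy as np

BUFFER_LIMIT = 2**28  # bytes; illustrative placeholder, NOT the real BUFFERSIZE

def fits_buffer(shape, dtype=np.float64):
    """Naive check: does rows * cols * itemsize fit under the limit?"""
    rows, cols = shape
    return rows * cols * np.dtype(dtype).itemsize <= BUFFER_LIMIT

# The failing case from this issue: 80_000 x 3072 float64 is ~1.97 GB
assert not fits_buffer((80_000, 3072))
# Note the working 75_000 x 3072 case is also far above this limit,
# which shows raw matrix size alone cannot reproduce the real threshold.
print(80_000 * 3072 * 8)  # total input bytes for the failing case
```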
