Random segfaults on distance_transform_edt with Intel 12 Alder Lake (E-Core enabled) #22744
Comments
Could you please run:
and report the results?
It seems that this problem is unrelated to scikit-learn, because it happens in a scipy call and I cannot see any sklearn-related call in the tracebacks:
but it could be related to OpenBLAS or numpy, which do have CPU-specific optimizations. Hence the output of threadpoolctl might help clarify which OpenBLAS version you are using.
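The snippet being asked for is not shown in this scrape; a minimal sketch of such a threadpoolctl query (assuming threadpoolctl is installed alongside numpy and scipy) would be:

```python
# Hypothetical reconstruction of the requested diagnostic: list the BLAS /
# threadpool libraries loaded by numpy and scipy, including the OpenBLAS
# version each one bundles.
import json

import numpy  # noqa: F401  (imported so its BLAS library gets loaded)
import scipy.ndimage  # noqa: F401
from threadpoolctl import threadpool_info

# Each entry reports the shared library path, internal API (e.g. "openblas"),
# version string, and thread count.
print(json.dumps(threadpool_info(), indent=2))
```

The same information is available from the command line via `python -m threadpoolctl -i numpy scipy.ndimage`.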
Here is the output with that information. Maybe I should upgrade OpenBLAS to 0.3.20?
Having numpy and scipy use different versions of OpenBLAS has not been an issue so far, so I don't think it has anything to do with the problem you're facing. The backtrace and the snippet don't involve scikit-learn at all. This issue should probably be posted on the scipy issue tracker. I'm closing it here.
I have confirmed that a DRAM issue caused this problem, not OpenBLAS or MKL. It does seem, though, that training with MKL runs slower than with OpenBLAS on the Intel 12th-gen CPU and Linux 5.17 (3.5 vs. 4.0 iterations per second). For some compatibility-related reason, once I enabled the X.M.P. built-in memory overclocking function in the BIOS, random kernel panics and CRC errors occurred even without loading external programs or unzipping anything, especially when running Windows Server. After I turned off X.M.P. overclocking and replaced the DRAM, everything works well.
Hi everyone
I am currently training an image segmentation network with PyTorch, evaluated with a Hausdorff distance loss. To calculate the Hausdorff loss, I am using distance_transform_edt from scipy.ndimage
together with morphology.py provided by scikit-learn.
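The training script itself is not shown; a minimal sketch of the kind of distance-transform-based Hausdorff-style loss described above (the function name and the exact formulation are illustrative assumptions, not the author's code):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def hausdorff_like_loss(pred, target):
    """Average symmetric surface-distance penalty between two boolean masks.

    One common distance-transform formulation; hypothetical, not the
    author's actual loss.
    """
    # distance_transform_edt gives each nonzero pixel its distance to the
    # nearest zero pixel, so invert the masks to get, for every pixel, the
    # distance to the nearest foreground pixel of that mask.
    dt_pred = distance_transform_edt(~pred)
    dt_target = distance_transform_edt(~target)
    # Penalize predicted pixels by their distance to the target surface,
    # and vice versa; normalize by the total foreground size.
    return (dt_target[pred].sum() + dt_pred[target].sum()) / (
        pred.sum() + target.sum()
    )


# Two overlapping 3x3 squares as a toy example.
pred = np.zeros((8, 8), dtype=bool)
target = np.zeros((8, 8), dtype=bool)
pred[2:5, 2:5] = True
target[3:6, 3:6] = True
print(hausdorff_like_loss(pred, target))
```

The segfault reported below occurs inside the distance_transform_edt call itself, not in the surrounding Python logic.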
My training script works well on other platforms, including a PC (Intel i5-9400F, RTX 2060, Windows 10), Server 1 (AMD Ryzen 7 2700X, RTX A4000, Fedora 33), and Server 2 (AMD Ryzen 7 3700X, RTX A4000, Fedora 34).
However, when I try to train my model on my Linux PC (Intel i7-12700K, RTX 3080 Ti, Manjaro, Linux kernel 5.16), my computer crashes repeatedly. Mostly the training just terminates without an exception and shows a segmentation fault related to threading.py, queues.py, or morphology.py (details below), and sometimes it even causes a Linux kernel panic, so I have to force-reboot to regain control.
It occurs randomly. I have tried installing Ubuntu 20.04 LTS with Linux kernel 5.15 or 5.16, the PyTorch nightly build, scikit-learn-intelex, and the latest numpy with MKL, but it still happens.
No evidence of over-temperature or GPU memory overflow can be observed using the sensors and nvidia-smi commands.
I have noticed that some of the architecture was changed on Intel 12th-gen Alder Lake to improve performance, so that seems suspicious.
Any idea what I can do?
Thanks in advance.
Steps/Code to Reproduce
Expected Results
I get the correct numerical result.
Actual Results
dmesg:
gdb bt:
Versions