BUG: Very slow execution of linalg.inv on intel ultra 7 #27174
Comments
Just checking - have you given the runtime environment from the relevant (slowest) install?
Thank you for the prompt check. Sorry, no, it was initially for 1.26.3 (Gohlke); I have now corrected it for the slowest, 1.26.4 (pip direct from PyPI).
FWIW, I get ~18 ms with numpy+MKL 1.26.4 and 2.0.0 on i7-14700K.
@cgohlke thank you for this info (and your effort in preparing the compiled packages :)). Matrix inversion is an essential tool for our group, and most processors we tested take 20-60 ms, the only exception being the Intel Ultra 7, which performs unacceptably slowly (3 orders of magnitude slower).
I'm guessing that the number of threads may play a role here. IIRC the defaults for MKL-linked wheels are different from those built against OpenBLAS on PyPI.
It may make sense to try with NumPy 2, since it ships a newer OpenBLAS. If upgrading OpenBLAS doesn't solve it, we should ping the OpenBLAS devs (or open an issue there).
Something like this will allow you to play with the number of threads in play, and also print out relevant runtime information:
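The script attached to this comment did not survive the page extraction; the following is a hypothetical reconstruction. The file name issue27174.py comes from a later comment, and threadpoolctl (`pip install threadpoolctl`) is an assumed extra dependency used to cap the BLAS thread count at runtime:

```python
import time
import numpy as np

try:
    # threadpoolctl can cap the BLAS thread pool without reinstalling numpy
    from threadpoolctl import threadpool_limits, threadpool_info
except ImportError:
    threadpool_limits = threadpool_info = None


def time_inv(size=1000, reps=5, n_threads=None):
    """Average time (seconds) for np.linalg.inv on a size x size matrix."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((size, size))

    def run():
        t0 = time.perf_counter()
        for _ in range(reps):
            np.linalg.inv(a)
        return (time.perf_counter() - t0) / reps

    if threadpool_limits is not None and n_threads is not None:
        # Limit only the BLAS pool (OpenBLAS/MKL), not other thread pools
        with threadpool_limits(limits=n_threads, user_api="blas"):
            return run()
    return run()


if __name__ == "__main__":
    if threadpool_info is not None:
        # Shows which BLAS is loaded, its version, and the default thread count
        print(threadpool_info())
    for nt in (1, 2, 4, 8, 16, 22):
        print(f"n={nt}: {time_inv(n_threads=nt):.2f}s")
```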
Save it as issue27174.py and run it. Please report back what you find out.
Thank you all for the prompt response. This is indeed a threading issue. Please see details of the analysis below, from running issue27174.py.
numpy 1.26.4 - single core: {'architecture': 'Haswell', ...}; multiple cores: n=1: 0.29s ...
numpy 2.0.1 - single core: {'architecture': 'Haswell', ...}; multiple cores: n=1: 0.27s ...
Additional comments: the time explodes after 16 threads. If I use 22 and look at the resource monitor, I see 6 CPUs are still "parked" and no load is applied to them. The description of the processor is 16 cores, 22 logical processors.
OK, so 6 are performance cores (with hyper-threading, so x2 threads), 8 are "efficiency" cores, and the last 2 are "low power efficiency" cores (https://ark.intel.com/content/www/us/en/ark/products/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz.html). But I'm not sure how that explains why performance takes a nosedive after 16 threads... It seems to me someone with knowledge of this architecture needs to teach OpenBLAS exactly what the right guess for the number of threads is (unless there is even more complexity to it)...
Interesting to see this also happening with the old 0.3.23... I wonder if the numpy 2.0.1 used already has the latest fix for 0.3.27, or if we could be mixing/substituting problems there... but neither version would have dedicated code for handling the peculiar situation of having three tiers of cores, if that has anything to do with it. I bought an i5-125H a few weeks ago but have not run Windows on it...
IIRC, 2.0.1 would be in a state where it has the weird Windows threading bug? Anyway, the 2.1rc just landed; checking whether it contains a newer fix would be the thing to do. The quick thing would be to try with the 2.1rc.
Thank you all. I have tried to install it. This is what I get when I try to install a specific version:
Sorry, my bad. The release was tagged, but not uploaded to PyPI yet. You could try the nightlies (you may need extra pip flags).
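The exact command was lost in extraction; for reference, NumPy dev wheels are published on the scientific-python nightly index, which can be used along these lines (a sketch; the flags needed may vary with your pip version):

```shell
# Install the latest NumPy nightly wheel from the scientific-python index.
# --pre allows pre-release versions; --upgrade replaces an existing install.
pip install --upgrade --pre \
  -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple numpy
```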
So it probably is not directly a problem with the threading code per se, more a problem with oversubscription and waiting for the slow cores. I wonder if the same CPU running Linux (in WSL, even?) exhibits this problem.
NumPy 2.0.1, like NumPy 2.0.0, uses exactly v0.3.27, with no additional patches or fixes. This means it is susceptible to the bug in #27036, so it is probably not advisable to use in production.
I don't think those wheels are available anywhere just yet, maybe in a few days. In any case, the Windows threading performance should be similar to the 1.26.4 ones, since those wheels use a patched OpenBLAS to revert the problem in #27036.
Thank you @seberg. I was able to install numpy 2.2.0.dev0. The results are:
n=1: 0.24s ...
---update---
n=1: 0.23s ...
OK, the last result looks a lot less frustrating (thanks for running all these tests!). So part of the problem appears to have been firmware or driver issues, and probably OpenBLAS "only" needs to learn to ignore the ultra-efficient cores (which are probably good enough for keeping Windows background tasks alive, but not much else).
BTW from Intel developer community forum there is no straightforward call to identify the LP cores, but they can be recognized by being unable to access the L3 cache |
EDIT: nevermind, I see the same on my MTL too :/ |
BTW, this doesn't seem to be related to E-core/P-core or anything to do with Windows. I see this behavior on my Linux SkylakeX too, which has all uniform cores. The threshold seems to be the number of cores available in the system. It slows down quite a bit when you set the thread count above that.
@r-devulap architecture is reported as "Haswell" as that matches its capabilities as far as OpenBLAS is concerned - AVX2 but no AVX512. |
[EDIT]: these numbers are with turbo disabled. It doesn't look like it on my MTL, which has 2 P cores, 8 efficiency cores and 2 low power efficiency cores (14 threads total); the performance seems just fine until the thread count exceeds 14.
hah, @martin-frbg you are right. Looks like that behavior was with turbo disabled, which makes all the cores run at the same low frequency. Once I enable turbo, the P cores and E cores run at a much higher frequency, the low power E cores can't catch up, and performance nose-dives.
The problem is not observable under Linux - it is only when the actual thread count available gets exceeded that the benchmark time goes up drastically. My original tests were with an older system setup (Linux kernel 6.4), where it went from about 0.1 to 12 seconds. With the latest available (6.11rc3), which contains specific tuning for Meteor Lake provided by Intel engineers, it jumps from 0.1 to 3.5 seconds (at 18/19 threads on ultra5-125H).
Describe the issue:
I am testing this code:
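The code block from the original report was lost in extraction; a minimal stand-in sketch, assuming a 1000x1000 random matrix (the actual size used was not preserved, though ~30-50 ms fits that ballpark on a desktop CPU):

```python
import time
import numpy as np

# Size is an assumption; the reporter's original snippet was not shown
a = np.random.rand(1000, 1000)

t0 = time.perf_counter()
np.linalg.inv(a)
print(f"inv took {(time.perf_counter() - t0) * 1e3:.1f} ms")
```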
on an Intel ultra 7 processor (155H).
This should take approx 30-50ms, but it takes 12000ms on numpy 1.26.4 or 2.0 (using pip install numpy). This is really crazy slow!!!
If I installed version 1.26.3 with MKL support from Gohlke:
https://github.com/cgohlke/numpy-mkl-wheels/releases/tag/v2024.1.3
the timing was approx 2000ms, which is still 100x more than normal.
If I installed Anaconda (which I do not prefer), I reached 30-50ms.
Am I doing anything wrong?
Reproduce the code example:
Error message:
No response
Python and NumPy Versions:
1.26.3
3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)]
Runtime Environment:
[{'numpy_version': '1.26.4',
'python': '3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 '
'64 bit (AMD64)]',
'uname': uname_result(system='Windows', node='Spectre-JS', release='11', version='10.0.22631', machine='AMD64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll',
'internal_api': 'openblas',
'num_threads': 22,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'}]
Context for the issue:
Matrix inversion is essential to me. Such slow inversion makes my code unusable or I have to go for Anaconda.