BUG: Very slow execution of linalg.inv on intel ultra 7 #27174

Open
jankoslavic opened this issue Aug 10, 2024 · 23 comments

@jankoslavic

jankoslavic commented Aug 10, 2024

Describe the issue:

I am testing this code:

import numpy as np
a = np.random.rand(1000, 1000)
ainv = np.linalg.inv(a)

on an Intel ultra 7 processor (155H).
This should take approximately 30-50 ms, but it takes about 12000 ms on numpy 1.26.4 or 2.0 (installed with pip install numpy), roughly three orders of magnitude slower than expected.

If I install version 1.26.3 with MKL support from Gohlke's wheels:
https://github.com/cgohlke/numpy-mkl-wheels/releases/tag/v2024.1.3
the timing is approximately 2000 ms, which is still roughly 50x slower than normal.

If I install Anaconda (which I would prefer to avoid), I get 30-50 ms.

Am I doing anything wrong?

Reproduce the code example:

import numpy as np
a = np.random.rand(1000, 1000)
ainv = np.linalg.inv(a)
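For reference, a minimal way to time the call outside a notebook (a sketch; the exact measurement method is not stated above) is:

import time
import numpy as np

a = np.random.rand(1000, 1000)
start = time.perf_counter()
ainv = np.linalg.inv(a)
print(f"inv of a 1000x1000 matrix took {(time.perf_counter() - start) * 1000:.1f} ms")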

Error message:

No response

Python and NumPy Versions:

NumPy: 1.26.3
Python: 3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)]

Runtime Environment:

[{'numpy_version': '1.26.4',
'python': '3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 '
'64 bit (AMD64)]',
'uname': uname_result(system='Windows', node='Spectre-JS', release='11', version='10.0.22631', machine='AMD64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll',
'internal_api': 'openblas',
'num_threads': 22,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'}]

Context for the issue:

Matrix inversion is essential to me. Such slow inversion makes my code unusable, or forces me to switch to Anaconda.

@matthew-brett
Contributor

Just checking - have you given the runtime environment from the relevant (slowest) install?

@jankoslavic
Author

Thank you for the prompt check. Sorry, no: it was initially for 1.26.3 (Gohlke); I have now corrected it to show the slowest install, 1.26.4 (pip, directly from PyPI).

@cgohlke
Contributor

cgohlke commented Aug 10, 2024

FWIW, I get ~18 ms with numpy+MKL 1.26.4 and 2.0.0 on an i7-14700K.

@jankoslavic
Author

jankoslavic commented Aug 11, 2024

@cgohlke thank you for this info (and for your effort in preparing the compiled packages :)).

Matrix inversion is an essential tool for our group. Most processors we have tested are in the 20-60 ms range, the only exception being the Intel Ultra 7, which performs unacceptably slowly (3 orders of magnitude slower).

@andyfaff
Member

I'm guessing that the number of threads may play a role here. IIRC the defaults for MKL-linked wheels are different from those of the OpenBLAS wheels on PyPI.
See https://pypi.org/project/threadpoolctl/ and https://numpy.org/devdocs/reference/global_state.html#number-of-threads-used-for-linear-algebra
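For example, capping the BLAS pool at runtime with threadpoolctl looks roughly like this (a sketch; the limit of 6 is just an illustrative value):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(1000, 1000)

# Limit only the BLAS thread pool, and only for the duration of the block.
with threadpool_limits(limits=6, user_api="blas"):
    ainv = np.linalg.inv(a)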

@seberg
Member

seberg commented Aug 11, 2024

It may make sense to try NumPy 2, since it ships a newer OpenBLAS. Also try OPENBLAS_NUM_THREADS=6, since this does sound like it could be a threading issue.
Glancing over the specs, that CPU has low-power efficiency cores, but OpenBLAS chose 22 threads, which includes all cores; if it also pins the threads to the cores, that might explain something. (I am not sure how those cores behave, but 22 threads seems too many for that CPU.)
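The environment variable has to be set before NumPy (and hence OpenBLAS) is first imported; a sketch of doing that from Python rather than from the shell:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "6"  # must happen before the numpy import

import numpy as np

a = np.random.rand(1000, 1000)
ainv = np.linalg.inv(a)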

If upgrading openblas doesn't solve it, we should ping the openblas devs (or open an issue there).

@mattip
Member

mattip commented Aug 11, 2024

Something like this will let you experiment with the number of threads in use, and also print out relevant runtime information:

# issue27174.py
import numpy as np
import threadpoolctl
import timeit
import pprint
import argparse


def test_inv(threads=-1):
    # threadpoolctl expects a positive limit or None (no limit), so map the
    # -1 default to None.
    limits = threads if threads > 0 else None
    with threadpoolctl.threadpool_limits(limits=limits, user_api='blas'):
        info = threadpoolctl.threadpool_info()[0]
        # Time four inversions of a 1000x1000 matrix.
        t = timeit.Timer("np.linalg.inv(a)",
                         setup="import numpy as np; a = np.random.rand(1000, 1000)",
                         )
        out = t.timeit(number=4)
        info['benchmark time'] = out
        pprint.pprint(info)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--threads",
                        help="number of threads to use",
                        default=-1, type=int)
    args = parser.parse_args()
    test_inv(args.threads)

Run this as python issue27174.py -t <n>, where n is the number of threads to use. You may want to play with n, trying 1 for single-threaded performance and then going up to 22. On my machine, using n>4 did not get me any benefit. You can try this with NumPy from PyPI (which uses OpenBLAS) as well as from cgohlke (which uses MKL). The conda-forge builds let you choose your BLAS implementation. I think what @andyfaff was suggesting is to limit the number of threads you use, since there is a price to using too many.

Please report back what you find out.

@jankoslavic
Author

jankoslavic commented Aug 11, 2024

Thank you all for the prompt response. This is indeed a threading issue. Please see details of the analysis below.

Running issue27174.py

numpy 1.26.4

Single core:

{'architecture': 'Haswell',
'benchmark time': 0.2935412999941036,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'}

Multiple cores:

n=1: 0.29s
n=2: 0.22s
n=3: 0.20s
n=4: 0.30s
n=6: 0.33s
n=10: 0.51s
n=14: 0.58s
n=18: 9.24s
n=21: 15s
n=22: 33s <<<<!!!>>>

numpy 2.0.1

Single core:

{'architecture': 'Haswell',
'benchmark time': 0.27034840000851545,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libscipy_openblas64_-fb1711452d4d8cee9f276fd1449ee5c7.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}

Multiple cores:

n=1: 0.27s
n=2: 0.25s
n=3: 0.21s
n=4: 0.23s
n=6: 0.18s
n=10: 0.27s
n=14: 0.31s
n=18: 10.8s
n=21: 22.4s
n=22: 23s <<<<!!!>>>

Additional comments

The time explodes after 16 threads. If I use 22 threads and look at the Resource Monitor, I see that 6 CPUs are still "parked" and no load is applied to them. The processor is described as having 16 cores and 22 logical processors.
See also: https://www.techpowerup.com/cpu-specs/core-ultra-7-155h.c3307#:~:text=The%20Intel%20Core%20Ultra%207,a%20total%20of%2022%20threads.
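A quick way to see what the machine reports for those counts (psutil is an extra dependency here, used only for the physical-core number) is:

import os
import psutil  # third-party: pip install psutil

print("logical processors:", os.cpu_count())               # 22 on this machine
print("physical cores:", psutil.cpu_count(logical=False))  # 16 on this machine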

@seberg
Member

seberg commented Aug 11, 2024

OK, so 6 are performance cores (with hyper-threading, so 2 threads each), 8 are "efficient" cores, and the last 2 are "low power efficiency" cores (https://ark.intel.com/content/www/us/en/ark/products/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz.html). But I am not sure how that explains why performance takes a nosedive after 16 threads...

It seems to me someone with knowledge of this architecture needs to teach OpenBLAS what the right guess for the number of threads is (unless there is even more complexity to it)...
Ping @r-devulap since it might interest you; also ping @martin-frbg, since we seem to be in OpenBLAS territory now.

@martin-frbg

Interesting to see this also happening with the old 0.3.23... I wonder if the numpy-2.0.1 used already has the latest fix for 0.3.27, or if we could be mixing/substituting problems there. Neither version, though, would have dedicated code for handling the peculiar situation of having three tiers of cores, if that has anything to do with it. I bought an i5-125H a few weeks ago but have not run Windows on it...

@seberg
Member

seberg commented Aug 11, 2024

IIRC, 2.0.1 would be in a state where it has the weird Windows threading bug? Anyway, the 2.1rc just landed; whether there has been a newer fix would be the thing to check. The quick thing would be to try pip install --pre numpy, which should give you 2.1.

@jankoslavic
Author

Thank you all.

I have tried pip install --pre numpy, but I cannot find the wheels.

This is what I get when I try to install a specific version:

... 2.0.0b1, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1) ERROR: No matching distribution found for numpy==2.1rc

@seberg
Member

seberg commented Aug 11, 2024

Sorry, my bad. The release was tagged, but not uploaded to PyPI yet. You could try the nightlies:

pip install -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple numpy

(you may need --pre there as well). That pulls in the nightlies, but they are roughly equivalent to 2.1rc.

@mattip
Member

mattip commented Aug 11, 2024

Interesting to see this also happening with the old 0.3.23

So it probably is not directly a problem with the threading code per se, but more a problem with oversubscription and waiting for the slow cores. I wonder if the same CPU running Linux (even in WSL?) exhibits this problem.

I wonder if the numpy-2.0.1 used already has the latest fix for 0.3.27, or ...

NumPy 2.0.1, like NumPy 2.0.0, uses exactly v0.3.27, with no additional patches or fixes. This means it is susceptible to the bug in #27036, so it is probably not advisable to use it in production.

the 2.1rc just landed

I don't think those wheels are available anywhere just yet, maybe in a few days. In any case, the Windows threading performance should be similar to that of the 1.26.4 wheels, since they use a patched OpenBLAS that reverts the problem in #27036.
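To double-check which BLAS build a given install actually loads (the same information the benchmark script prints), something like this works:

import pprint
import threadpoolctl

import numpy as np  # importing numpy loads its bundled BLAS

# Lists the loaded BLAS libraries with their path, version, and thread count.
pprint.pprint(threadpoolctl.threadpool_info())

np.show_config() prints the build-time BLAS information as well.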

@jankoslavic
Author

jankoslavic commented Aug 11, 2024

Thank you @seberg. I was able to install numpy 2.2.0.dev0.

The results are:
{'architecture': 'Haswell',
'benchmark time': 0.24210169998696074,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libscipy_openblas64_-bd796bfac570d29e6056bd021fef32bd.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}

n=1: 0.24s
n=2: 0.16s
n=4: 0.19s
n=6: 0.39s
n=12: 0.67s
n=13: 0.75s
n=18: 9.2s
n=22: 22s <<<<!!!>>>

--- update ---
With numpy 2.2.0.dev0, and after updating all Intel drivers, it now explodes only at 22 threads.
The results are:
{'architecture': 'Haswell',
'benchmark time': 0.23483800000030897,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libscipy_openblas64_-bd796bfac570d29e6056bd021fef32bd.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}

n=1: 0.23s
n=2: 0.14s
n=4: 0.12s
n=6: 0.15s
n=12: 0.57s
n=13: 0.65s
n=18: 0.83s
n=19: 0.80s
n=22: 16s <<<<!!!>>>

@martin-frbg

OK, the last result looks a lot less frustrating (thanks for running all these tests!). So part of the problem appears to have been firmware or driver issues, and probably OpenBLAS "only" needs to learn to ignore the ultra-efficient cores (which are probably good enough for keeping Windows background tasks alive, but not much else).
My new i5-125H has the two low-power cores too (4P, 8E, 2LP), so I should be able to work something out (even if the Linux scheduler has appeared to be unaffected so far; I have not yet thrown any numpy code at it).

@martin-frbg

BTW, according to the Intel developer community forum, there is no straightforward call to identify the LP cores, but they can be recognized by being unable to access the L3 cache:
https://community.intel.com/t5/Processors/Detecting-LP-E-Cores-on-Meteor-Lake-in-software/td-p/1577956 (in practice it may be sufficient for OpenBLAS to subtract 2 from the maximum thread count after finding itself on a Meteor Lake CPU)
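Until OpenBLAS does that itself, a user-level stopgap along the same lines might look like this (a sketch; it simply applies the "leave two threads free" heuristic from above via threadpoolctl instead of touching OpenBLAS):

import os
import numpy as np
from threadpoolctl import threadpool_limits

# Keep the two low-power E-cores out of the BLAS pool by capping the
# thread count at two below the logical processor count.
limit = max(1, (os.cpu_count() or 1) - 2)

a = np.random.rand(1000, 1000)
with threadpool_limits(limits=limit, user_api="blas"):
    ainv = np.linalg.inv(a)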

@r-devulap
Member

r-devulap commented Aug 15, 2024

@jankoslavic I am a little confused. Your NumPy output says {'architecture': 'Haswell', but you mention that you have a Meteor Lake processor? Is NumPy reporting the architecture incorrectly?

EDIT: nevermind, I see the same on my MTL too :/

@r-devulap
Member

r-devulap commented Aug 15, 2024

BTW, this doesn't seem to be related to E-cores/P-cores or anything to do with Windows. I see this behavior on my Linux Skylake-X machine too, which has all uniform cores. The threshold seems to be the number of cores available in the system: it slows down quite a bit when you set threadpoolctl.threadpool_limits to one more than the number of cores available on your processor.
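The cliff is easy to demonstrate by bracketing the core count, reusing the same threadpoolctl/timeit approach as issue27174.py (a sketch):

import os
import timeit
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(1000, 1000)
ncpu = os.cpu_count()

# Time four inversions at the core count and at one past it.
for n in (ncpu, ncpu + 1):
    with threadpool_limits(limits=n, user_api="blas"):
        t = timeit.timeit(lambda: np.linalg.inv(a), number=4)
    print(f"{n} threads: {t:.2f} s")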

@martin-frbg

@r-devulap the architecture is reported as "Haswell" because that matches its capabilities as far as OpenBLAS is concerned: AVX2 but no AVX512.
I'm not sure what threadpoolctl.threadpool_limits translates to once numpy ends up calling OpenBLAS, but claiming that your system can run more threads in parallel than it physically provisions for cannot be a good idea: you're likely to have everything waiting for that odd surplus thread to catch up some (most?) of the time.
The Meteor Lake hardware only aggravates this by advertising full-featured thread capabilities on the two low-power E-cores while they're apparently barely able to cope with BLAS workloads.

@r-devulap
Member

r-devulap commented Aug 15, 2024

The Meteor Lake hardware only aggravates this by advertising full-featured thread capabilities on the two low-power E-cores while they're apparently barely able to cope with BLAS workloads.

[EDIT]: these numbers are with turbo disabled.

It doesn't look that way on my MTL, which has 2 P-cores, 8 efficient cores and 2 low-power efficient cores (14 threads in total). The performance seems just fine until the thread count exceeds 14:

'num_threads': 1,  'benchmark time': 0.5835756147280335,
'num_threads': 2,  'benchmark time': 0.3678867472335696,
'num_threads': 3,  'benchmark time': 0.4594470327720046,
'num_threads': 4,  'benchmark time': 0.3895368203520775,
'num_threads': 5,  'benchmark time': 0.3098673168569803,
'num_threads': 6,  'benchmark time': 0.30408395640552044,
'num_threads': 7,  'benchmark time': 0.259448042139411,
'num_threads': 8,  'benchmark time': 0.24070970434695482,
'num_threads': 9,  'benchmark time': 0.23088357970118523,
'num_threads': 10, 'benchmark time': 0.2374553494155407,
'num_threads': 11, 'benchmark time': 0.27123847883194685,
'num_threads': 12, 'benchmark time': 0.2701326012611389,
'num_threads': 13, 'benchmark time': 0.22367281932383776,
'num_threads': 14, 'benchmark time': 0.21274131909012794,
'num_threads': 15, 'benchmark time': 15.423967558890581,
'num_threads': 16, 'benchmark time': 25.803074752911925,

@r-devulap
Member

Hah, @martin-frbg, you are right. It looks like that behavior was with turbo disabled, which makes all the cores run at the same low frequency. Once I enable turbo, the P-cores and E-cores run at a much higher frequency, the low-power E-cores can't catch up, and the performance nosedives.

@martin-frbg

The problem is not observable under Linux: it is only when the actually available thread count gets exceeded that the benchmark time goes up drastically. My original tests were with an older system setup (Linux kernel 6.4), where it went from about 0.1 to 12 seconds. With the latest available (6.11rc3), which contains specific tuning for Meteor Lake provided by Intel engineers, it jumps from 0.1 to 3.5 seconds (at 18/19 threads on an Ultra 5 125H).
