BUG: Very slow execution of linalg.inv on intel ultra 7 #27174

Open
jankoslavic opened this issue Aug 10, 2024 · 23 comments

@jankoslavic

jankoslavic commented Aug 10, 2024

Describe the issue:

I am testing this code:

import numpy as np
a = np.random.rand(1000, 1000)
ainv = np.linalg.inv(a)

on an Intel ultra 7 processor (155H).
This should take approximately 30-50 ms, but it takes about 12000 ms on numpy 1.26.4 or 2.0 (installed with pip install numpy), roughly three orders of magnitude slower than expected.

If I install version 1.26.3 with MKL support from Gohlke's wheels:
https://github.com/cgohlke/numpy-mkl-wheels/releases/tag/v2024.1.3
the timing is approximately 2000 ms, which is still roughly 50x slower than normal.

If I install Anaconda (which I would prefer to avoid), I get 30-50 ms.

Am I doing anything wrong?

Reproduce the code example:

import numpy as np
a = np.random.rand(1000, 1000)
ainv = np.linalg.inv(a)
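For reference, a minimal way to time the call outside a notebook (a sketch; the exact measurement method is not stated above) is:

import time
import numpy as np

a = np.random.rand(1000, 1000)
start = time.perf_counter()
ainv = np.linalg.inv(a)
print(f"inv of a 1000x1000 matrix took {(time.perf_counter() - start) * 1000:.1f} ms")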

Error message:

No response

Python and NumPy Versions:

NumPy: 1.26.3
Python: 3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 64 bit (AMD64)]

Runtime Environment:

[{'numpy_version': '1.26.4',
'python': '3.12.5 (tags/v3.12.5:ff3bc82, Aug 6 2024, 20:45:27) [MSC v.1940 '
'64 bit (AMD64)]',
'uname': uname_result(system='Windows', node='Spectre-JS', release='11', version='10.0.22631', machine='AMD64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll',
'internal_api': 'openblas',
'num_threads': 22,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'}]

Context for the issue:

Matrix inversion is essential to me. Such slow inversion makes my code unusable, or forces me to switch to Anaconda.

@matthew-brett
Contributor

Just checking - have you given the runtime environment from the relevant (slowest) install?

@jankoslavic
Author

Thank you for the prompt check. Sorry, no: it was initially for 1.26.3 (Gohlke); I have now corrected it to show the slowest install, 1.26.4 (pip, directly from PyPI).

@cgohlke
Contributor

cgohlke commented Aug 10, 2024

FWIW, I get ~18 ms with numpy+MKL 1.26.4 and 2.0.0 on an i7-14700K.

@jankoslavic
Author

jankoslavic commented Aug 11, 2024

@cgohlke thank you for this info (and for your effort in preparing the compiled packages :)).

Matrix inversion is an essential tool for our group. Most processors we have tested are in the 20-60 ms range, the only exception being the Intel Ultra 7, which performs unacceptably slowly (3 orders of magnitude slower).

@andyfaff
Member

I'm guessing that the number of threads may play a role here. IIRC the defaults for MKL-linked wheels are different from those of the OpenBLAS wheels on PyPI.
See https://pypi.org/project/threadpoolctl/ and https://numpy.org/devdocs/reference/global_state.html#number-of-threads-used-for-linear-algebra
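For example, capping the BLAS pool at runtime with threadpoolctl looks roughly like this (a sketch; the limit of 6 is just an illustrative value):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(1000, 1000)

# Limit only the BLAS thread pool, and only for the duration of the block.
with threadpool_limits(limits=6, user_api="blas"):
    ainv = np.linalg.inv(a)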

@seberg
Member

seberg commented Aug 11, 2024

It may make sense to try NumPy 2, since it ships a newer OpenBLAS. Also try OPENBLAS_NUM_THREADS=6, since this does sound like it could be a threading issue.
Glancing over the specs, that CPU has low-power efficiency cores, but OpenBLAS chose 22 threads, which includes all cores; if it also pins the threads to the cores, that might explain something. (I am not sure how those cores behave, but 22 threads seems too many for that CPU.)
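The environment variable has to be set before NumPy (and hence OpenBLAS) is first imported; a sketch of doing that from Python rather than from the shell:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "6"  # must happen before the numpy import

import numpy as np

a = np.random.rand(1000, 1000)
ainv = np.linalg.inv(a)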

If upgrading openblas doesn't solve it, we should ping the openblas devs (or open an issue there).

@mattip
Member

mattip commented Aug 11, 2024

Something like this will let you experiment with the number of threads in use, and also print out relevant runtime information:

# issue27174.py
import numpy as np
import threadpoolctl
import timeit
import pprint
import argparse


def test_inv(threads=-1):
    # threadpoolctl expects a positive limit or None (no limit), so map the
    # -1 default to None.
    limits = threads if threads > 0 else None
    with threadpoolctl.threadpool_limits(limits=limits, user_api='blas'):
        info = threadpoolctl.threadpool_info()[0]
        # Time four inversions of a 1000x1000 matrix.
        t = timeit.Timer("np.linalg.inv(a)",
                         setup="import numpy as np; a = np.random.rand(1000, 1000)",
                         )
        out = t.timeit(number=4)
        info['benchmark time'] = out
        pprint.pprint(info)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--threads",
                        help="number of threads to use",
                        default=-1, type=int)
    args = parser.parse_args()
    test_inv(args.threads)

Run this as python issue27174.py -t <n>, where n is the number of threads to use. You may want to play with n, trying 1 for single-threaded performance and then going up to 22. On my machine, using n>4 did not get me any benefit. You can try this with NumPy from PyPI (which uses OpenBLAS) as well as from cgohlke (which uses MKL). The conda-forge builds let you choose your BLAS implementation. I think what @andyfaff was suggesting is to limit the number of threads you use, since there is a price to using too many.

Please report back what you find out.

@jankoslavic
Author

jankoslavic commented Aug 11, 2024

Thank you all for the prompt response. This is indeed a threading issue. Please see details of the analysis below.

Running issue27174.py

numpy 1.26.4

Single core:

{'architecture': 'Haswell',
'benchmark time': 0.2935412999941036,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'}

Multiple cores:

n=1: 0.29s
n=2: 0.22s
n=3: 0.20s
n=4: 0.30s
n=6: 0.33s
n=10: 0.51s
n=14: 0.58s
n=18: 9.24s
n=21: 15s
n=22: 33s <<<<!!!>>>

numpy 2.0.1

Single core:

{'architecture': 'Haswell',
'benchmark time': 0.27034840000851545,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libscipy_openblas64_-fb1711452d4d8cee9f276fd1449ee5c7.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}

Multiple cores:

n=1: 0.27s
n=2: 0.25s
n=3: 0.21s
n=4: 0.23s
n=6: 0.18s
n=10: 0.27s
n=14: 0.31s
n=18: 10.8s
n=21: 22.4s
n=22: 23s <<<<!!!>>>

Additional comments

The time explodes after 16 threads. If I use 22 threads and look at the Resource Monitor, I see that 6 CPUs are still "parked" and no load is applied to them. The processor is described as having 16 cores and 22 logical processors.
See also: https://www.techpowerup.com/cpu-specs/core-ultra-7-155h.c3307#:~:text=The%20Intel%20Core%20Ultra%207,a%20total%20of%2022%20threads.
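A quick way to see what the machine reports for those counts (psutil is an extra dependency here, used only for the physical-core number) is:

import os
import psutil  # third-party: pip install psutil

print("logical processors:", os.cpu_count())               # 22 on this machine
print("physical cores:", psutil.cpu_count(logical=False))  # 16 on this machine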

@seberg
Member

seberg commented Aug 11, 2024

OK, so 6 are performance cores (with hyper-threading, so 2 threads each), 8 are "efficient" cores, and the last 2 are "low power efficiency" cores (https://ark.intel.com/content/www/us/en/ark/products/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz.html). But I am not sure how that explains why performance takes a nosedive after 16 threads...

It seems to me someone with knowledge of this architecture needs to teach OpenBLAS what the right guess for the number of threads is (unless there is even more complexity to it)...
Ping @r-devulap since it might interest you; also ping @martin-frbg, since we seem to be in OpenBLAS territory now.

@martin-frbg

Interesting to see this also happening with the old 0.3.23... I wonder if the numpy-2.0.1 used already has the latest fix for 0.3.27, or if we could be mixing/substituting problems there. Neither version, though, would have dedicated code for handling the peculiar situation of having three tiers of cores, if that has anything to do with it. I bought an i5-125H a few weeks ago but have not run Windows on it...

@seberg
Member

seberg commented Aug 11, 2024

IIRC, 2.0.1 would be in a state where it has the weird Windows threading bug? Anyway, the 2.1rc just landed; whether there has been a newer fix would be the thing to check. The quick thing would be to try pip install --pre numpy, which should give you 2.1.

@jankoslavic
Author

Thank you all.

I have tried pip install --pre numpy, but I cannot find the wheels.

This is what I get when I try to install a specific version:

... 2.0.0b1, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1) ERROR: No matching distribution found for numpy==2.1rc

@seberg
Member

seberg commented Aug 11, 2024

Sorry, my bad. The release was tagged, but not uploaded to PyPI yet. You could try the nightlies:

pip install -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple numpy

(you may need --pre there as well). That pulls in the nightlies, but they are roughly equivalent to 2.1rc.

@mattip
Member

mattip commented Aug 11, 2024

Interesting to see this also happening with the old 0.3.23

So it probably is not directly a problem with the threading code per se, but more a problem with oversubscription and waiting for the slow cores. I wonder if the same CPU running Linux (even in WSL?) exhibits this problem.

I wonder if the numpy-2.0.1 used already has the latest fix for 0.3.27, or ...

NumPy 2.0.1, like NumPy 2.0.0, uses exactly v0.3.27, with no additional patches or fixes. This means it is susceptible to the bug in #27036, so it is probably not advisable to use it in production.

the 2.1rc just landed

I don't think those wheels are available anywhere just yet, maybe in a few days. In any case, the Windows threading performance should be similar to that of the 1.26.4 wheels, since they use a patched OpenBLAS that reverts the problem in #27036.
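To double-check which BLAS build a given install actually loads (the same information the benchmark script prints), something like this works:

import pprint
import threadpoolctl

import numpy as np  # importing numpy loads its bundled BLAS

# Lists the loaded BLAS libraries with their path, version, and thread count.
pprint.pprint(threadpoolctl.threadpool_info())

np.show_config() prints the build-time BLAS information as well.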

@jankoslavic
Author

jankoslavic commented Aug 11, 2024

Thank you @seberg. I was able to install numpy 2.2.0.dev0.

The results are:
{'architecture': 'Haswell',
'benchmark time': 0.24210169998696074,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libscipy_openblas64_-bd796bfac570d29e6056bd021fef32bd.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}

n=1: 0.24s
n=2: 0.16s
n=4: 0.19s
n=6: 0.39s
n=12: 0.67s
n=13: 0.75s
n=18: 9.2s
n=22: 22s <<<<!!!>>>

--- update ---
With numpy 2.2.0.dev0, and after updating all Intel drivers, it now explodes only at 22 threads.
The results are:
{'architecture': 'Haswell',
'benchmark time': 0.23483800000030897,
'filepath': 'C:\Users\janko\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy.libs\libscipy_openblas64_-bd796bfac570d29e6056bd021fef32bd.dll',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libscipy_openblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}

n=1: 0.23s
n=2: 0.14s
n=4: 0.12s
n=6: 0.15s
n=12: 0.57s
n=13: 0.65s
n=18: 0.83s
n=19: 0.80s
n=22: 16s <<<<!!!>>>

@martin-frbg

OK, the last result looks a lot less frustrating (thanks for running all these tests!). So part of the problem appears to have been firmware or driver issues, and probably OpenBLAS "only" needs to learn to ignore the ultra-efficient cores (which are probably good enough for keeping Windows background tasks alive, but not much else).
My new i5-125H has the two low-power cores too (4P, 8E, 2LP), so I should be able to work something out (even if the Linux scheduler has appeared to be unaffected so far; I have not yet thrown any numpy code at it).

@martin-frbg

BTW, according to the Intel developer community forum, there is no straightforward call to identify the LP cores, but they can be recognized by being unable to access the L3 cache:
https://community.intel.com/t5/Processors/Detecting-LP-E-Cores-on-Meteor-Lake-in-software/td-p/1577956 (in practice it may be sufficient for OpenBLAS to subtract 2 from the maximum thread count after finding itself on a Meteor Lake CPU)
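Until OpenBLAS does that itself, a user-level stopgap along the same lines might look like this (a sketch; it simply applies the "leave two threads free" heuristic from above via threadpoolctl instead of touching OpenBLAS):

import os
import numpy as np
from threadpoolctl import threadpool_limits

# Keep the two low-power E-cores out of the BLAS pool by capping the
# thread count at two below the logical processor count.
limit = max(1, (os.cpu_count() or 1) - 2)

a = np.random.rand(1000, 1000)
with threadpool_limits(limits=limit, user_api="blas"):
    ainv = np.linalg.inv(a)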

@r-devulap
Member

r-devulap commented Aug 15, 2024

@jankoslavic I am a little confused. Your NumPy output says {'architecture': 'Haswell', but you mention that you have a Meteor Lake processor? Is NumPy reporting the architecture incorrectly?

EDIT: nevermind, I see the same on my MTL too :/

@r-devulap
Member

r-devulap commented Aug 15, 2024

BTW, this doesn't seem to be related to E-cores/P-cores or anything to do with Windows. I see this behavior on my Linux Skylake-X machine too, which has all uniform cores. The threshold seems to be the number of cores available in the system: it slows down quite a bit when you set threadpoolctl.threadpool_limits to one more than the number of cores available on your processor.
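The cliff is easy to demonstrate by bracketing the core count, reusing the same threadpoolctl/timeit approach as issue27174.py (a sketch):

import os
import timeit
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(1000, 1000)
ncpu = os.cpu_count()

# Time four inversions at the core count and at one past it.
for n in (ncpu, ncpu + 1):
    with threadpool_limits(limits=n, user_api="blas"):
        t = timeit.timeit(lambda: np.linalg.inv(a), number=4)
    print(f"{n} threads: {t:.2f} s")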

@martin-frbg

@r-devulap the architecture is reported as "Haswell" because that matches its capabilities as far as OpenBLAS is concerned: AVX2 but no AVX512.
I'm not sure what threadpoolctl.threadpool_limits translates to once numpy ends up calling OpenBLAS, but claiming that your system can run more threads in parallel than it physically provisions for cannot be a good idea: you're likely to have everything waiting for that odd surplus thread to catch up some (most?) of the time.
The Meteor Lake hardware only aggravates this by advertising full-featured thread capabilities on the two low-power E-cores while they're apparently barely able to cope with BLAS workloads.

@r-devulap
Member

r-devulap commented Aug 15, 2024

The Meteor Lake hardware only aggravates this by advertising full-featured thread capabilities on the two low-power E-cores while they're apparently barely able to cope with BLAS workloads.

[EDIT]: these numbers are with turbo disabled.

It doesn't look that way on my MTL, which has 2 P-cores, 8 efficient cores and 2 low-power efficient cores (14 threads in total). The performance seems just fine until the thread count exceeds 14:

'num_threads': 1,  'benchmark time': 0.5835756147280335,
'num_threads': 2,  'benchmark time': 0.3678867472335696,
'num_threads': 3,  'benchmark time': 0.4594470327720046,
'num_threads': 4,  'benchmark time': 0.3895368203520775,
'num_threads': 5,  'benchmark time': 0.3098673168569803,
'num_threads': 6,  'benchmark time': 0.30408395640552044,
'num_threads': 7,  'benchmark time': 0.259448042139411,
'num_threads': 8,  'benchmark time': 0.24070970434695482,
'num_threads': 9,  'benchmark time': 0.23088357970118523,
'num_threads': 10, 'benchmark time': 0.2374553494155407,
'num_threads': 11, 'benchmark time': 0.27123847883194685,
'num_threads': 12, 'benchmark time': 0.2701326012611389,
'num_threads': 13, 'benchmark time': 0.22367281932383776,
'num_threads': 14, 'benchmark time': 0.21274131909012794,
'num_threads': 15, 'benchmark time': 15.423967558890581,
'num_threads': 16, 'benchmark time': 25.803074752911925,

@r-devulap
Member

Hah, @martin-frbg, you are right. It looks like that behavior was with turbo disabled, which makes all the cores run at the same low frequency. Once I enable turbo, the P-cores and E-cores run at a much higher frequency, the low-power E-cores can't catch up, and the performance nosedives.

@martin-frbg

The problem is not observable under Linux: it is only when the actually available thread count gets exceeded that the benchmark time goes up drastically. My original tests were with an older system setup (Linux kernel 6.4), where it went from about 0.1 to 12 seconds. With the latest available (6.11rc3), which contains specific tuning for Meteor Lake provided by Intel engineers, it jumps from 0.1 to 3.5 seconds (at 18/19 threads on an Ultra 5 125H).
