my norm is slow #5218
For M = np.random.randn(4000, 4000), the less-vectorized implementation is a bit faster on my machine (3.9 seconds vs. 4.6 seconds for timeit(..., number=100)). This seems weird to me. Is there a faster way to do what I want?
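The code being compared isn't preserved in this extraction; a hypothetical reproduction, assuming the ad-hoc variant was a row-by-row loop with small temporaries, might look like:

```python
import numpy as np
from timeit import timeit

M = np.random.randn(4000, 4000)

def norm_linalg(M):
    # Fully vectorized: internally materializes an M-sized temporary.
    return np.linalg.norm(M, axis=1)

def norm_ad_hoc(M):
    # Less vectorized: one dot product per row, so the temporaries
    # stay small and cache-resident.
    out = np.empty(M.shape[0])
    for i in range(M.shape[0]):
        row = M[i]
        out[i] = np.sqrt(np.dot(row, row))
    return out

print(timeit(lambda: norm_linalg(M), number=100))
print(timeit(lambda: norm_ad_hoc(M), number=100))
```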
There are two things that make the ad_hoc variant faster: it is more efficient with the CPU caches, and because its temporaries are smaller it does not require page zeroing in the memory allocator.
One could block the code in norm to make it faster.
I see that 'Intel Performance Primitives' has
Performance-wise it makes a lot of sense; there are numerous operations of this type that are used often: max_abs, square_sum, abs_sum (or related combined operations like min_max, sincos, etc.). For reductions, the API could be something along the lines of the sketch below:
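The original API sketch isn't preserved; a purely hypothetical shape for such fused reductions (none of these functions exist in NumPy) might be:

```python
import numpy as np

def square_sum(x, axis=None):
    # Hypothetical fused reduction (not actual NumPy API): semantically
    # equivalent to np.add.reduce(np.square(x), axis=axis), but a real
    # implementation would fuse the two loops in C so the full-size
    # np.square(x) temporary is never materialized.
    return np.add.reduce(np.square(x), axis=axis)

def max_abs(x, axis=None):
    # Same idea for a combined max-of-absolute-values reduction.
    return np.maximum.reduce(np.abs(x), axis=axis)
```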
A hacky implementation might be to use the data argument of the ufunc innerloop to pass in another ufunc innerloop.
For reference, I checked this against a Cython implementation that hard-codes the input array ndim as 2 and the axis as 1; it takes 2.1 seconds, vs. 3.9 for the ad-hoc implementation and 4.6 for the numpy linalg norm.
It will not hold for all input shapes, but your original example can be sped up comparably to Cython (by my measurements) by doing the following:
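The snippet that followed isn't preserved here; a reconstruction in the spirit of the blocking idea mentioned earlier (my sketch, not the original code) could look like:

```python
import numpy as np

def blocked_row_norm(M, block=256):
    # Reduce over axis 1 in blocks of rows, so each temporary
    # (chunk * chunk) stays roughly cache-sized instead of M-sized.
    out = np.empty(M.shape[0])
    for start in range(0, M.shape[0], block):
        chunk = M[start:start + block]
        out[start:start + block] = np.sqrt((chunk * chunk).sum(axis=1))
    return out
```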
It is of course substantially slower for a worst case shape:
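The worst-case snippet is also missing; assuming the blocked sketch above, a plausible worst case is millions of very short rows, where the Python-level chunk loop dominates the arithmetic:

```python
# Hypothetical worst case (my assumption, not the original shape):
# 16 million length-1 rows means tens of thousands of tiny chunks,
# so loop overhead swamps the actual computation.
M_bad = np.random.randn(16_000_000, 1)
```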
So while I am not sure that such a change makes sense, I tend to think that it does. Any thoughts?
I'd rather not slow down the worst-case shape.
Are the things mentioned here (worse shapes) also the reason why calculating the 2-norm using np.sqrt(a[...,0]**2 + a[...,1]**2 + a[...,2]**2) is significantly faster than the numpy methods np.linalg.norm(a, axis=-1) or its underlying np.sqrt(np.add.reduce((a.conj() * a).real, axis=0))?
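A quick way to check the gap on one's own machine (a minimal sketch; the array shape is my assumption):

```python
import numpy as np
from timeit import timeit

a = np.random.randn(1_000_000, 3)

# Unrolled: three small elementwise temporaries, no generic
# reduction machinery involved.
unrolled = lambda: np.sqrt(a[..., 0]**2 + a[..., 1]**2 + a[..., 2]**2)
# Library call: materializes the full (a.conj() * a).real temporary,
# then reduces over the last axis.
library = lambda: np.linalg.norm(a, axis=-1)

print(timeit(unrolled, number=100))
print(timeit(library, number=100))
```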
I am going to close this in favor of gh-18483, since optimized functions such as |