Description
I realize NumPy is using experimental compilers for native builds on the M1 and still has some bugs, so it might be premature to discuss optimizations; perhaps this is a feature request rather than a bug report. However, one would expect native ARM code to typically be at least as fast as translated x86-64 code. I noticed that the nibabel bench_finite_range.py test is much slower with native code than with translated code: translated code (Python 3.8.3, NumPy 1.19.4) is about 10x faster than native code (Python 3.9.1rc1, NumPy 1.19.4).
Reproducing code example:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
from numpy.testing import measure
# Example where translated code (Python 3.8.3, NumPy 1.19.4) is ~10x faster
# than native code (Python 3.9.1rc1, NumPy 1.19.4).
rng = np.random.RandomState(20111001)
img_shape = (128, 128, 64, 10)
repeat = 100
arr = rng.normal(size=img_shape)
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all finite', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all finite', mtime))
arr[:, :, :, 1] = np.nan
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all nan', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all nan', mtime))
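As a side check (not part of the original benchmark), the following sketch prints whether the interpreter itself is a native arm64 build or an x86-64 build running under Rosetta 2 translation, which makes it easy to confirm which case a given timing run corresponds to:
# Optional sanity check: report whether this Python process is native arm64
# or an x86-64 build running under Rosetta 2 translation.
import platform
print('machine:', platform.machine())  # 'arm64' = native, 'x86_64' = translated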
Performance:
Translated:
$ time ./numpy_native_slower_than_translated.py
max all finite 0.18
min all finite 0.18
max all nan 0.18
min all nan 0.19
./numpy_native_slower_than_translated.py 1.32s user 1.28s system 214% cpu 1.213 total
Native:
$ time ./numpy_native_slower_than_translated.py
max all finite 1.98
min all finite 1.99
max all nan 1.99
min all nan 1.98
./numpy_native_slower_than_translated.py 8.49s user 0.14s system 104% cpu 8.237 total
NumPy/Python version information:
Translated:
- 1.19.4 3.8.3 (default, May 19 2020, 13:54:14)
[Clang 10.0.0 ]
Native:
- 1.19.4 3.9.1rc1 | packaged by conda-forge | (default, Nov 28 2020, 22:21:58)
[Clang 11.0.0 ]
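For reference, the version lines above are in the format produced by a small snippet along these lines (a sketch; the exact command used is not shown in the report):
# Sketch: print the NumPy and Python version information as shown above.
import sys
import numpy as np
print(np.__version__, sys.version)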