Description
I realize NumPy is using experimental compilers for native builds on the M1 and still has some bugs, so it might be premature to discuss optimizations; perhaps this is a feature request rather than a bug report. However, one would expect native ARM code to typically be at least as fast as translated x86-64 code. I noticed that the nibabel bench_finite_range.py test is much slower with native code than with translated code: translated code (Python 3.8.3, NumPy 1.19.4) is about 10x faster than native code (Python 3.9.1rc1, NumPy 1.19.4).
Reproducing code example:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
from numpy.testing import measure
# Example where translated code (Python 3.8.3, NumPy 1.19.4) is ~10x faster
# than native code (Python 3.9.1rc1, NumPy 1.19.4).
rng = np.random.RandomState(20111001)
img_shape = (128, 128, 64, 10)
repeat = 100
arr = rng.normal(size=img_shape)
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all finite', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all finite', mtime))
arr[:, :, :, 1] = np.nan
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all nan', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all nan', mtime))
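As a side check (not part of the original benchmark), the following sketch prints whether the interpreter itself is a native arm64 build or an x86-64 build running under Rosetta 2 translation, which makes it easy to confirm which case a given timing run corresponds to:
# Optional sanity check: report whether this Python process is native arm64
# or an x86-64 build running under Rosetta 2 translation.
import platform
print('machine:', platform.machine())  # 'arm64' = native, 'x86_64' = translated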
Performance:
Translated:
$ time ./numpy_native_slower_than_translated.py
max all finite 0.18
min all finite 0.18
max all nan 0.18
min all nan 0.19
./numpy_native_slower_than_translated.py 1.32s user 1.28s system 214% cpu 1.213 total
Native:
$ time ./numpy_native_slower_than_translated.py
max all finite 1.98
min all finite 1.99
max all nan 1.99
min all nan 1.98
./numpy_native_slower_than_translated.py 8.49s user 0.14s system 104% cpu 8.237 total
NumPy/Python version information:
Translated:
- 1.19.4 3.8.3 (default, May 19 2020, 13:54:14)
[Clang 10.0.0 ]
Native:
- 1.19.4 3.9.1rc1 | packaged by conda-forge | (default, Nov 28 2020, 22:21:58)
[Clang 11.0.0 ]
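For reference, the version lines above are in the format produced by a small snippet along these lines (a sketch; the exact command used is not shown in the report):
# Sketch: print the NumPy and Python version information as shown above.
import sys
import numpy as np
print(np.__version__, sys.version)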