ENH: Integrate Optimized Routines for AArch64 #23171


Closed: wanted to merge 1 commit from the optimized-routines branch

Conversation

@Mousius (Member) commented on Feb 7, 2023

Adds initial support for using the Optimized Routines library to improve performance on AArch64.

`cos` and `sin` are implemented to demonstrate the flow through to the library calls; more will be added in a follow-up patch, to align with the existing SVML integration.

I've updated both setup.py and meson.build, but I'm unsure which gets triggered when 🤔
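For context, here is a minimal sketch of the flow being described: a SIMD inner loop hands each vector of doubles to an Optimized Routines call such as `__v_cos` (the symbol name matches the diff later in this thread; the loop structure, function names, and scalar tail handling below are illustrative assumptions, not the PR's actual code).

```c
#include <arm_neon.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical AOR entry point: 2-lane double-precision cosine. */
float64x2_t __v_cos(float64x2_t x);

/* Illustrative inner loop: process contiguous doubles two at a time,
   falling back to scalar libm for the tail. */
void cos_loop(const double *src, double *dst, size_t len)
{
    size_t i = 0;
    for (; i + 2 <= len; i += 2)
        vst1q_f64(dst + i, __v_cos(vld1q_f64(src + i)));
    for (; i < len; i++)
        dst[i] = cos(src[i]);
}
```

Benchmark results (before vs. after):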

          before            after    ratio  benchmark
         729±2μs        397±0.4μs     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 1, 'd')
        732±10μs          414±6μs     0.56  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 2, 'd')
        776±20μs         436±10μs     0.56  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 4, 'd')
         731±2μs        419±0.3μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 1, 'd')
         732±3μs        439±0.5μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 2, 'd')
        735±20μs         440±10μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 4, 'd')
         736±1μs          420±1μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 1, 'd')
       736±0.7μs          441±1μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 2, 'd')
         736±2μs          442±2μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 4, 'd')
         871±5μs        395±0.3μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'd')
        876±10μs          413±6μs     0.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 2, 'd')
        928±30μs         436±10μs     0.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 4, 'd')
         871±5μs        417±0.2μs     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 1, 'd')
         874±2μs        430±0.2μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'd')
        876±20μs         430±10μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'd')
         876±4μs          419±1μs     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 1, 'd')
         872±2μs          431±2μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 2, 'd')
         875±6μs          434±2μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'd')

See: https://mail.python.org/archives/list/[email protected]/message/GTHX4TFRUCGQI2VPHEWMEC4GBOAOOH4C/

@Mousius force-pushed the optimized-routines branch from 05a8073 to 141a2e5 on February 7, 2023 at 17:41
@PierreBlanchard left a comment:

Just some nits for now.

** $maxopt baseline avx512_skx
** $maxopt baseline
** avx512_skx
** asimd


I think advsimd is the preferred name.

@Mousius (Member, Author) replied:

This is the standard naming within numpy itself; I don't think there's a strong enough reason to change it?

@joeramsay left a comment:

Slightly apologetic comment - AOR/pl has recently removed some symbols, and AOR/math likely will soon as well, so the way function names are chosen is not future-proof (and won't work at all if you want any routines from pl/). How are you building this? I don't see any config.mk file in this patch, so I'm interested in how AOR is being configured.

@@ -74,7 +92,11 @@ simd_@func@_f64(const double *src, npy_intp ssrc,
     } else {
         x = npyv_loadn_tillz_f64(src, ssrc, len);
     }
+#if defined(_NPY_OPTIMIZED_ROUTINES)
+    npyv_f64 out = __v_@func@(x);


In the pl/ subdir of AOR, and likely in math/ as well, the __v_-prefixed symbols are being removed (see for example this commit). From now on, the only reliably supported names are the VFABI-mangled variants, so I don't think this line is future-proof.
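For illustration, a hedged sketch of the VFABI-mangled spelling for a 2-lane double-precision routine. The name follows the AArch64 vector function ABI (`_ZGV`, then `n` for AdvSIMD, `N` for unmasked, the lane count, and one `v` per vector argument); treat the exact declaration and the wrapper below as assumptions, not AOR's published header.

```c
#include <arm_neon.h>

/* VFABI-mangled AdvSIMD variant of cos: unmasked, 2 lanes, one vector
   argument. Same purpose as the __v_cos spelling used in this patch. */
float64x2_t _ZGVnN2v_cos(float64x2_t x);

/* Hypothetical wrapper so calling code keeps a stable, readable name
   while the library-side symbol stays ABI-mangled. */
static inline float64x2_t vec_cos_f64(float64x2_t x)
{
    return _ZGVnN2v_cos(x);
}
```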

@Mousius (Member, Author) replied:

Argh, this makes it slightly more irritating than the existing SVML integration; I can change these when I integrate the next set of functions, though.

@Mousius (Member, Author) commented on Feb 8, 2023

> Slightly apologetic comment - AOR/pl has recently removed some symbols […] How are you building this? I don't see any config.mk file in this patch, so I'm interested in how AOR is being configured.

I'm just building the bits I need as a submodule in numpy rather than taking the whole library in; that would require additional effort that doesn't appear to be justified?

@joeramsay replied:

> I'm just building the bits I need as a submodule in numpy rather than taking the whole library in; that would require additional effort that doesn't appear to be justified?

So the C files are just compiled independently of the AOR Makefiles? That makes sense, but since AOR is designed to be built with a config file, you may end up missing certain features (or even routines) that are enabled based on settings in that file; for instance, without -DWANT_SIMD_EXCEPT=1 you will not have fp exceptions triggered correctly. My understanding is that numpy requires this, so I'm surprised not to see it mentioned here.

@Mousius (Member, Author) commented on Feb 8, 2023

> So the C files are just compiled independently of the AOR Makefiles? […] My understanding is that numpy requires this, so I'm surprised not to see it mentioned here.

Likewise, but if the tests are passing I would consider that sufficient for now. I'm keen not to add too many additional flags if I can avoid it, especially given they're not namespaced to Optimized Routines and so could cause other software to compile weirdly.

@joeramsay replied:

> Likewise, but if the tests are passing I would consider that sufficient for now […]

I understand what you're saying, but I think this could become very difficult to maintain, for a couple of reasons. One is that AOR routines are not always completely self-contained in one source file; there may be helper routines or coefficient arrays in a different TU (not the case for sin/cos, so it was fine for this patch). These get moved around from time to time; we don't make any promises about what files will exist, and indeed we are in the process of shuffling some things around. The VFABI-mangled symbols don't come from v_<func>.c at all; they come from vn_<func>.c, but again, we are in the process of changing this. The only way to reliably get the symbols you need is by building the whole library.

As well, there are certain flags that you will need in your compiler invocation at some point; for instance, you can't build the SVE routines without explicitly enabling them, either in the config file or with -DWANT_SVE_MATH=1.
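To make the flag discussion concrete, here is a hedged sketch of the pattern such config macros typically gate in AOR-style code; the function and helper names are illustrative, not the actual AOR source. With WANT_SIMD_EXCEPT off, special inputs take a fast path with no promises about exception flags; with it on, they are re-routed through a scalar call that raises the IEEE 754 flags NumPy's error reporting relies on.

```c
#include <math.h>

/* Illustrative scalar fallback: a lane-by-lane libm call raises the
   usual fp exceptions (invalid, overflow, ...) for problem inputs. */
static double special_case(double x)
{
    return cos(x);
}

/* One lane of a hypothetical vector cosine, showing how a config macro
   such as WANT_SIMD_EXCEPT typically gates behaviour. */
double cos_one_lane(double x)
{
#if WANT_SIMD_EXCEPT
    /* Exceptions requested: divert special inputs through the scalar
       path so the fp status flags get set as IEEE 754 requires. */
    if (!isfinite(x))
        return special_case(x);
#endif
    /* Fast path: range reduction plus polynomial; with the macro off,
       special inputs are handled without promising exception flags. */
    return cos(x); /* placeholder for the vectorised core */
}
```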

@Mousius (Member, Author) commented on Feb 8, 2023

> I understand what you're saying, but I think this could become very difficult to maintain […] The only way to reliably get the symbols you need is by building the whole library.

I think that's OK: from the brief discussion with @mattip, the focus should be on moving to universal intrinsics in numpy. AOR is a good intermediate step for the performance boost, but we can avoid long-term dependence on it if we can't include it minimally. In the intermediate state we can accept having the files as-is and a slightly painful upgrade path, which will likely land on me 😸

@mattip (Member) commented on Feb 8, 2023

Perusing the source repo, I came across this file. Is this the implementation of cos? It has this comment:

/* worst-case error is 3.5 ulp.
   abs error: 0x1.be222a58p-53 in [-pi/2, pi/2].  */

I would think that would exceed our tests, so I must be looking at the wrong code.

@Mousius (Member, Author) commented on Feb 8, 2023

> Perusing the source repo, I came across this file. Is this the implementation of cos? It has this comment:
>
>     /* worst-case error is 3.5 ulp.
>        abs error: 0x1.be222a58p-53 in [-pi/2, pi/2].  */
>
> I would think that would exceed our tests, so I must be looking at the wrong code.

Interesting, as the original SVML implementation is 4 ULP (#19478).

@mattip (Member) commented on Feb 8, 2023

We discussed this when adding the validation tests, starting around here in the PR. Here are the CPU features used in CI. Are the new code paths triggered there?

@mattip (Member) commented on Feb 8, 2023

> I've updated both setup.py and meson.build, but I'm unsure which gets triggered when

We are transitioning from setup.py to meson. The aarch64 CI run is still using setup.py.

@Mousius (Member, Author) commented on Feb 8, 2023

> We discussed this when adding the validation tests, starting around here in the PR. Here are the CPU features used in CI. Are the new code paths triggered there?

They should be, as I had to add the git submodule update --init line into Cirrus for the cibuildwheel build to work 😸

@mattip (Member) commented on Feb 8, 2023

Adding the submodule means the routines were compiled in. But how do we know they were used at runtime? The only information we get is that the CPU detection kicked in, which should then choose the correct inner loops. Which features are required for these routines?

@@ -1,3 +1,5 @@
+# Copyright 2023 Arm Limited and/or its affiliates <[email protected]>

A member commented:

Why are you adding a copyright to this file?

@Mousius (Member, Author) replied:

For the small amount of additional logic changed to pull the submodule.

@seberg (Member) commented on Feb 10, 2023

I understand that legal teams like copyright notices, but we don't have a habit of adding them to files, and at least right now I would prefer to keep it that way and have you explain this to the legal team at ARM.

If we really did, some of these files would be littered with personal or company copyrights (quansight, nvidia, intel, apple, ...) and probably even more universities...

If the legal team needs a more definite no from us, I suspect we can just give that.

@Mousius (Member, Author) replied:

@seberg I've raised this with our team and I'll let you know the outcome; I have to try to follow best practices first, but it's relatively common for us to have a reasonable discussion about what suits the project 😸

I would point out that there are a number of explicit copyright claims throughout the codebase, which implies this is an existing practice; maybe it's worth considering how we make that more consistent?

@Mousius (Member, Author) commented on Feb 8, 2023

> Adding the submodule means the routines were compiled in. But how do we know they were used at runtime? […] Which features are required for these routines?

They all trigger from ASIMD, which should be standard on AArch64 machines.

@mattip (Member) commented on Feb 9, 2023

Indeed, ASIMD is in the set of required baseline features. The cirrusCI machines report:

NumPy CPU features:  NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM?

@Mousius (Member, Author) commented on Feb 9, 2023

> Indeed, ASIMD is in the set of required baseline features. The cirrusCI machines report:
>
>     NumPy CPU features:  NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM?

Hmm, so I'm guessing this is all fine in terms of the tests, or are there further concerns from your point of view?

@PierreBlanchard replied:

> We discussed this when adding the validation tests, starting around here in the PR. […] Are the new code paths triggered there?

Regarding accuracy, our design principle is to have a threshold of 3.5 ULP over the entire range of the libm routines, just as SVML has a 4 ULP threshold. Unfortunately we cannot promise better accuracy than a given reference on a per-routine basis, which is why all libraries use a single threshold (glibc's libmvec has 4 ULP too).
Are you using the same error threshold for testing SVML and the regular scalar implementation?

Am I right in thinking your test thresholds are based on your own evaluation of ULP errors, or are these ULP errors provided by SVML? Evaluating the maximum error of double-precision routines is notoriously hard, and sampling the whole domain (or even a reduced interval) randomly is fairly inaccurate, even with billions of points.

This is also a reason why AOR routines might appear to pass the tests even with maximum errors larger than the current threshold. A random set of test inputs will likely not trigger the worst cases even with millions of inputs.
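As a concrete illustration of what is being measured, here is a minimal sketch of the standard bit-level way to count the ULP distance between a computed result and a reference correctly rounded to double (helper names are ours, not from the NumPy or AOR test suites; assumes IEEE 754 binary64).

```c
#include <stdint.h>
#include <string.h>

/* Map a double onto a monotonically ordered 64-bit integer so that
   subtracting two mapped values counts the representable doubles
   (ULP steps) between them. */
static int64_t ordered(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);
    /* Negative values have the sign bit set; remap them so integer
       order matches numeric order (-0.0 and +0.0 both map to 0). */
    return i < 0 ? INT64_MIN - i : i;
}

/* ULP distance between a computed value and the reference; NaNs and
   infinities would need separate handling in a real harness. */
static uint64_t ulp_distance(double got, double want)
{
    int64_t a = ordered(got), b = ordered(want);
    /* Subtract as unsigned to avoid signed overflow on large gaps. */
    return a > b ? (uint64_t)a - (uint64_t)b : (uint64_t)b - (uint64_t)a;
}
```

A harness can then report the maximum of `ulp_distance(impl(x), ref(x))` over sampled inputs, and the point above is precisely that random sampling rarely hits the true worst case.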

Let me know if this discussion is better suited for this PR.

@PierreBlanchard commented on Mar 15, 2023

Hi @mattip!
I have a few naive questions about this PR.
What is going to happen to these routines once they are ported to numpy intrinsics? How are they going to be used?
Is the point to be able to use them on architectures other than Arm? Or is there more to it?
Are you still going to keep the SVML implementations, and use them on AVX512-enabled machines? Are you going to pick the fastest implementation?

How are they going to be maintained, and by whom? Since AOR implementations are continuously improving, they might diverge quite significantly from the last drop. So each update would basically consist of doing a full port from AOR to numpy intrinsics again. The situation would become even worse if they were also modified from within numpy.

An option to keep maintenance low would be to provide an AOR-to-NumPy interface. We just got rid of this feature, but we might be able to re-introduce it in a different way if necessary. The code would still have to be configured (pre-processed) so that it has the features NumPy expects.

NumPy intrinsics seem like an interesting concept; I'm just curious about the implications of trying to "upstream" external work into NumPy using this language.

@mattip (Member) commented on Mar 15, 2023

I am not sure why the question is directed at me, but I will try to supply some answers. In general, NumPy accepts contributions from many people, and as time goes by, if code is not maintainable or has outlived its usefulness, we remove it. This is especially true of routines that, like the inner loops of ufuncs, are not exposed to users. Thus we have added and dropped support for different compilers and hardware over the years: notably, support for the Apple Accelerate library was removed and then restored when it once again passed the acceptance tests.

The ufunc dispatch framework provides a way to choose the "most appropriate" inner loop, based on runtime detection of supported CPU features, and some heuristics (strides, contiguous data). This PR changes none of that, but the mechanism might need some tweaks if someone takes the time to analyze appropriate heuristics for various ARM processors.
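As a rough illustration of the dispatch idea (NumPy's real machinery uses generated dispatch headers and macros rather than anything this simple; all names below are hand-rolled stand-ins, not NumPy's API):

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*cos_loop_t)(const double *src, double *dst, size_t len);

/* Variants compiled for different targets (definitions elsewhere). */
void cos_loop_scalar(const double *src, double *dst, size_t len);
void cos_loop_asimd(const double *src, double *dst, size_t len);

/* Stand-in for runtime CPU feature detection. */
bool cpu_has_asimd(void);

/* Resolved once at startup: pick the best loop this CPU supports. */
static cos_loop_t resolve_cos_loop(void)
{
    return cpu_has_asimd() ? cos_loop_asimd : cos_loop_scalar;
}
```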

Of course we would prefer all these routines be rewritten in universal intrinsics. But we understand that expertise in them is hard to find. This is true for both SVML and AOR: once we have generic routines we can remove the architecture-specific ones.

Integrating a new version of a vendored dependency is tricky. If the AOR team feels the library is not stable and the interfaces will change drastically, perhaps we should hold off with integration until it stabilizes. That question should be answered by the AOR team.

Personally, while I welcome the contributions, I am not sure the 2x performance increase justifies the additional complexity introduced in this PR. Are there routines or hardware where AOR gives a larger boost?

@Mousius (Member, Author) commented on Apr 25, 2023

Closing this in favour of #23399
