ENH: Integrate Optimized Routines for AArch64 #23171


Closed: wanted to merge 1 commit from the optimized-routines branch

Conversation

@Mousius (Member) commented on Feb 7, 2023

Adds initial support for using the Optimized Routines library to improve performance on AArch64.

`cos` and `sin` are implemented to demonstrate the flow through to the library calls; more will be added in a follow-up patch, to align with the existing SVML integration.

I've updated both setup.py and meson.build, but I'm unsure which gets triggered when 🤔
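For context, here is a minimal sketch of the flow being described: a SIMD inner loop hands each vector of doubles to an Optimized Routines call such as `__v_cos` (the symbol name matches the diff later in this thread; the loop structure, function names, and scalar tail handling below are illustrative assumptions, not the PR's actual code).

```c
#include <arm_neon.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical AOR entry point: 2-lane double-precision cosine. */
float64x2_t __v_cos(float64x2_t x);

/* Illustrative inner loop: process contiguous doubles two at a time,
   falling back to scalar libm for the tail. */
void cos_loop(const double *src, double *dst, size_t len)
{
    size_t i = 0;
    for (; i + 2 <= len; i += 2)
        vst1q_f64(dst + i, __v_cos(vld1q_f64(src + i)));
    for (; i < len; i++)
        dst[i] = cos(src[i]);
}
```

Benchmark results (before vs. after):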

          before            after    ratio  benchmark
         729±2μs        397±0.4μs     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 1, 'd')
        732±10μs          414±6μs     0.56  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 2, 'd')
        776±20μs         436±10μs     0.56  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 4, 'd')
         731±2μs        419±0.3μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 1, 'd')
         732±3μs        439±0.5μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 2, 'd')
        735±20μs         440±10μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 4, 'd')
         736±1μs          420±1μs     0.57  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 1, 'd')
       736±0.7μs          441±1μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 2, 'd')
         736±2μs          442±2μs     0.60  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 4, 'd')
         871±5μs        395±0.3μs     0.45  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'd')
        876±10μs          413±6μs     0.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 2, 'd')
        928±30μs         436±10μs     0.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 4, 'd')
         871±5μs        417±0.2μs     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 1, 'd')
         874±2μs        430±0.2μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'd')
        876±20μs         430±10μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'd')
         876±4μs          419±1μs     0.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 1, 'd')
         872±2μs          431±2μs     0.49  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 2, 'd')
         875±6μs          434±2μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'd')

See: https://mail.python.org/archives/list/[email protected]/message/GTHX4TFRUCGQI2VPHEWMEC4GBOAOOH4C/

@Mousius force-pushed the optimized-routines branch from 05a8073 to 141a2e5 on February 7, 2023 at 17:41
@PierreBlanchard left a comment:

Just some nits for now.

** $maxopt baseline avx512_skx
** $maxopt baseline
** avx512_skx
** asimd


I think advsimd is the preferred name.

@Mousius (Member, Author) replied:

This is the standard naming within numpy itself; I don't think there's a strong enough reason to change it?

@joeramsay left a comment:

Slightly apologetic comment - AOR/pl has recently removed some symbols, and AOR/math likely will soon as well, so the way function names are chosen is not future-proof (and won't work at all if you want any routines from pl/). How are you building this? I don't see any config.mk file in this patch, so I'm interested in how AOR is being configured.

@@ -74,7 +92,11 @@ simd_@func@_f64(const double *src, npy_intp ssrc,
     } else {
         x = npyv_loadn_tillz_f64(src, ssrc, len);
     }
+#if defined(_NPY_OPTIMIZED_ROUTINES)
+    npyv_f64 out = __v_@func@(x);


In the pl/ subdir of AOR, and likely in math/ as well, the __v_-prefixed symbols are being removed (see for example this commit). From now on, the only reliably supported names are the VFABI-mangled variants, so I don't think this line is future-proof.
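For illustration, a hedged sketch of the VFABI-mangled spelling for a 2-lane double-precision routine. The name follows the AArch64 vector function ABI (`_ZGV`, then `n` for AdvSIMD, `N` for unmasked, the lane count, and one `v` per vector argument); treat the exact declaration and the wrapper below as assumptions, not AOR's published header.

```c
#include <arm_neon.h>

/* VFABI-mangled AdvSIMD variant of cos: unmasked, 2 lanes, one vector
   argument. Same purpose as the __v_cos spelling used in this patch. */
float64x2_t _ZGVnN2v_cos(float64x2_t x);

/* Hypothetical wrapper so calling code keeps a stable, readable name
   while the library-side symbol stays ABI-mangled. */
static inline float64x2_t vec_cos_f64(float64x2_t x)
{
    return _ZGVnN2v_cos(x);
}
```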

@Mousius (Member, Author) replied:

Argh, this makes it slightly more irritating than the existing SVML integration; I can change these when I integrate the next set of functions, though.

@Mousius (Member, Author) commented on Feb 8, 2023

> Slightly apologetic comment - AOR/pl has recently removed some symbols […] How are you building this? I don't see any config.mk file in this patch, so I'm interested in how AOR is being configured.

I'm just building the bits I need as a submodule in numpy rather than taking the whole library in; that would require additional effort that doesn't appear to be justified?

@joeramsay replied:

> I'm just building the bits I need as a submodule in numpy rather than taking the whole library in; that would require additional effort that doesn't appear to be justified?

So the C files are just compiled independently of the AOR Makefiles? That makes sense, but since AOR is designed to be built with a config file, you may end up missing certain features (or even routines) that are enabled based on settings in that file; for instance, without -DWANT_SIMD_EXCEPT=1 you will not have fp exceptions triggered correctly. My understanding is that numpy requires this, so I'm surprised not to see it mentioned here.

@Mousius (Member, Author) commented on Feb 8, 2023

> So the C files are just compiled independently of the AOR Makefiles? […] My understanding is that numpy requires this, so I'm surprised not to see it mentioned here.

Likewise, but if the tests are passing I would consider that sufficient for now. I'm keen not to add too many additional flags if I can avoid it, especially given they're not namespaced to Optimized Routines and so could cause other software to compile weirdly.

@joeramsay replied:

> Likewise, but if the tests are passing I would consider that sufficient for now […]

I understand what you're saying, but I think this could become very difficult to maintain, for a couple of reasons. One is that AOR routines are not always completely self-contained in one source file; there may be helper routines or coefficient arrays in a different TU (not the case for sin/cos, so it was fine for this patch). These get moved around from time to time; we don't make any promises about what files will exist, and indeed we are in the process of shuffling some things around. The VFABI-mangled symbols don't come from v_<func>.c at all; they come from vn_<func>.c, but again, we are in the process of changing this. The only way to reliably get the symbols you need is by building the whole library.

As well, there are certain flags that you will need in your compiler invocation at some point; for instance, you can't build the SVE routines without explicitly enabling them, either in the config file or with -DWANT_SVE_MATH=1.
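To make the flag discussion concrete, here is a hedged sketch of the pattern such config macros typically gate in AOR-style code; the function and helper names are illustrative, not the actual AOR source. With WANT_SIMD_EXCEPT off, special inputs take a fast path with no promises about exception flags; with it on, they are re-routed through a scalar call that raises the IEEE 754 flags NumPy's error reporting relies on.

```c
#include <math.h>

/* Illustrative scalar fallback: a lane-by-lane libm call raises the
   usual fp exceptions (invalid, overflow, ...) for problem inputs. */
static double special_case(double x)
{
    return cos(x);
}

/* One lane of a hypothetical vector cosine, showing how a config macro
   such as WANT_SIMD_EXCEPT typically gates behaviour. */
double cos_one_lane(double x)
{
#if WANT_SIMD_EXCEPT
    /* Exceptions requested: divert special inputs through the scalar
       path so the fp status flags get set as IEEE 754 requires. */
    if (!isfinite(x))
        return special_case(x);
#endif
    /* Fast path: range reduction plus polynomial; with the macro off,
       special inputs are handled without promising exception flags. */
    return cos(x); /* placeholder for the vectorised core */
}
```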

@Mousius (Member, Author) commented on Feb 8, 2023

> I understand what you're saying, but I think this could become very difficult to maintain […] The only way to reliably get the symbols you need is by building the whole library.

I think that's OK: from the brief discussion with @mattip, the focus should be on moving to universal intrinsics in numpy. AOR is a good intermediate step for the performance boost, but we can avoid long-term dependence on it if we can't include it minimally. In the intermediate state we can accept having the files as-is and a slightly painful upgrade path, which will likely land on me 😸

@mattip (Member) commented on Feb 8, 2023

Perusing the source repo, I came across this file. Is this the implementation of cos? It has this comment:

/* worst-case error is 3.5 ulp.
   abs error: 0x1.be222a58p-53 in [-pi/2, pi/2].  */

I would think that would exceed our tests, so I must be looking at the wrong code.

@Mousius (Member, Author) commented on Feb 8, 2023

> Perusing the source repo, I came across this file. Is this the implementation of cos? It has this comment:
>
>     /* worst-case error is 3.5 ulp.
>        abs error: 0x1.be222a58p-53 in [-pi/2, pi/2].  */
>
> I would think that would exceed our tests, so I must be looking at the wrong code.

Interesting, as the original SVML implementation is 4 ULP (#19478).

@mattip (Member) commented on Feb 8, 2023

We discussed this when adding the validation tests, starting around here in the PR. Here are the CPU features used in CI. Are the new code paths triggered there?

@mattip (Member) commented on Feb 8, 2023

> I've updated both setup.py and meson.build, but I'm unsure which gets triggered when

We are transitioning from setup.py to meson. The aarch64 CI run is still using setup.py.

@Mousius (Member, Author) commented on Feb 8, 2023

> We discussed this when adding the validation tests, starting around here in the PR. Here are the CPU features used in CI. Are the new code paths triggered there?

They should be, as I had to add the git submodule update --init line into Cirrus for the cibuildwheel build to work 😸

@mattip (Member) commented on Feb 8, 2023

Adding the submodule means the routines were compiled in. But how do we know they were used at runtime? The only information we get is that the CPU detection kicked in, which should then choose the correct inner loops. Which features are required for these routines?

@@ -1,3 +1,5 @@
+# Copyright 2023 Arm Limited and/or its affiliates <[email protected]>

A member commented:

Why are you adding a copyright to this file?

@Mousius (Member, Author) replied:

For the small amount of additional logic changed to pull the submodule.

@seberg (Member) commented on Feb 10, 2023

I understand that legal teams like copyright notices, but we don't have a habit of adding them to files, and at least right now I would prefer to keep it that way and have you explain this to the legal team at ARM.

If we really did, some of these files would be littered with personal or company copyrights (quansight, nvidia, intel, apple, ...) and probably even more universities...

If the legal team needs a more definite no from us, I suspect we can just give that.

@Mousius (Member, Author) replied:

@seberg I've raised this with our team and I'll let you know the outcome; I have to try to follow best practices first, but it's relatively common for us to have a reasonable discussion about what suits the project 😸

I would point out that there are a number of explicit copyright claims throughout the codebase, which implies this is an existing practice; maybe it's worth considering how we make that more consistent?

@Mousius (Member, Author) commented on Feb 8, 2023

> Adding the submodule means the routines were compiled in. But how do we know they were used at runtime? […] Which features are required for these routines?

They all trigger from ASIMD, which should be standard on AArch64 machines.

@mattip (Member) commented on Feb 9, 2023

Indeed, ASIMD is in the set of required baseline features. The cirrusCI machines report:

NumPy CPU features:  NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM?

@Mousius (Member, Author) commented on Feb 9, 2023

> Indeed, ASIMD is in the set of required baseline features. The cirrusCI machines report:
>
>     NumPy CPU features:  NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM?

Hmm, so I'm guessing this is all fine in terms of the tests, or are there further concerns from your point of view?

@PierreBlanchard replied:

> We discussed this when adding the validation tests, starting around here in the PR. […] Are the new code paths triggered there?

Regarding accuracy, our design principle is to have a threshold of 3.5 ULP over the entire range of the libm routines, just as SVML has a 4 ULP threshold. Unfortunately we cannot promise better accuracy than a given reference on a per-routine basis, which is why all libraries use a single threshold (glibc's libmvec has 4 ULP too).
Are you using the same error threshold for testing SVML and the regular scalar implementation?

Am I right in thinking your test thresholds are based on your own evaluation of ULP errors, or are these ULP errors provided by SVML? Evaluating the maximum error of double-precision routines is notoriously hard, and sampling the whole domain (or even a reduced interval) randomly is fairly inaccurate, even with billions of points.

This is also a reason why AOR routines might appear to pass the tests even with maximum errors larger than the current threshold. A random set of test inputs will likely not trigger the worst cases even with millions of inputs.
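As a concrete illustration of what is being measured, here is a minimal sketch of the standard bit-level way to count the ULP distance between a computed result and a reference correctly rounded to double (helper names are ours, not from the NumPy or AOR test suites; assumes IEEE 754 binary64).

```c
#include <stdint.h>
#include <string.h>

/* Map a double onto a monotonically ordered 64-bit integer so that
   subtracting two mapped values counts the representable doubles
   (ULP steps) between them. */
static int64_t ordered(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);
    /* Negative values have the sign bit set; remap them so integer
       order matches numeric order (-0.0 and +0.0 both map to 0). */
    return i < 0 ? INT64_MIN - i : i;
}

/* ULP distance between a computed value and the reference; NaNs and
   infinities would need separate handling in a real harness. */
static uint64_t ulp_distance(double got, double want)
{
    int64_t a = ordered(got), b = ordered(want);
    /* Subtract as unsigned to avoid signed overflow on large gaps. */
    return a > b ? (uint64_t)a - (uint64_t)b : (uint64_t)b - (uint64_t)a;
}
```

A harness can then report the maximum of `ulp_distance(impl(x), ref(x))` over sampled inputs, and the point above is precisely that random sampling rarely hits the true worst case.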

Let me know if this discussion is better suited for this PR.

@PierreBlanchard commented on Mar 15, 2023

Hi @mattip!
I have a few naive questions about this PR.
What is going to happen to these routines once they are ported to numpy intrinsics? How are they going to be used?
Is the point to be able to use them on architectures other than Arm? Or is there more to it?
Are you still going to keep the SVML implementations, and use them on AVX512-enabled machines? Are you going to pick the fastest implementation?

How are they going to be maintained, and by whom? Since AOR implementations are continuously improving, they might diverge quite significantly from the last drop. So each update would basically consist of doing a full port from AOR to numpy intrinsics again. The situation would become even worse if they were also modified from within numpy.

An option to keep maintenance low would be to provide an AOR-to-NumPy interface. We just got rid of this feature, but we might be able to re-introduce it in a different way if necessary. The code would still have to be configured (pre-processed) so that it has the features NumPy expects.

NumPy intrinsics seem like an interesting concept; I'm just curious about the implications of trying to "upstream" external work into NumPy using this language.

@mattip (Member) commented on Mar 15, 2023

I am not sure why the question is directed at me, but I will try to supply some answers. In general, NumPy accepts contributions from many people, and as time goes by, if code is not maintainable or has outlived its usefulness, we remove it. This is especially true of routines that, like the inner loops of ufuncs, are not exposed to users. Thus we have added and dropped support for different compilers and hardware over the years: notably, support for the Apple Accelerate library was removed and then restored when it once again passed the acceptance tests.

The ufunc dispatch framework provides a way to choose the "most appropriate" inner loop, based on runtime detection of supported CPU features, and some heuristics (strides, contiguous data). This PR changes none of that, but the mechanism might need some tweaks if someone takes the time to analyze appropriate heuristics for various ARM processors.
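As a rough illustration of the dispatch idea (NumPy's real machinery uses generated dispatch headers and macros rather than anything this simple; all names below are hand-rolled stand-ins, not NumPy's API):

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*cos_loop_t)(const double *src, double *dst, size_t len);

/* Variants compiled for different targets (definitions elsewhere). */
void cos_loop_scalar(const double *src, double *dst, size_t len);
void cos_loop_asimd(const double *src, double *dst, size_t len);

/* Stand-in for runtime CPU feature detection. */
bool cpu_has_asimd(void);

/* Resolved once at startup: pick the best loop this CPU supports. */
static cos_loop_t resolve_cos_loop(void)
{
    return cpu_has_asimd() ? cos_loop_asimd : cos_loop_scalar;
}
```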

Of course we would prefer all these routines be rewritten in universal intrinsics. But we understand that expertise in them is hard to find. This is true for both SVML and AOR: once we have generic routines we can remove the architecture-specific ones.

Integrating a new version of a vendored dependency is tricky. If the AOR team feels the library is not stable and the interfaces will change drastically, perhaps we should hold off with integration until it stabilizes. That question should be answered by the AOR team.

Personally, while I welcome the contributions, I am not sure the 2x performance increase justifies the additional complexity introduced in this PR. Are there routines or hardware where AOR gives a larger boost?

@Mousius (Member, Author) commented on Apr 25, 2023

Closing this in favour of #23399
