ENH: Integrate Optimized Routines for AArch64 #23171
Conversation
a18fe95 to 05a8073
Adds the initial support for using the Optimized Routines library to improve performance on AArch64. `cos` and `sin` are implemented to demonstrate the flow through to the library calls; more will be added in a follow-up patch.

See: https://mail.python.org/archives/list/[email protected]/message/GTHX4TFRUCGQI2VPHEWMEC4GBOAOOH4C/

Change-Id: Idb5fb312313e5577cc8db0edbef02707fabd7006
05a8073 to 141a2e5
Just some nits for now.
** $maxopt baseline avx512_skx
** $maxopt baseline
** avx512_skx
** asimd
I think `advsimd` is the preferred name.
This is the standard naming within numpy itself; I don't think there's a strong enough reason to change it?
Slightly apologetic comment: AOR/pl has recently removed some symbols, and likely AOR/math will at some point soon as well, so the way function names are chosen is not future-proof (and won't work at all if you want any routines from pl/). How are you building this? I don't see any config.mk file in this patch, so I'm interested in how AOR is being configured.
@@ -74,7 +92,11 @@ simd_@func@_f64(const double *src, npy_intp ssrc,
} else {
x = npyv_loadn_tillz_f64(src, ssrc, len);
}
#if defined(_NPY_OPTIMIZED_ROUTINES)
npyv_f64 out = __v_@func@(x);
In the pl/ subdir of AOR, and likely in math/ as well, the `__v_`-prefixed symbols are being removed (see for example this commit). From now on the only reliably supported names are the VFABI-mangled variants, so I don't think this line is future-proof.
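For context on the VFABI naming: the mangled symbol encodes the ISA, masking, lane count and argument kinds. Below is a minimal sketch of what the call site could look like against the mangled name instead of `__v_cos`, assuming the unmasked AdvSIMD two-lane double variant and plain NEON types rather than NumPy's `npyv_` wrappers; the declaration is written out by hand here and would normally come from the AOR headers.

```c
#include <arm_neon.h>

/* Vector Function ABI name for cos: _ZGV + 'n' (AdvSIMD) + 'N' (unmasked)
 * + '2' (two lanes) + 'v' (one vector argument) + '_cos'.
 * Provided by AOR's math library when linked; declared by hand for this sketch. */
float64x2_t _ZGVnN2v_cos(float64x2_t x);

static float64x2_t vec_cos_sketch(float64x2_t x)
{
    /* Same computation as __v_cos(x), but through the mangled, stable name. */
    return _ZGVnN2v_cos(x);
}
```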
Argh, this makes it slightly more irritating than the existing SVML integration; I can change these when I integrate the next set of functions, though.
I'm just building the bits I need as a submodule in
So the C files are just compiled independently of the AOR Makefiles? Makes sense, but since AOR is designed to be built with a config file, you may end up missing certain features (or even routines) that are enabled based on settings in the config file, for instance without
Likewise, but if the tests are passing I would consider that sufficient for now. I'm keen not to add too many additional flags if I can avoid it, especially given they're not namespaced to Optimized Routines and so could cause other software to compile weirdly.
I understand what you're saying, but I think this could become very difficult to maintain for a couple of reasons.

One is that AOR routines are not always completely self-contained in one source file; there may be helper routines or coefficient arrays in a different TU (not the case for sin/cos, so it was fine for this patch). These get moved around from time to time; we don't make any promises about what files will exist, and indeed we are in the process of shuffling some things around. The VFABI-mangled symbols don't come from

As well, there are certain flags that at some point you will need in your compiler invocation; for instance you can't build SVE routines without explicitly enabling them, either in the config file or with
I think that's OK, as from the brief discussion with @mattip the focus should be on moving to universal intrinsics in
Perusing the source repo, I came across this file. Is this the implementation of
I would think that would exceed our tests, so I must be looking at the wrong code.
Interesting, as the original SVML implementation is 4 ULP (#19478).
We are transitioning from
Adding the submodule means the routines were compiled in, but how do we know they were used at runtime? The only information we get is that the CPU detection kicked in, which should then choose the correct inner loops. Which features are required for these routines?
@@ -1,3 +1,5 @@
# Copyright 2023 Arm Limited and/or its affiliates <[email protected]>
Why are you adding a copyright to this file?
For the small amount of additional logic changed to pull the submodule.
I understand that legal teams like copyright notices, but we don't have a habit of adding copyright notices to files, and at least right now I would prefer to keep it that way and have you explain this to the legal team at ARM.
If we really did, some of these files would be littered with personal or company copyrights (Quansight, NVIDIA, Intel, Apple, ...) and probably even more universities...
If the legal team needs a more definite no from us, I suspect we can just give that.
@seberg I've raised this with our team and I'll let you know the outcome; I have to try and follow best practices first, but it's relatively common for us to have a reasonable discussion about what suits the project 😸
I would point out that there are a number of explicit copyright claims throughout the codebase, which implies this is a practice; maybe it's worth considering how we make that more consistent?
They all trigger from ASIMD, which should be standard on AArch64 machines.
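For anyone who wants to confirm the prerequisite outside of NumPy's own dispatcher: on Linux/AArch64 the AdvSIMD hwcap can be read directly from the auxiliary vector. A minimal sketch, not how NumPy itself detects CPU features, just an independent check:

```c
#include <stdio.h>
#include <sys/auxv.h>    /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>   /* HWCAP_ASIMD (Linux, AArch64 only) */

int main(void)
{
    unsigned long hwcaps = getauxval(AT_HWCAP);
    if (hwcaps & HWCAP_ASIMD) {
        puts("AdvSIMD (ASIMD) reported by the kernel: AOR-backed loops are eligible");
    }
    else {
        puts("AdvSIMD not reported: scalar fallbacks would be used");
    }
    return 0;
}
```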
Indeed,
Hmm, so I'm guessing this is all fine in terms of the tests, or are there further concerns from your point of view?
Regarding accuracy, our design principle is to have a threshold of 3.5 ULP over the entire range of libm routines, just as SVML has a 4 ULP threshold. Unfortunately we cannot provide better accuracy than a given reference on a per-routine basis, which is why all libraries use a single threshold (glibc's libmvec has 4 ULP too). Am I right in thinking your test thresholds are based on your own evaluation of ULP errors, or are these ULP errors provided by SVML?

Evaluating the maximum error for double-precision routines is notoriously hard and fairly inaccurate when done by sampling the whole domain, or even a reduced interval, randomly (even with billions of points). This is also a reason why AOR routines might appear to pass the tests even with maximum errors larger than the current threshold: a random set of test inputs will likely not trigger the worst cases, even with millions of inputs.

Let me know if this discussion is better suited for this PR.
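To make the threshold discussion concrete, here is a rough sketch of how a per-result ULP error can be measured against a reference that has already been rounded to double; it ignores zeros, infinities, NaNs and subnormals, and is only meant to illustrate the unit the 3.5/4 ULP figures are expressed in.

```c
#include <math.h>

/* Approximate error of 'got' relative to 'ref', in units of ulp(ref).
 * With frexp's convention ref = f * 2^e, 0.5 <= |f| < 1, the spacing of
 * doubles around ref is 2^(e - 53). Edge cases are deliberately ignored. */
static double ulp_error(double got, double ref)
{
    int e;
    frexp(ref, &e);                   /* only the exponent is needed */
    double ulp = ldexp(1.0, e - 53);  /* 2^(e - 53), spacing near ref */
    return fabs(got - ref) / ulp;
}
```

A 3.5 ULP threshold then simply means ulp_error(result, reference) stays at or below 3.5 for every tested input.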
Hi @mattip! How are they going to be maintained, and by whom? Since AOR implementations are continuously improving, they might diverge from the last drop quite significantly, so each update would basically consist of doing a full port again from AOR to NumPy intrinsics. The situation would become much worse if they were also modified from within NumPy.

An option to keep maintenance low would be to provide an AOR-to-NumPy interface. We just got rid of this feature but might be able to reintroduce it in a different way, if necessary. The code would still have to be configured (pre-processed) so it has the features NumPy expects.

NumPy intrinsics seem like an interesting concept; I'm just curious about the implications when trying to "upstream" external work into NumPy using this language.
I am not sure why the question is directed at me, but I will try to supply some answers.

In general, NumPy accepts contributions from many people, and as time goes by, if the code is not maintainable or has outlived its usefulness, we remove it. This is especially true of routines that, like the inner loops of ufuncs, are not exposed to users. Thus we have added and dropped support for different compilers and hardware over the years: notably, support for the Apple Accelerate library was removed and then restored when it once again passed the acceptance tests.

The ufunc dispatch framework provides a way to choose the "most appropriate" inner loop, based on runtime detection of supported CPU features and some heuristics (strides, contiguous data). This PR changes none of that, but the mechanism might need some tweaks if someone takes the time to analyze appropriate heuristics for various ARM processors.

Of course we would prefer all these routines be rewritten in universal intrinsics, but we understand that expertise in them is hard to find. This is true for both SVML and AOR: once we have generic routines we can remove the architecture-specific ones.

Integrating a new version of a vendored dependency is tricky. If the AOR team feels the library is not stable and the interfaces will change drastically, perhaps we should hold off on integration until it stabilizes. That question should be answered by the AOR team.

Personally, while I welcome the contributions, I am not sure the 2x performance increase justifies the additional complexity introduced in this PR. Are there routines or hardware where AOR gives a larger boost?
Closing this in favour of #23399
Adds the initial support for using the Optimized Routines library to improve performance on AArch64. `cos` and `sin` are implemented to demonstrate the flow through to the library calls; more will be added in a follow-up patch to align with the existing SVML integration.

I've updated both setup.py and meson.build, but I'm unsure which gets triggered when 🤔
See: https://mail.python.org/archives/list/[email protected]/message/GTHX4TFRUCGQI2VPHEWMEC4GBOAOOH4C/