!!! NOT FOR REVIEW !!! [SVE] Example HAL-compatible SVE code for Linear Resize. #20640
Conversation
float *D0 = dst[k];
float *D1 = dst[k+1];

for(dx = 0;dx < len0; dx += nlanes)
This code reproduces the algorithm written with HAL for the operator() of HResizeLinearVec_X4 at https://github.com/opencv/opencv/blob/master/modules/imgproc/src/resize.cpp#L1583.
The code is written with SVE intrinsics, in a way that can be transferred directly [*] to HAL code: no fancy SVE features are used, other than runtime increments.
[*] Please notice that the SVE code uses vector conversion of integer to float types, while the original code uses scalar conversion followed by vector construction. The "build the integer vector first and then convert to float" approach might also be used in the original code without breaking HAL compatibility, because HAL has vector conversions of integers to float.
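As an illustration, a minimal standalone sketch of the "widen and convert in vector registers" approach with SVE ACLE intrinsics (the function and pointer names are placeholders, not code from this PR):

#include <arm_sve.h>
#include <stdint.h>

// Hedged sketch: widen u16 source values to s32 lanes and convert them to f32
// entirely in vector registers, instead of converting scalars one by one and
// then building the vector. Names (u16_to_f32_sve, src, dst, n) are illustrative.
#if defined(__ARM_FEATURE_SVE)
static void u16_to_f32_sve(const uint16_t *src, float *dst, int64_t n)
{
    for (int64_t i = 0; i < n; i += (int64_t)svcntw())
    {
        svbool_t pg = svwhilelt_b32(i, n);
        svint32_t   vi = svld1uh_s32(pg, src + i);   // load u16, zero-extend to s32 lanes
        svfloat32_t vf = svcvt_f32_s32_x(pg, vi);    // vector s32 -> f32 conversion
        svst1_f32(pg, dst + i, vf);
    }
}
#endif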
Integer arithmetic should be preferred in algorithms if it is possible, to avoid accuracy loss due to FP processing. All bit-exact processing is integer-based (EXACT / BE modes).
Some implementations use a "working type" (WT) to define where such conversion happens (or not).
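For readers unfamiliar with the pattern, a tiny illustrative sketch of the "working type" idea (the names below are placeholders, not the actual resize templates):

// Hedged sketch of the "working type" pattern: the storage type T is promoted
// to WT for the intermediate arithmetic, so the choice of WT (int vs. float)
// decides where, or whether, the integer -> FP conversion happens.
template <typename T, typename WT>
static inline T lerp_wt(T a, T b, WT alpha)   // alpha in [0, 1] for a floating WT
{
    WT r = (WT)a + ((WT)b - (WT)a) * alpha;
    return (T)r;
}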
Integer arithmetic should be preferred in algorithms if it is possible, to avoid accuracy loss due to FP processing. All bit-exact processing is integer-based (EXACT / BE modes).
I don't think there is any issue in loss of precision here? The original code does:
1. convert four u16 into s32, one by one
2. convert the four s32 into f32
3. build a vector of four lanes of f32
4. do some math on the f32x4 vectors

What we do in the SVE version is:
1. build a vector of u16
2. widen such vector to a vector of s32
3. convert the s32 vector into an f32 vector
4. do some math on the f32 vector.

The operations at 1.2.3. are the same in both cases, and the ordering of the FP operations on the f32 vectors in 4. is the same for both cases (unless I am missing something).
My assessment here is that there is no accuracy loss? Or have I missed something?
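As an illustration, the "build integer vector, widen, convert" ordering can also be expressed with the existing universal intrinsics; a minimal sketch, assuming a build with CV_SIMD enabled (the helper name is illustrative, not from this PR):

#include "opencv2/core/hal/intrin.hpp"

// Hedged sketch, not the PR code: steps 1-3 above in HAL vocabulary.
// u16 values fit in the positive s32 range, so the reinterpretation below
// is value-preserving.
static inline cv::v_float32 load_u16_as_f32(const ushort* src)
{
    cv::v_uint32 u = cv::vx_load_expand(src);              // steps 1-2: load + widen
    return cv::v_cvt_f32(cv::v_reinterpret_as_s32(u));     // step 3: vector convert to f32
}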
The main point about integer/floating point computations is that we should not change them between SIMD backends (keep the existing processing).
Oh - I see. For the sake of writing a 100% compatible version, we would need to add LUT constructors to the SIMD types, so that we can do the scalar u16 -> f32 conversion before creating the FP vector.
I could do that, but given that this code is not supposed to be merged into the main branch, I don't see a point in doing it?
#ifdef __ARM_FEATURE_SVE
// Accuracy test at: ./bin/opencv_test_imgproc --gtest_filter="Imgproc_Resize*"
struct HResizeLinearVec_16u32f_SVE
Thank you! This part with straightforward loops looks good.
It makes sense to take a look at the code with several CV_SIMD_WIDTH checks (to handle interleaved data):
https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2669
We need to understand how to deal with them.
Hi @alalek - I have looked into https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2669.
For the sake of simplicity, I focused on the 4-channel part at https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2755, which I report here for reference:
{
CV_Assert(cn == 4);
for (; dx <= w - v_int16::nlanes; dx += v_int16::nlanes, S0 += 2 * v_int16::nlanes, S1 += 2 * v_int16::nlanes, D += v_int16::nlanes)
{
#if CV_SIMD_WIDTH >= 64
v_int64 r00, r01, r10, r11;
v_load_deinterleave((int64_t*)S0, r00, r01);
v_load_deinterleave((int64_t*)S1, r10, r11);
v_int32 r00l, r01l, r10l, r11l, r00h, r01h, r10h, r11h;
v_expand(v_reinterpret_as_s16(r00), r00l, r00h);
v_expand(v_reinterpret_as_s16(r01), r01l, r01h);
v_expand(v_reinterpret_as_s16(r10), r10l, r10h);
v_expand(v_reinterpret_as_s16(r11), r11l, r11h);
v_store(D, v_rshr_pack<2>(r00l + r01l + r10l + r11l, r00h + r01h + r10h + r11h));
#else
v_int32 r0, r1, r2, r3;
r0 = vx_load_expand(S0 ) + vx_load_expand(S1 );
r1 = vx_load_expand(S0 + v_int32::nlanes) + vx_load_expand(S1 + v_int32::nlanes);
r2 = vx_load_expand(S0 + 2*v_int32::nlanes) + vx_load_expand(S1 + 2*v_int32::nlanes);
r3 = vx_load_expand(S0 + 3*v_int32::nlanes) + vx_load_expand(S1 + 3*v_int32::nlanes);
v_int32 dl, dh;
#if CV_SIMD_WIDTH == 16
dl = r0 + r1; dh = r2 + r3;
#elif CV_SIMD_WIDTH == 32
v_int32 t0, t1, t2, t3;
v_recombine(r0, r1, t0, t1); v_recombine(r2, r3, t2, t3);
dl = t0 + t1; dh = t2 + t3;
#endif
v_store(D, v_rshr_pack<2>(dl, dh));
#endif
}
}
I have been trying to understand how this could have worked with SVE, in a configuration in which we would have had to compile/run the code for different implementations of SVE, say with register sizes (VL, in bytes) of 16, 32, 48, 64, and so on.
I couldn't figure out why the code has different implementations for different values of CV_SIMD_WIDTH.
In fact, the code doesn't seem to be using the size of the registers at all. For example, in a situation in which I need to build vectors out of scalars, I could imagine custom code for each of the 16- and 32-byte vector register sizes. In the latter case, I would need double the number of variables. For example:
#if CV_SIMD_WIDTH == 16
int a0, a1, a2, a3;
//code
v_int32 x = v_int32(a0,a1,a2,a3);
#elif CV_SIMD_WIDTH == 32
int a0, a1, a2, a3, a4, a5, a6, a7;
v_int32 x = v_int32(a0,a1,a2,a3, a4, a5, a6, a7);
#endif
In the code at https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2755, the macro CV_SIMD_WIDTH seems instead to be used as a way of determining whether the target supports specific features. For example, in the CV_SIMD_WIDTH >= 64 branch it seems to be using v_load_deinterleave followed by v_expand because the avx512 header file doesn't seem to have a definition of vx_load_expand:
grep --color -nH -e vx_load_expand intrin_avx512.hpp
Grep finished with no matches found at Thu Sep 2 14:38:17
vx_load_expand is instead used in the else branch, where CV_SIMD_WIDTH is less than 64. It turns out that vx_load_expand is indeed defined with a fallback implementation in the intrin_cpp.hpp header file.
Please correct me if I am wrong, but it seems to me that the checks on CV_SIMD_WIDTH here are used to discern target features (whether a specific intrinsic is available or not), and not to customize the code for different vector lengths. My guess is that determining features out of different vector lengths is good enough if we confine ourselves to the x86 set of vector extensions. It doesn't seem to work well, though, outside the x86 realm. It definitely doesn't work well for NEON, as it seems that there is a custom implementation of ResizeAreaFastVec_SIMD_16s written with NEON intrinsics at https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2328, guarded by #if CV_NEON.
To summarize, I don't think we should use CV_SIMD_WIDTH to write custom versions of SVE code depending on the vector size. Even if we end up building a version of OpenCV for each of the vector sizes that SVE supports, they will all be using the same code (other than the increments in the loops), because the instruction set will be the same in all cases.
I dare say that maybe the conditions using CV_SIMD_WIDTH should be replaced by macros that guard the availability of architectural features? For example, #if CV_SIMD_WIDTH >= 64 could become #ifdef CV_HAS_DEINTERLEAVE_EXPAND.
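Just to make that concrete, a purely hypothetical sketch (CV_HAS_DEINTERLEAVE_EXPAND does not exist in OpenCV; it is only the kind of name suggested above):

// Hypothetical staging of the idea: a backend header could define the feature
// macro (today it would be derived from the width)...
#if CV_SIMD_WIDTH >= 64
#define CV_HAS_DEINTERLEAVE_EXPAND 1
#endif

// ...and algorithm code would then test the feature instead of the width:
#ifdef CV_HAS_DEINTERLEAVE_EXPAND
// v_load_deinterleave + v_expand variant
#else
// vx_load_expand variant
#endif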
Please let me know if my assessment makes sense to you or not. I am also including @vpisarev in this thread because I would like to hear his opinion too.
Apologies for the long message, but I want to make sure I am not leaving things unclear.
grep --color -nH -e vx_load_expand intrin_avx512.hpp
vx_ functions are macros which are automatically mapped onto functions for the max available/requested SIMD_WIDTH (v_ / v256_ / v512_).
avx512 code has v512_load_expand for that (see the macro OPENCV_HAL_IMPL_AVX512_EXPAND).
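A simplified sketch of that dispatch idea (not the actual intrin.hpp code, which is more involved; it assumes the fixed-width intrinsics headers are already included):

// Simplified sketch only: the width-agnostic vx_* entry point forwards to the
// widest fixed-width implementation selected at configuration time.
#if CV_SIMD512
inline v_int32 vx_load_expand(const short* ptr) { return v512_load_expand(ptr); }
#elif CV_SIMD256
inline v_int32 vx_load_expand(const short* ptr) { return v256_load_expand(ptr); }
#else
inline v_int32 vx_load_expand(const short* ptr) { return v_load_expand(ptr); }
#endif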
fallback implementation in the intrin_cpp.hpp header file
It is not used for normal builds (there is a special "emulation" configuration).
Just in case, there are tests for SIMD UI backends: https://github.com/opencv/opencv/blob/4.5.3/modules/core/test/test_intrin_utils.hpp
They are used to verify that the implemented types/functions behave in the same way.
Almost all of them are regularly tested:
- NEON, SSE, AVX2, AVX512_SKX
- RVV (OpenCV CN CI)
- emulator C++ code is tested too (also used for API documentation)
- VSX, MSA - compilation only
- JavaScript SIMD - compilation only (build is broken for now on 4.x branch, but it still works on 3.4)
AFAIK, the main difference between these 2 code paths is the number of load instructions: 4 = 2*2 (2 per v_load_deinterleave) vs 8*1 (1 half-SIMD-register load per vx_load_expand).
Perhaps it is done due to the lack of int64 support in SIMD128.
replaced by macros that guard the availability of architectural features
It makes sense. At least we have CV_SIMD128_64F for that.
#if CV_NEON
This code has been here for more than 4 years.
Maybe @vpisarev could comment on that.
replaced by macros that guard the availability of architectural features
It makes sense. At least we have CV_SIMD128_64F for that.
OK, are we agreeing that CV_SIMD_WIDTH is not used to customize the code based on the actual vector size, but to determine which architectural features are available?
If that's the case, 1. we don't need to think about any implication between the use of CV_SIMD_WIDTH and the <vector_type>::nlanes enum, and 2. it would be a good idea to remove CV_SIMD_WIDTH from the code and replace its uses with macros that have names resembling the architectural features they are guarding.
In particular, 1. means that we don't have to figure out what to do with SVE code to handle different values of CV_SIMD_WIDTH.
This doesn't mean that CV_SIMD_WIDTH-like checks can be eliminated everywhere. I mean, if architectural feature macros are applicable and clear to understand for developers, then they may be used.
See here: https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2696-L2716
What "architectural features" should be applied in that case?
This doesn't mean that CV_SIMD_WIDTH-like checks can be eliminated everywhere. I mean, if architectural feature macros are applicable and clear to understand for developers, then they may be used.
I've looked at the code base in more detail; there are a couple of cases in which CV_SIMD_WIDTH seems to be used for code that is specific to the size of the vector registers. For example:
- For setting up variables with data. In this case, we know we can replace it with a constructor that uses pointers.
- For processing loop tails. In this case it is not clear to me what the benefit of using if is when CV_SIMD_WIDTH is less than 16. Maybe a slightly more complicated CFG, but likely not crucial in terms of performance.
- More loop tail processing. Here the equivalent C expression if (v_uint16::nlanes * 2 > 16) would generate the same code for the nlanes enum, and even if (get_nlanes(v_uint16) * 2 > 16) would do the same.
This is not an exhaustive list, but the more I look into the code base, the more it seems to me that everything CV_SIMD_WIDTH-specific can be replaced with something that does not depend on the size of the vectors.
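As a small illustration of the last point, a self-contained sketch (assuming the classic fixed-width universal intrinsics, where nlanes is a compile-time constant; the helper name is illustrative):

#include "opencv2/core/hal/intrin.hpp"

// Illustrative only: for 16-bit lanes, nlanes * sizeof(lane) equals the vector
// width in bytes, so this guard expresses the same condition as
// "CV_SIMD_WIDTH > 16" without naming CV_SIMD_WIDTH.
static inline bool needs_wide_tail_path()
{
    return cv::v_uint16::nlanes * 2 > 16;
}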
See here: https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2696-L2716
What "architectural features" should be applied in that case?
The code here also has nothing in it that relates to the size of the vectors. I suspect that the CV_SIMD_WIDTH == 32 vs CV_SIMD_WIDTH == 64 split is related to the fact that AVX(2) has 16 vector registers, while AVX512 has 32 vector registers (so it can deal with more variables). This seems to be related to architectural features again, not to vector size.
struct HResizeLinearVec_16u32f_SVE
{
The original code is a template for all types.
This variant is targeted at a single type set.
OpenCV SIMD UI is designed to support templates.
Yeah, this is just a quick mock-up of an SVE function to replace the original version written with SIMD UI. Of course, we will have to use the template version.
Just to make sure - are you saying that we require the SIMD types <vector_type> := v_float32 | v_float32x4 | ... v_int64 | v_int64x2 | ... to be compatible with uses where they are passed as template parameters? We can do that with SVE: https://godbolt.org/z/vxGjcK1eY
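For reference, a minimal sketch of how that can look, assuming a fixed-length build with -msve-vector-bits=<N> so that __ARM_FEATURE_SVE_BITS is defined (the wrapper and function names below are illustrative, not taken from the godbolt link):

#include <arm_sve.h>

#if defined(__ARM_FEATURE_SVE_BITS)
// Fixed-length alias of the sizeless svfloat32_t, legal as a class data member.
typedef svfloat32_t fixed_f32 __attribute__((arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS)));

// Hypothetical SVE-backed vector wrapper in the spirit of the SIMD UI types.
struct v_float32_sve
{
    fixed_f32 val;
    enum { nlanes = __ARM_FEATURE_SVE_BITS / 32 };
};

// An ordinary function template taking the vector type as a parameter,
// the way the SIMD UI resize templates do.
template <typename VecT>
static inline VecT vec_add(VecT a, VecT b)
{
    VecT r;
    r.val = svadd_f32_x(svptrue_b32(), a.val, b.val);
    return r;
}

// Usage: v_float32_sve s = vec_add<v_float32_sve>(x, y);
#endif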
New scalable universal intrinsics design proposed:
Info
This is the SVE example requested by @alalek in #20562 - it does not use RGB data, but it shows how to write SVE code in a format that is compatible with HAL. I will reference this code in my reply to #20562.
Please notice that this PR is not intended for merging in the production code - it is here just to be used as an example.
Instructions for AArch64 host
Configuration on an aarch64 system, with a recent GCC with support for SVE ACLE (a.k.a. "C intrinsics"):
CC=gcc-10 CXX=g++-10 cmake ../opencv -DOPENCV_EXTRA_CXX_FLAGS=-march=armv8-a+sve -GNinja
Testing with a recent qemu that supports SVE:
qemu-aarch64 ./bin/opencv_test_imgproc --gtest_filter="Imgproc_Resize*"
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.