
Conversation

fpetrogalli
Contributor

Info

This is the SVE example requested by @alalek in #20562 - it does not use RGB data, but it shows how to write SVE code in a format that is compatible with HAL. I will reference this code in my reply to #20562.

Please note that this PR is not intended for merging into the production code - it is here just to be used as an example.

Instruction for AArch64 host

Configuration on an aarch64 system, with a recent GCC that supports the
SVE ACLE (a.k.a. the "C intrinsics"):

CC=gcc-10 CXX=g++-10 cmake ../opencv -DOPENCV_EXTRA_CXX_FLAGS=-march=armv8-a+sve -GNinja

Testing with a recent qemu that supports SVE:

qemu-aarch64 ./bin/opencv_test_imgproc --gtest_filter="Imgproc_Resize*"

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

float *D0 = dst[k];
float *D1 = dst[k+1];

for (dx = 0; dx < len0; dx += nlanes)
Contributor Author

This code reproduces the algorithm written with HAL for the operator() of HResizeLinearVec_X4 at line https://github.com/opencv/opencv/blob/master/modules/imgproc/src/resize.cpp#L1583

The code is written with SVE intrinsics, in a way that can be transferred directly [*] to HAL code: no fancy SVE features are used, other than runtime increments.

[*] Please note that the SVE code uses a vector conversion from integer to float types, while the original code uses scalar conversions followed by vector construction. The "build the integer vector first and then convert it to float" approach could also be used in the original code without breaking HAL compatibility, because HAL has vector conversions from integer to float.
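To illustrate why the reordering is harmless, here is a plain C++ sketch of the two conversion orders, using scalar stand-ins for the vector lanes (these are not the OpenCV or SVE intrinsics): every u16 value is exactly representable in f32, so both paths produce bit-identical lanes.

```cpp
#include <cstdint>

// Original HAL-style path: scalar u16 -> s32 conversion, then s32 -> f32,
// done lane by lane before the vector is built.
inline float scalar_then_build(uint16_t x) {
    int32_t widened = static_cast<int32_t>(x); // scalar widening
    return static_cast<float>(widened);        // scalar conversion
}

// SVE-style path: widen the whole integer vector first, then convert the
// widened vector to float. Per lane, the operation is identical.
inline float widen_then_convert(uint16_t x) {
    int32_t widened = x;                 // one lane of the widened vector
    return static_cast<float>(widened);  // one lane of the f32 vector
}
```

Since f32 has a 24-bit significand, every u16 (indeed, every integer up to 2^24) converts exactly, so swapping the order of widening and conversion cannot change the result.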

Member

Integer arithmetic should be preferred in algorithms where possible, to avoid accuracy loss due to FP processing. All bit-exact processing is integer-based (EXACT / BE modes).

Some implementations use "working type" (WT) to define where such conversion happens (or not).

Contributor Author

Integer arithmetic should be preferred in algorithms where possible, to avoid accuracy loss due to FP processing. All bit-exact processing is integer-based (EXACT / BE modes).

I don't think there is any loss-of-precision issue here. The original code does:

  1. convert four u16 into s32, one by one
  2. convert the four s32 into f32
  3. build a vector of four lanes of f32
  4. do some math on the f32x4 vectors

What we do in the SVE version is:

  1. build a vector of u16
  2. widen such vector to a vector of s32
  3. convert the s32 vector into an f32 vector
  4. do some math on the f32 vector.

The operations in steps 1-3 are the same in both cases, and the ordering of the FP operations on the f32 vectors in step 4 is the same in both cases (unless I am missing something).

My assessment is that there is no accuracy loss here. Or have I missed something?

Member

The main point about integer/floating-point computations is that we should not change them between SIMD backends (keep the existing processing).

Contributor Author

Oh - I see. For the sake of writing a 100% compatible version, we would need to add LUT constructors to the SIMD types so that we can do the scalar u16->f32 conversion before creating the FP vector.

I could do that, but given that this code is not supposed to be merged into the main branch, I don't see the point in doing it.


#ifdef __ARM_FEATURE_SVE
// Accuracy test at: ./bin/opencv_test_imgproc --gtest_filter="Imgproc_Resize*"
struct HResizeLinearVec_16u32f_SVE
Member

Thank you! This part with straightforward loops looks good.

It makes sense to take a look at the code with several CV_SIMD_WIDTH checks (to handle interleaved data):

https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2669

We need to understand how to deal with them.

Contributor Author

Hi @alalek - I have looked into https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2669.

For the sake of simplicity, I focused on the 4-channel part at https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2755, which I report here for reference:

        {
            CV_Assert(cn == 4);
            for (; dx <= w - v_int16::nlanes; dx += v_int16::nlanes, S0 += 2 * v_int16::nlanes, S1 += 2 * v_int16::nlanes, D += v_int16::nlanes)
            {
#if CV_SIMD_WIDTH >= 64
                v_int64 r00, r01, r10, r11;
                v_load_deinterleave((int64_t*)S0, r00, r01);
                v_load_deinterleave((int64_t*)S1, r10, r11);

                v_int32 r00l, r01l, r10l, r11l, r00h, r01h, r10h, r11h;
                v_expand(v_reinterpret_as_s16(r00), r00l, r00h);
                v_expand(v_reinterpret_as_s16(r01), r01l, r01h);
                v_expand(v_reinterpret_as_s16(r10), r10l, r10h);
                v_expand(v_reinterpret_as_s16(r11), r11l, r11h);
                v_store(D, v_rshr_pack<2>(r00l + r01l + r10l + r11l, r00h + r01h + r10h + r11h));
#else
                v_int32 r0, r1, r2, r3;
                r0 = vx_load_expand(S0                    ) + vx_load_expand(S1                    );
                r1 = vx_load_expand(S0 +   v_int32::nlanes) + vx_load_expand(S1 +   v_int32::nlanes);
                r2 = vx_load_expand(S0 + 2*v_int32::nlanes) + vx_load_expand(S1 + 2*v_int32::nlanes);
                r3 = vx_load_expand(S0 + 3*v_int32::nlanes) + vx_load_expand(S1 + 3*v_int32::nlanes);
                v_int32 dl, dh;
#if CV_SIMD_WIDTH == 16
                dl = r0 + r1; dh = r2 + r3;
#elif CV_SIMD_WIDTH == 32
                v_int32 t0, t1, t2, t3;
                v_recombine(r0, r1, t0, t1); v_recombine(r2, r3, t2, t3);
                dl = t0 + t1; dh = t2 + t3;
#endif
                v_store(D, v_rshr_pack<2>(dl, dh));
#endif
            }
        }

I have been trying to understand how this could work with SVE, in a configuration in which we would have to compile/run the code for different implementations of SVE, say with register sizes (VL, in bytes) of VL = 16, 32, 48, 64, and so on.

I couldn't figure out why the code has different implementations for different values of CV_SIMD_WIDTH.
In fact, the code doesn't seem to be using the size of the registers at all. For example, in a situation in which I need to build vectors out of scalars, I could imagine custom code for each of the 16- and 32-byte vector register sizes. In the latter case, I would need twice the number of variables. For example:

#if CV_SIMD_WIDTH == 16
int a0, a1, a2, a3;
//code
v_int32 x = v_int32(a0,a1,a2,a3);
#elif CV_SIMD_WIDTH == 32
int a0, a1, a2, a3, a4, a5, a6, a7;
v_int32 x = v_int32(a0,a1,a2,a3, a4, a5, a6, a7);
#endif

In the code at https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2755, the macro CV_SIMD_WIDTH seems instead to be used as a way of determining whether the target supports specific features. For example, the CV_SIMD_WIDTH >= 64 branch seems to use v_load_deinterleave followed by v_expand because the avx512 header file doesn't seem to have a definition of vx_load_expand:

grep --color -nH -e vx_load_expand intrin_avx512.hpp

Grep finished with no matches found at Thu Sep  2 14:38:17

vx_load_expand is instead used in the else branch, where CV_SIMD_WIDTH is less than 64. It turns out that vx_load_expand is indeed defined with a fallback implementation in the intrin_cpp.hpp header file.

Please correct me if I am wrong, but it seems to me that the checks on CV_SIMD_WIDTH here are used to discern target features (whether a specific intrinsic is available or not), and not to customize the code for different vector lengths. My guess is that inferring features from vector lengths is good enough if we confine ourselves to the x86 family of vector extensions. It doesn't seem to work well outside the x86 realm, though. It definitely doesn't work well for NEON, as there seems to be a custom implementation of ResizeAreaFastVec_SIMD_16s, written with NEON intrinsics, at https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2328, guarded by #if CV_NEON.

To summarize, I don't think we should use CV_SIMD_WIDTH to write custom versions of the SVE code depending on the vector size. Even if we end up building a version of OpenCV for each of the vector sizes that SVE supports, they will all use the same code (other than the increments in the loops), because the instruction set is the same in all cases.
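The "same code for every vector length" idea can be sketched in plain C++ (a scalar model, not real SVE intrinsics): the loop body is identical for any register size, and only the runtime increment, which SVE would obtain from an instruction such as svcntw(), changes.

```cpp
#include <algorithm>
#include <cstddef>

// Sum `len` floats with a vector-length-agnostic loop. The body does not
// depend on the lane count; only the runtime increment `nlanes` does.
// With real SVE, `nlanes` would come from svcntw() and the tail would be
// handled by a predicate rather than the inner std::min.
inline float vla_sum(const float* src, size_t len, size_t nlanes) {
    float acc = 0.0f;
    for (size_t i = 0; i < len; i += nlanes) {
        size_t chunk = std::min(nlanes, len - i);  // "predicated" tail
        for (size_t j = 0; j < chunk; j++)         // stands in for one vector op
            acc += src[i + j];
    }
    return acc;
}
```

The same source produces the correct result for any lane count, which is the point being made: a single binary can serve all SVE vector lengths.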

I dare say that maybe the conditions using CV_SIMD_WIDTH should be replaced by macros that guard the availability of architectural features. For example, #if CV_SIMD_WIDTH >= 64 could become #ifdef CV_HAS_DEINTERLEAVE_EXPAND.

Please let me know whether my assessment makes sense to you. I am also including @vpisarev in this thread because I would like to hear his opinion too.

Apologies for the long message, but I want to make sure I am not leaving things unclear.

Member

grep --color -nH -e vx_load_expand intrin_avx512.hpp

The vx_ functions are macros which are automatically mapped to the functions for the maximum available/requested SIMD_WIDTH (v_ / v256_ / v512_).
The avx512 code has v512_load_expand for that (see the macro OPENCV_HAL_IMPL_AVX512_EXPAND).

fallback implementation in the intrin_cpp.hpp header file

It is not used for normal builds (there is a special "emulation" configuration).


Just in case: there are tests for the SIMD UI backends: https://github.com/opencv/opencv/blob/4.5.3/modules/core/test/test_intrin_utils.hpp

They are used to verify that the implemented types/functions behave in the same way.

Almost all of them are regularly tested:

  • NEON, SSE, AVX2, AVX512_SKX
  • RVV (OpenCV CN CI)
  • emulator C++ code is tested too (also used for API documentation)
  • VSX, MSA - compilation only
  • JavaScript SIMD - compilation only (the build is broken for now on the 4.x branch, but it still works on 3.4)

AFAIK, the main difference between these 2 code paths is the number of load instructions: 4 = 2x2 (2 per v_load_deinterleave) vs 8 = 8x1 (1 half-SIMD-register load per vx_load_expand).

Perhaps it is done due to the lack of int64 support in SIMD128.


replaced by macros that guard the availability of architectural features

It makes sense. At least we have CV_SIMD128_64F for that.


#if CV_NEON

This code is here > 4 years.
May be @vpisarev could comment on that.

Contributor Author

replaced by macros that guard the availability of architectural features

It makes sense. At least we have CV_SIMD128_64F for that.

OK, are we agreeing that CV_SIMD_WIDTH is not used to customize the code based on the actual vector size, but to determine which architectural features are available?
If that's the case, 1. we don't need to think about any implication between the use of CV_SIMD_WIDTH and the enum <vector_type>::nlanes, and 2. it would be a good idea to remove CV_SIMD_WIDTH from the code and replace its uses with macros whose names resemble the architectural features they are guarding.

In particular, 1. means that we don't have to figure out what to do in the SVE code to handle different values of CV_SIMD_WIDTH.

Member

@alalek alalek Sep 2, 2021

This doesn't mean that CV_SIMD_WIDTH-like checks can be eliminated everywhere. I mean, if architectural-feature macros are applicable and clear for developers to understand, then they may be used.

See here: https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2696-L2716
What "architectural features" should be applied in that case?

Contributor Author

This doesn't mean that CV_SIMD_WIDTH-like checks can be eliminated everywhere. I mean, if architectural-feature macros are applicable and clear for developers to understand, then they may be used.

I've looked at the code base in more detail; there are a couple of cases in which CV_SIMD_WIDTH seems to be used for code that is specific to the size of the vector registers. For example:

  1. For setting up variables with data. In this case, we know we can replace it with a constructor that uses pointers.
  2. For processing loop tails. In this case it is not clear to me what the benefit of the if is when CV_SIMD_WIDTH is less than 16. Maybe a slightly more complicated CFG, but likely not crucial in terms of performance.
  3. More loop-tail processing. Here the equivalent C++ expression if (v_uint16::nlanes * 2 > 16) would generate the same code, because nlanes is an enum (a compile-time constant), and even if (get_nlanes(v_uint16) * 2 > 16) would do the same.

This is not an exhaustive list, but the more I look at the code base, the more it seems that everything CV_SIMD_WIDTH-specific can be replaced with something that does not depend on the size of the vectors.
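Point 3 can be sketched with a mock type (the name and lane count are illustrative, not the real OpenCV types): because nlanes is a compile-time constant, the branch condition folds at compile time just as a preprocessor check would.

```cpp
// Mock of a universal-intrinsic type exposing its lane count as an enum,
// in the style of the OpenCV SIMD types. The lane count here is an
// arbitrary example value (8 lanes of u16 = 128-bit registers).
struct mock_v_uint16 { enum { nlanes = 8 }; };

// The condition is a compile-time constant, so the compiler folds the
// branch exactly as an #if CV_SIMD_WIDTH check would: there is no runtime
// cost to writing it as a plain if.
inline int tail_strategy() {
    if (mock_v_uint16::nlanes * 2 > 16)
        return 1;  // path that would be taken on wider registers
    return 0;      // path taken on 128-bit registers
}
```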

See here: https://github.com/opencv/opencv/blob/4.5.3/modules/imgproc/src/resize.cpp#L2696-L2716
What "architectural features" should be applied in that case?

The code here also has nothing in it that relates to the size of the vectors. I suspect that the CV_SIMD_WIDTH == 32 vs CV_SIMD_WIDTH == 64 split is related to the fact that AVX(2) has 16 vector registers, while AVX512 has 32 vector registers (so it can deal with more variables). This again seems to be related to architectural features, not to vector size.

Comment on lines +1805 to +1806
struct HResizeLinearVec_16u32f_SVE
{
Member

The original code is a template for all types.
This variant targets a single set of types.

OpenCV SIMD UI is designed to support templates.

Contributor Author

Yeah, this is just a quick mock-up of an SVE function to replace the original version written with the SIMD UI. Of course, we will have to use the template version.

Just to make sure - are you saying that we require the SIMD types <vector_type> := v_float32|v_float32x4|... v_int64|v_int64x2|... to be compatible with uses where they are passed as template parameters? We can do that with SVE: https://godbolt.org/z/vxGjcK1eY
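A minimal sketch of what "usable as a template parameter" means here, with mock fixed-width types (hypothetical names, not the real OpenCV or SVE wrapper types):

```cpp
#include <cstddef>

// Two mock SIMD types exposing the interface that the SIMD UI templates
// rely on: a nlanes constant and a lane type. Names are illustrative only.
struct mock_v_float32x4 { enum { nlanes = 4 }; typedef float     lane_type; };
struct mock_v_int64x2   { enum { nlanes = 2 }; typedef long long lane_type; };

// A function template in the style of the SIMD UI kernels: it is
// instantiated with the vector type as a template parameter, which is the
// property the SVE-backed types would have to preserve.
template <typename VecT>
size_t lanes_per_iteration() { return VecT::nlanes; }
```

The key design point is that the vector type carries its lane count as a compile-time member, so generic kernels can be written once and instantiated per type.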

float *D0 = dst[k];
float *D1 = dst[k+1];

for (dx = 0; dx < len0; dx += nlanes)
Member

Integer arithmetic should be preferred in algorithms where possible, to avoid accuracy loss due to FP processing. All bit-exact processing is integer-based (EXACT / BE modes).

Some implementations use "working type" (WT) to define where such conversion happens (or not).

@asmorkalov asmorkalov closed this Sep 16, 2022