
Conversation

@fengyuentau
Member

@fengyuentau fengyuentau commented Mar 4, 2024

Depends on #25872.

Merge with opencv/opencv_extra#1189.

Part of the acceleration of ViTs inference with dnn.

Perf

Tested with test case Layer_Elementwise.elementwise/0 on the platforms below. Data in milliseconds.

Khadas VIM4 (A311D2 SoC)

Geometric mean (ms)

                             Name of Test                                 gelu   gelu.patch gelu.patch
                                                                                                vs
                                                                                               gelu
                                                                                            (x-factor)
VIT_B_32::DNNTestNetwork::OCV/CPU                                       245.312   233.760      1.05
elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/CPU)       1.604     0.938       1.71

Intel i7-12700K

Geometric mean (ms)

                             Name of Test                                gelu   gelu.patch gelu.patch
                                                                                               vs
                                                                                              gelu
                                                                                           (x-factor)
VIT_B_32::DNNTestNetwork::OCV/CPU                                       40.041    38.357      1.04
elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/CPU)       0.111    0.037       2.99

Apple M1

Geometric mean (ms)

                             Name of Test                                gelu  gelu.patch gelu.patch
                                                                                              vs    
                                                                                             gelu   
                                                                                          (x-factor)
VIT_B_32::DNNTestNetwork::OCV/CPU                                       85.868   75.852      1.13   
VIT_B_32::DNNTestNetwork::OCV/CPU_FP16                                  84.684   74.188      1.14   
elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/CPU)      0.734    0.119       6.17   
elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/CPU_FP16) 0.769    0.123       6.24

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux OpenCL,Win64 OpenCL

@vpisarev
Contributor

vpisarev commented Mar 4, 2024

@fengyuentau, excellent! is there such a dramatic difference really?

btw, I played a bit with 'erf' approximation, as GELU can be computed exactly via erf: GELU(x) = x/2*(1 + erf(x/sqrt(2))) and found that erf can be computed very accurately without any v_select() and without v_exp():

#include <math.h>
#include <algorithm>
#include <stdio.h>

int main(int argc, char** argv) {
    float maxerr = 0.f;
    for (int i = -100000; i < 100000000; i++) {
        float x0 = i*0.0001f;
        float y0 = erff(x0);
        float x = fabsf(x0), sx = x0 >= 0 ? 1.f : -1.f;
        // see https://en.wikipedia.org/wiki/Error_function; one of Abramowitz and Stegun approximations
        float d = (((((0.0000430638f*x + 0.0002765672f)*x + 0.0001520143f)*x +
            0.0092705272f)*x + 0.0422820123f)*x + 0.0705230784f)*x + 1.f;
        d = 1.f/d;        
        d *= d; d *= d;
        d *= d; d *= d;
        float y1 = sx*(1.f - d);
        float err = fabsf(y0 - y1);
        maxerr = std::max(maxerr, err);
    }
    printf("maxerr=%.3g\n", maxerr);
    return 0;
}

note that extracting the sign sx and then putting it back (sx*...) can be done with element-wise v_and() and v_or() intrinsics and the sign mask 0x80000000 - no select is needed here.

Please, try it out.
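For illustration, the sign trick can be sketched in scalar code (a hedged sketch: `apply_sign_trick` is a hypothetical name, and the real kernel would operate on whole registers with v_and()/v_or() universal intrinsics rather than on scalars):

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of the v_and()/v_or() sign trick. Assumes f(|x|) >= 0,
// which holds for the approximation above since 1 - d is in [0, 1).
static inline float apply_sign_trick(float x, float f_of_absx) {
    uint32_t xb, fb;
    std::memcpy(&xb, &x, sizeof(xb));            // bit-cast x
    std::memcpy(&fb, &f_of_absx, sizeof(fb));    // bit-cast f(|x|)
    uint32_t sign = xb & 0x80000000u;            // v_and: isolate sign bit of x
    uint32_t out  = fb | sign;                   // v_or: copy that sign onto f(|x|)
    float r;
    std::memcpy(&r, &out, sizeof(r));
    return r;
}
```

Since the sign bit is just ORed back in, no v_select() lane blending is needed.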

@fengyuentau
Member Author

is there such a dramatic difference really?

I will test on other platforms as well to see whether they can reach the same improvement.


At first I did not notice that erf(-x) = -erf(x) (indeed true, since erf is an odd function). Let me try your implementation.

@vpisarev
Contributor

vpisarev commented Mar 5, 2024

this is another implementation that follows yet another Abramowitz & Stegun approximation and also matches vectorized version from PyTorch (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_float.h#L187-L218):

inline float fast_erf(float x) {
   float sx = x >= 0 ? 1.f : -1.f;
   float t = 1.f/fmaf(fabsf(x), 0.3275911f, 1.f);
   float r = fmaf(1.061405429f, t, -1.453152027f);
   r = fmaf(r, t, 1.421413741f);
   r = fmaf(r, t, -0.284496736f);
   r = fmaf(r, t, 0.254829592f);
   return sx*(1.f - r*t*expf(-x*x));
}

I guess, it should be slower than the previous version that I suggested, but 1) it matches PyTorch and 2) it is more accurate, especially around 0.
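Combining this approximation with the exact identity GELU(x) = x/2 * (1 + erf(x/sqrt(2))) quoted earlier gives a scalar GELU kernel. A minimal sketch (`gelu_approx` is a hypothetical name for illustration, not the actual layer code):

```cpp
#include <cmath>

// fast_erf as suggested above (Abramowitz & Stegun style, matches PyTorch).
static inline float fast_erf(float x) {
    float sx = x >= 0 ? 1.f : -1.f;
    float t = 1.f / fmaf(fabsf(x), 0.3275911f, 1.f);
    float r = fmaf(1.061405429f, t, -1.453152027f);
    r = fmaf(r, t, 1.421413741f);
    r = fmaf(r, t, -0.284496736f);
    r = fmaf(r, t, 0.254829592f);
    return sx * (1.f - r * t * expf(-x * x));
}

// GELU(x) = x/2 * (1 + erf(x / sqrt(2))).
static inline float gelu_approx(float x) {
    const float reciprocal_sqrt2 = 0.70710678118f; // 1/sqrt(2)
    return 0.5f * x * (1.f + fast_erf(x * reciprocal_sqrt2));
}
```

The vectorized version follows the same structure, with the division replaced by a reciprocal or v_div and the sign handling done with bit masks as discussed above.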


@fengyuentau
Member Author

fengyuentau commented Mar 5, 2024

@vpisarev With accuracy issue resolved (hopefully), the updated performance results are

#####
# M1
#####
# Paddle
[ PERFSTAT ]    (samples=100   mean=0.13   median=0.13   min=0.11   stddev=0.02 (13.2%))
# Wiki
[ PERFSTAT ]    (samples=100   mean=0.07   median=0.07   min=0.06   stddev=0.01 (11.5%))
# PyTorch
[ PERFSTAT ]    (samples=100   mean=0.10   median=0.09   min=0.08   stddev=0.01 (14.3%))

#####
# i7-12700K
#####
# Paddle
[ PERFSTAT ]    (samples=100   mean=0.05   median=0.05   min=0.05   stddev=0.00 (0.4%))
# Wiki
[ PERFSTAT ]    (samples=100   mean=0.02   median=0.02   min=0.02   stddev=0.00 (1.5%))
# PyTorch
[ PERFSTAT ]    (samples=100   mean=0.04   median=0.04   min=0.04   stddev=0.00 (0.5%))

Note that the method from wiki does not need v_exp.

@vpisarev
Contributor

vpisarev commented Mar 5, 2024

@fengyuentau, thank you for the detailed experiments! What about accuracy (absolute and relative)? How far are those implementations from std::erf()?

@fengyuentau
Member Author

All the dnn accuracy tests are now green. I am not sure how to measure absolute and relative accuracy, though.

@vpisarev
Contributor

vpisarev commented Mar 5, 2024

ok, I found that the PyTorch and PaddlePaddle versions are very close to each other. PaddlePaddle is slightly more accurate, but the difference is small. Both are noticeably more accurate than the fastest 'exp-less' formula, especially around 0, but when you compute GELU, i.e. multiply by 'x*0.5' at the end, the drop in accuracy is not as noticeable.

May I suggest dropping the PaddlePaddle version to keep the source more compact? The 'wiki' version might be preserved, just in case, but I would use the 'PyTorch' approximation as the default option.

@vpisarev
Contributor

vpisarev commented Mar 5, 2024

Also, please rewrite the implementation using scalable universal intrinsics. That way we could get even better performance if we move those kernels into separate, dynamically dispatched source files and compile them with AVX2/AVX512/RVV.

@fengyuentau
Member Author

ok, I found that the PyTorch and PaddlePaddle versions are very close to each other. PaddlePaddle is slightly more accurate, but the difference is small. Both are noticeably more accurate than the fastest 'exp-less' formula, especially around 0, but when you compute GELU, i.e. multiply by 'x*0.5' at the end, the drop in accuracy is not as noticeable.

May I suggest dropping the PaddlePaddle version to keep the source more compact? The 'wiki' version might be preserved, just in case, but I would use the 'PyTorch' approximation as the default option.

Thank you for the test. Next time I will compare the accuracy of the implementations directly. Let's keep the paddle and wiki versions in the git history and keep only the pytorch one in the code.


Also, please rewrite the implementation using scalable universal intrinsics. That way we could get even better performance if we move those kernels into separate, dynamically dispatched source files and compile them with AVX2/AVX512/RVV.

I am not sure whether I am doing this the right way by only changing all v_setall_f32 to vx_setall_f32. I could not find a straightforward example of the usage of scalable intrinsics.

v_float32 half = vx_setall_f32(0.5f),
          one = vx_setall_f32(1.0f),
          reciprocal_sqrt2 = vx_setall_f32(M_SQRT1_2);
for (; i <= len - vlanes * 4; i += vlanes * 4) {
Contributor

It's quite a heavy function inside. 1) Does it make sense to unroll the loop by vlanes*4? 2) With such aggressive unrolling there is a risk of a very long tail, which will slow down performance significantly; I suggest making the unrolling less aggressive. 3) It's an element-wise operation. Often in NCHW the product of H and W may be quite an odd number, far from a power of two. You should process (cn1-cn0)*planeSize as a single 1D array, so that you have just one tail, not many tails.

Member Author

I tested with unrolling by 2 but did not observe significant difference.


You should process (cn1-cn0)*planeSize as a single 1D array, so that you have just one tail, not many tails

It is not completely correct. The loops and the parallelism look like the following:

for b in batch:
  for c in channel:
    for i in h * w: (this innermost loop is parallelized across threads)
      // ...

So for each thread the workload should be b * c * stripeSize (planeSize is the step to the next segment), which is (cn1 - cn0) * len in the terms used in the code.

This parallelism is used across all activations in the file.
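Whichever indexing is used, the single-tail processing can be sketched in scalar form (a hedged sketch: `gelu_flat` is a hypothetical name, the fixed unroll factor stands in for vlanes, and erff stands in for the vectorized erf approximation):

```cpp
#include <cmath>
#include <cstddef>

// Treat the thread's whole stripe as one flat 1D array so there is a single
// tail loop, rather than one tail per H*W plane. The "vector" body is modeled
// with a plain inner unroll; real code would process v_float32 lanes instead.
static void gelu_flat(const float* src, float* dst, size_t len) {
    const size_t unroll = 4;               // stand-in for vlanes
    size_t i = 0;
    for (; i + unroll <= len; i += unroll) {
        for (size_t j = 0; j < unroll; j++) {
            float x = src[i + j];
            dst[i + j] = 0.5f * x * (1.f + erff(x * 0.70710678118f));
        }
    }
    for (; i < len; i++) {                 // one scalar tail for the whole stripe
        float x = src[i];
        dst[i] = 0.5f * x * (1.f + erff(x * 0.70710678118f));
    }
}
```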

@fengyuentau fengyuentau changed the title dnn: improve speed of activation layers via vectorization for ViTs dnn: accelerate gelu via vectorized erf Mar 26, 2024
@asmorkalov asmorkalov modified the milestones: 4.10.0, 4.11.0 May 16, 2024
@fengyuentau fengyuentau force-pushed the dnn/elementwise_layers/speedup branch from 9c67f3a to 50ccf83 Compare July 3, 2024 07:33
@fengyuentau fengyuentau marked this pull request as ready for review July 3, 2024 07:34
@fengyuentau
Member Author

v_erf_approx is only used in gelu. Maybe we do not need to put it in universal intrinsics.

@fengyuentau
Member Author

v_erf_approx is only used in gelu. Maybe we do not need to put it in universal intrinsics.

Discussed and decided to put v_erf in universal intrinsics.

@fengyuentau fengyuentau force-pushed the dnn/elementwise_layers/speedup branch from 6cd8ef5 to 25ba16a Compare July 5, 2024 02:55
@fengyuentau fengyuentau mentioned this pull request Jul 5, 2024
6 tasks
@asmorkalov
Contributor

asmorkalov commented Jul 5, 2024

The optimization totally makes sense!
Jetson TK1 (ARMv7+NEON):

elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/CPU)                                                                               6.855      2.593      2.64 

Core i5-2500K (AVX, no AVX2):

elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/CPU)                                                                               1.256      0.434      2.89   
elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/OCL)                                                                               0.265      0.258      1.03   
elementwise::Layer_Elementwise::({ 1, 50, 3072 }, "Gelu", OCV/OCL_FP16)                                                                          0.261      0.257      1.02  

asmorkalov pushed a commit that referenced this pull request Jul 5, 2024
core: add v_erf #25872

This patch adds v_erf, which is needed by #25147.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
@asmorkalov
Contributor

@fengyuentau please rebase and switch to new v_erf. Manual port to 5.x is required too.

@fengyuentau fengyuentau force-pushed the dnn/elementwise_layers/speedup branch from 25ba16a to 7c5df99 Compare July 6, 2024 15:55
@asmorkalov asmorkalov self-requested a review July 7, 2024 08:44
@asmorkalov
Contributor

@vpisarev please take a look again.

Contributor

@asmorkalov asmorkalov left a comment

👍

@vpisarev
Contributor

vpisarev commented Jul 8, 2024

@fengyuentau, let's delay integration of this PR a bit while @WanliZhong will add vectorized v_erf(). What do you think?

@fengyuentau
Member Author

@fengyuentau, let's delay integration of this PR a bit while @WanliZhong will add vectorized v_erf(). What do you think?

It has been done in #25872.

@vpisarev
Contributor

vpisarev commented Jul 8, 2024

cool!

@vpisarev vpisarev requested review from asmorkalov and vpisarev July 8, 2024 09:02
@asmorkalov asmorkalov merged commit e3858cc into opencv:4.x Jul 8, 2024
@fengyuentau fengyuentau deleted the dnn/elementwise_layers/speedup branch July 9, 2024 07:01
@asmorkalov asmorkalov mentioned this pull request Jul 16, 2024