@WanliZhong (Member) commented Apr 23, 2023

This PR fixes #23278.

The current implementation is temporary. I will try to add CUDA support for more eltwise broadcast cases.
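
For readers unfamiliar with the 'abcd op 1b11' notation used in the PR title: it describes a 4-D tensor of shape (a, b, c, d) combined element-wise with a tensor of shape (1, b, 1, 1), so every element of a channel shares one value of the second operand. The sketch below is plain Python, not the CUDA code from this PR; the function name `eltwise_abcd_1b11` and the flat row-major layout are assumptions made for illustration.

```python
import operator

def eltwise_abcd_1b11(x, y, shape, op=operator.mul):
    """Apply `op` between a flat row-major tensor `x` of shape (a, b, c, d)
    and a per-channel tensor `y` of shape (1, b, 1, 1), stored as b values."""
    a, b, c, d = shape
    assert len(x) == a * b * c * d and len(y) == b
    inner = c * d  # number of consecutive elements that share one channel value
    out = [0.0] * len(x)
    # Every output element depends only on its own index, so a GPU kernel
    # could plausibly assign one thread per element (assumption, not the PR's code).
    for i in range(len(x)):
        channel = (i // inner) % b
        out[i] = op(x[i], y[channel])
    return out

# Shape (1, 2, 1, 3): channel 0 is scaled by 10, channel 1 by 100.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
slopes = [10.0, 100.0]
result = eltwise_abcd_1b11(x, slopes, (1, 2, 1, 3))
print(result)
```

Passing `operator.add` instead of the default `operator.mul` gives the same broadcast pattern for addition.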

With this change, the model's inference time drops from 26.7651 ms to 17.8416 ms.

Perf test results

Run this command to generate the results:

.\bin\opencv_perf_dnn.exe '--gtest_filter=CUDA/Layer_NaryEltwise.*/*:CUDA/Layer_NaryEltwise/*.*' --gtest_output=xml:../tmp/1th.xml --perf_threads=1

Use this command to generate the summary:

python ../modules/ts/misc/summary.py -m min 1th.xml 0th.xml -o markdown

Results:

| Name of Test | 0th | 1th | 1th vs 0th (x-factor) |
|---|---|---|---|
| NHWC_H::CUDA/Layer_NaryEltwise::CUDA/CUDA | 39.003 (fallback to cpu) | 17.936 | 2.17 |
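
As a quick sanity check on the reported figures, the sketch below recomputes the speedups. It assumes the x-factor column is simply the ratio of the 0th timing to the 1th timing, which matches the numbers shown; the helper name `x_factor` is made up for this example.

```python
def x_factor(baseline, patched):
    """Ratio of baseline time to patched time; values above 1 mean the patch is faster."""
    return baseline / patched

perf_test = x_factor(39.003, 17.936)      # 0th vs 1th from the perf test summary
whole_model = x_factor(26.7651, 17.8416)  # model inference time before vs after

print(f"perf test: {perf_test:.2f}x, whole model: {whole_model:.2f}x")
```

The whole-model gain is smaller than the perf-test gain, which is expected if only some of the model's layers are affected by the change.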

Layer-by-layer data:

  • before the fix
    onnx_node!ResNet18/0_conv/Conv2D   0.1515ms
    onnx_node!ResNet18/0_PReLU/Relu   0.0193ms
    onnx_node!ResNet18/0_PReLU/Neg_1   0.0145ms
    onnx_node!ResNet18/0_PReLU/Relu_1   0.0121ms
    ResNet18/0_PReLU/Neg:0   0.0167ms
    onnx_node!ResNet18/0_PReLU/mul   2.071ms
    onnx_node!ResNet18/0_PReLU/add   0.0585ms
    onnx_node!ResNet18/stack1_block1_shortcut_conv/Conv2D   0.1643ms
    onnx_node!ResNet18/stack1_block1_1_bn/FusedBatchNormV3   0.0166ms
    onnx_node!ResNet18/stack1_block1_1_conv/Conv2D   0.1179ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu   0.0192ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Neg_1   0.0114ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu_1   0.0095ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/mul   4.0522ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/add   0.0857ms
    onnx_node!ResNet18/stack1_block1_2_conv/Conv2D   0.1803ms
    onnx_node!ResNet18/stack1_block2_1_bn/FusedBatchNormV3   0.013ms
    onnx_node!ResNet18/stack1_block2_1_conv/Conv2D   0.0533ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu   0.0145ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Neg_1   0.0116ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu_1   0.0093ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/mul   2.346ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/add   0.0483ms
    onnx_node!ResNet18/stack1_block2_2_conv/Conv2D   0.0748ms
    onnx_node!ResNet18/stack2_block1_shortcut_conv/Conv2D   0.1015ms
    onnx_node!ResNet18/stack2_block1_1_bn/FusedBatchNormV3   0.0135ms
    onnx_node!ResNet18/stack2_block1_1_conv/Conv2D   0.0639ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu   0.0161ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Neg_1   0.0137ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu_1   0.0133ms
    ResNet18/stack2_block2_2_PReLU/Neg:0   0.0177ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/mul   2.7318ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/add   0.0643ms
    onnx_node!ResNet18/stack2_block1_2_conv/Conv2D   0.1083ms
    onnx_node!ResNet18/stack2_block2_1_bn/FusedBatchNormV3   0.0139ms
    onnx_node!ResNet18/stack2_block2_1_conv/Conv2D   0.0496ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu   0.0147ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Neg_1   0.0115ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu_1   0.0096ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/mul   1.79ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/add   0.045ms
    onnx_node!ResNet18/stack2_block2_2_conv/Conv2D   0.0701ms
    onnx_node!ResNet18/stack3_block1_shortcut_conv/Conv2D   0.0776ms
    onnx_node!ResNet18/stack3_block1_1_bn/FusedBatchNormV3   0.016ms
    onnx_node!ResNet18/stack3_block1_1_conv/Conv2D   0.0479ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu   0.0159ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Neg_1   0.0135ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu_1   0.0121ms
    ResNet18/stack3_block2_2_PReLU/Neg:0   0.0173ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/mul   2.1251ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/add   0.043ms
    onnx_node!ResNet18/stack3_block1_2_conv/Conv2D   0.0793ms
    onnx_node!ResNet18/stack3_block2_1_bn/FusedBatchNormV3   0.012ms
    onnx_node!ResNet18/stack3_block2_1_conv/Conv2D   0.0458ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu   0.0133ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Neg_1   0.0106ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu_1   0.0091ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/mul   1.7766ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/add   0.043ms
    onnx_node!ResNet18/stack3_block2_2_conv/Conv2D   0.0751ms
    onnx_node!ResNet18/stack4_block1_shortcut_conv/Conv2D   0.0758ms
    onnx_node!ResNet18/stack4_block1_1_bn/FusedBatchNormV3   0.0153ms
    onnx_node!ResNet18/stack4_block1_1_conv/Conv2D   0.048ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu   0.0151ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Neg_1   0.013ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu_1   0.012ms
    ResNet18/stack4_block1_2_PReLU/Neg:0   0.0176ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/mul   2.1163ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/add   0.0396ms
    onnx_node!ResNet18/stack4_block1_2_conv/Conv2D   0.0751ms
    onnx_node!ResNet18/stack4_block2_1_bn/FusedBatchNormV3   0.0121ms
    onnx_node!ResNet18/stack4_block2_1_conv/Conv2D   0.0485ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu   0.0158ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Neg_1   0.013ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu_1   0.0121ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/mul   2.0351ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/add   0.037ms
    onnx_node!ResNet18/stack4_block2_2_conv/Conv2D   0.072ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3   0.0142ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3__210   0.0169ms
    onnx_node!ResNet18/E_flatten/Reshape   0.0014ms
    onnx_node!ResNet18/E_dense/MatMul   0.0445ms
    ResNet18/E_batchnorm/ReadVariableOp_1:0   0.0165ms
    onnx_node!ResNet18/pre_embedding/batchnorm/mul_1   0.0156ms
    embedding   0.001ms
  • after the fix
    onnx_node!ResNet18/0_conv/Conv2D   0.255ms
    onnx_node!ResNet18/0_PReLU/Relu   0.0309ms
    onnx_node!ResNet18/0_PReLU/Neg_1   0.0181ms
    onnx_node!ResNet18/0_PReLU/Relu_1   0.0147ms
    ResNet18/0_PReLU/Neg:0   0.0539ms
    onnx_node!ResNet18/0_PReLU/mul   0.0276ms
    onnx_node!ResNet18/0_PReLU/add   0.018ms
    onnx_node!ResNet18/stack1_block1_shortcut_conv/Conv2D   0.1718ms
    onnx_node!ResNet18/stack1_block1_1_bn/FusedBatchNormV3   0.0215ms
    onnx_node!ResNet18/stack1_block1_1_conv/Conv2D   0.1762ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu   0.0201ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Neg_1   0.0156ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/Relu_1   0.0142ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/mul   0.0199ms
    onnx_node!ResNet18/stack1_block1_2_PReLU/add   0.0478ms
    onnx_node!ResNet18/stack1_block1_2_conv/Conv2D   0.1198ms
    onnx_node!ResNet18/stack1_block2_1_bn/FusedBatchNormV3   0.0139ms
    onnx_node!ResNet18/stack1_block2_1_conv/Conv2D   0.2334ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu   0.0244ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Neg_1   0.0238ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/Relu_1   0.0196ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/mul   0.0256ms
    onnx_node!ResNet18/stack1_block2_2_PReLU/add   0.0204ms
    onnx_node!ResNet18/stack1_block2_2_conv/Conv2D   0.1101ms
    onnx_node!ResNet18/stack2_block1_shortcut_conv/Conv2D   0.1641ms
    onnx_node!ResNet18/stack2_block1_1_bn/FusedBatchNormV3   0.0296ms
    onnx_node!ResNet18/stack2_block1_1_conv/Conv2D   0.0867ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu   0.0253ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Neg_1   0.0223ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/Relu_1   0.0208ms
    ResNet18/stack2_block2_2_PReLU/Neg:0   0.0337ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/mul   0.0334ms
    onnx_node!ResNet18/stack2_block1_2_PReLU/add   0.0306ms
    onnx_node!ResNet18/stack2_block1_2_conv/Conv2D   0.1605ms
    onnx_node!ResNet18/stack2_block2_1_bn/FusedBatchNormV3   0.0266ms
    onnx_node!ResNet18/stack2_block2_1_conv/Conv2D   0.0904ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu   0.0712ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Neg_1   0.0305ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/Relu_1   0.0237ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/mul   0.0299ms
    onnx_node!ResNet18/stack2_block2_2_PReLU/add   0.0257ms
    onnx_node!ResNet18/stack2_block2_2_conv/Conv2D   0.1648ms
    onnx_node!ResNet18/stack3_block1_shortcut_conv/Conv2D   0.147ms
    onnx_node!ResNet18/stack3_block1_1_bn/FusedBatchNormV3   0.0269ms
    onnx_node!ResNet18/stack3_block1_1_conv/Conv2D   0.0805ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu   0.0274ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Neg_1   0.0214ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/Relu_1   0.0969ms
    ResNet18/stack3_block2_2_PReLU/Neg:0   0.03ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/mul   0.0272ms
    onnx_node!ResNet18/stack3_block1_2_PReLU/add   0.0247ms
    onnx_node!ResNet18/stack3_block1_2_conv/Conv2D   0.1316ms
    onnx_node!ResNet18/stack3_block2_1_bn/FusedBatchNormV3   0.0241ms
    onnx_node!ResNet18/stack3_block2_1_conv/Conv2D   0.0792ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu   0.0259ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Neg_1   0.0213ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/Relu_1   0.0962ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/mul   0.0633ms
    onnx_node!ResNet18/stack3_block2_2_PReLU/add   0.0246ms
    onnx_node!ResNet18/stack3_block2_2_conv/Conv2D   0.1131ms
    onnx_node!ResNet18/stack4_block1_shortcut_conv/Conv2D   0.1028ms
    onnx_node!ResNet18/stack4_block1_1_bn/FusedBatchNormV3   0.0273ms
    onnx_node!ResNet18/stack4_block1_1_conv/Conv2D   0.0834ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu   0.031ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Neg_1   0.1031ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/Relu_1   0.0858ms
    ResNet18/stack4_block1_2_PReLU/Neg:0   0.032ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/mul   0.0333ms
    onnx_node!ResNet18/stack4_block1_2_PReLU/add   0.0229ms
    onnx_node!ResNet18/stack4_block1_2_conv/Conv2D   0.1609ms
    onnx_node!ResNet18/stack4_block2_1_bn/FusedBatchNormV3   0.0336ms
    onnx_node!ResNet18/stack4_block2_1_conv/Conv2D   0.0869ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu   0.0314ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Neg_1   0.0235ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/Relu_1   0.0236ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/mul   0.0368ms
    onnx_node!ResNet18/stack4_block2_2_PReLU/add   0.0234ms
    onnx_node!ResNet18/stack4_block2_2_conv/Conv2D   0.1913ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3   0.0269ms
    onnx_node!ResNet18/E_batchnorm/FusedBatchNormV3__210   0.0234ms
    onnx_node!ResNet18/E_flatten/Reshape   0.0016ms
    onnx_node!ResNet18/E_dense/MatMul   0.1472ms
    ResNet18/E_batchnorm/ReadVariableOp_1:0   0.0635ms
    onnx_node!ResNet18/pre_embedding/batchnorm/mul_1   0.0692ms
    embedding   0.0019ms
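
The two listings are long, so it may help to total the layers the fix targets. The sketch below sums the nine PReLU `mul` timings copied from each listing. Per-layer timings from a single run are noisy (several unrelated layers are slower in the second listing), so treat the totals as a rough indication only.

```python
# Timings (ms) of the nine PReLU "mul" layers, copied from the two listings above.
mul_before = [2.071, 4.0522, 2.346, 2.7318, 1.79, 2.1251, 1.7766, 2.1163, 2.0351]
mul_after = [0.0276, 0.0199, 0.0256, 0.0334, 0.0299, 0.0272, 0.0633, 0.0333, 0.0368]

total_before = sum(mul_before)
total_after = sum(mul_after)

print(f"mul layers before: {total_before:.4f} ms")
print(f"mul layers after:  {total_after:.4f} ms")
print(f"time saved on mul: {total_before - total_after:.4f} ms")
```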

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@WanliZhong WanliZhong added the bug, category: gpu/cuda (contrib) (OpenCV 4.0+: moved to opencv_contrib), and category: dnn labels Apr 23, 2023
@WanliZhong WanliZhong added this to the 4.8.0 milestone Apr 23, 2023
@WanliZhong WanliZhong requested a review from zihaomu April 23, 2023 09:53
@WanliZhong WanliZhong self-assigned this Apr 23, 2023
@WanliZhong WanliZhong changed the title from "make 'abcd op 1b11' broadcast support cuda" to "DNN/CUDA: make 'abcd op 1b11' broadcast eltwise operator support cuda" Apr 23, 2023
@asmorkalov (Contributor)

@WanliZhong If you obtained the results with the OpenCV perf tests, you can use modules/ts/misc/summary.py to generate an accurate performance comparison report. Just run the tests before and after the patch with `--gtest_output=xml:<xml_file_name>`, then run the script on the two or more reports.
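
The workflow described above can be sketched as a small Python helper that only builds the command lines. The flags are the ones that appear in this conversation; the binary path `./bin/opencv_perf_dnn` and the report names `before.xml` and `after.xml` are hypothetical placeholders.

```python
def perf_command(binary, gtest_filter, xml_path, threads=1):
    """Build the argument list for one perf-test run that writes an XML report."""
    return [
        binary,
        f"--gtest_filter={gtest_filter}",
        f"--gtest_output=xml:{xml_path}",
        f"--perf_threads={threads}",
    ]

def summary_command(reports, metric="min", output="markdown"):
    """Build the argument list that compares two or more XML reports."""
    return ["python", "../modules/ts/misc/summary.py", "-m", metric, *reports, "-o", output]

# Hypothetical file names: run once on the unpatched build, once on the patched build.
test_filter = "CUDA/Layer_NaryEltwise.*/*:CUDA/Layer_NaryEltwise/*.*"
before = perf_command("./bin/opencv_perf_dnn", test_filter, "before.xml")
after = perf_command("./bin/opencv_perf_dnn", test_filter, "after.xml")
report = summary_command(["after.xml", "before.xml"])

for cmd in (before, after, report):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` as-is, which avoids having to quote the `*` characters in the filter for a shell.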

@zihaomu (Member) commented Apr 24, 2023

Hi @asmorkalov, thanks for your reminder. I will tell Wanli how to do this performance test.

@WanliZhong (Member, Author)

Thanks, @asmorkalov. I have updated the summary results.

@zihaomu (Member) left a comment

LGTM! 👍

@asmorkalov asmorkalov merged commit e3e1f70 into opencv:4.x Apr 24, 2023
@WanliZhong WanliZhong deleted the issue23278 branch May 16, 2023 12:33
@asmorkalov asmorkalov mentioned this pull request May 31, 2023
Development

Successfully merging this pull request may close these issues.

Performance Loss from OpenCV 4.5.5 to 4.7.0 using CUDA backend
