@fengyuentau fengyuentau commented May 23, 2024

This PR introduces the following changes:

  • Parallelize binary forward impl
  • Parallelize ternary forward impl (Where)
  • Parallelize nary (Operator that can take >=1 operands)
  • Enable conformance tests if workable
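
As a rough illustration of the first bullet, the sketch below splits an elementwise binary operation across independent planes. It is not the code from this PR: the PR uses `parallel_for_` (see the review snippet further down), while `parallelFor` and `binaryForward` here are hypothetical helpers built on `std::thread` so the example stays self-contained. The layout of `nplanes` contiguous planes of `planeSize` floats is likewise an assumption made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not an OpenCV API): calls body(begin, end) on
// contiguous chunks of [0, total), using at most `nstripes` threads.
inline void parallelFor(std::size_t total, std::size_t nstripes,
                        const std::function<void(std::size_t, std::size_t)>& body)
{
    nstripes = std::max<std::size_t>(1, std::min(nstripes, total));
    if (nstripes == 1) { body(0, total); return; }  // serial fallback

    const std::size_t chunk = (total + nstripes - 1) / nstripes;  // ceil division
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < total; begin += chunk)
    {
        const std::size_t end = std::min(total, begin + chunk);
        workers.emplace_back([&body, begin, end] { body(begin, end); });
    }
    for (std::thread& t : workers) t.join();
}

// Applies `op` elementwise to two same-shaped tensors stored as `nplanes`
// contiguous planes of `planeSize` floats. Planes are independent, so each
// stripe processes its own range of planes and writes disjoint outputs.
template <typename Op>
std::vector<float> binaryForward(const std::vector<float>& a, const std::vector<float>& b,
                                 std::size_t nplanes, std::size_t planeSize,
                                 Op op, std::size_t nstripes)
{
    std::vector<float> out(nplanes * planeSize);
    parallelFor(nplanes, nstripes, [&](std::size_t p0, std::size_t p1) {
        for (std::size_t p = p0; p < p1; ++p)
            for (std::size_t i = 0; i < planeSize; ++i)
            {
                const std::size_t idx = p * planeSize + i;
                out[idx] = op(a[idx], b[idx]);
            }
    });
    return out;
}
```

A call such as `binaryForward(a, b, 3, 2, std::plus<float>(), 2)` adds two 3x2 tensors using two stripes. A ternary operator such as Where could follow the same pattern with a third input, though broadcasting is left out of this sketch.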

Performance

i7-12700K, RAM 64GB, Ubuntu 22.04

Geometric mean (ms)

                Name of Test                     opencv        opencv        opencv
                                                  perf          perf          perf
                                              core.x64.0606 core.x64.0606 core.x64.0606
                                                                               vs
                                                                             opencv
                                                                              perf
                                                                          core.x64.0606
                                                                           (x-factor)
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           16.116        11.161         1.44
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        17.469        11.446         1.53
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        17.531        11.469         1.53
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      28.653        13.682         2.09
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    21.899        13.422         1.63
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       21.738        13.185         1.65
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        16.172        11.473         1.41
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       16.309        11.565         1.41
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        16.166        11.454         1.41
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        16.157        11.443         1.41
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU        163.459       15.234         10.73
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    10.880        10.868         1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    10.947        11.058         0.99
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    10.948        10.910         1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    10.874        10.871         1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    10.971        10.920         1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        17.546        11.462         1.53
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        16.175        11.475         1.41
NHWC_C::Layer_NaryEltwise::OCV/CPU               11.339        11.333         1.00
NHWC_H::Layer_NaryEltwise::OCV/CPU               16.154        11.102         1.46
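
For readers unfamiliar with the columns, the sketch below shows how a geometric mean and an x-factor can be computed. In the rows above, the x-factor matches the first timing column divided by the second (for example 16.116 / 11.161 ≈ 1.44), so values above 1 mean the second column is faster. The function names are made up for this example and are not part of the perf tooling.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of positive timing samples, computed through the mean of
// logarithms so that multiplying many values cannot overflow.
inline double geometricMean(const std::vector<double>& samplesMs)
{
    if (samplesMs.empty()) return 0.0;
    double logSum = 0.0;
    for (double s : samplesMs) logSum += std::log(s);
    return std::exp(logSum / static_cast<double>(samplesMs.size()));
}

// x-factor as it appears in the tables: first timing divided by second timing.
inline double xFactor(double firstMs, double secondMs)
{
    return firstMs / secondMs;
}
```

For instance, `xFactor(163.459, 15.234)` reproduces the 10.73 reported for `NCHW_NCHW_pow` after rounding to two decimals.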

Apple M1, RAM 16GB, macOS 14.4.1

Geometric mean (ms)

                Name of Test                     opencv          opencv             opencv      
                                                  perf            perf               perf       
                                              core.m1.0606 core.m1.0606.patch core.m1.0606.patch
                                                                                      vs        
                                                                                    opencv      
                                                                                     perf       
                                                                                 core.m1.0606   
                                                                                  (x-factor)    
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           28.418          3.768               7.54       
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        6.942           5.679               1.22       
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        5.822           5.653               1.03       
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      5.751           5.628               1.02       
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    5.797           5.599               1.04       
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       7.272           5.578               1.30       
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        5.777           5.562               1.04       
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       5.819           5.559               1.05       
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        5.830           5.574               1.05       
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        5.759           5.567               1.03       
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       342.260          74.655              4.58       
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    8.338           8.280               1.01       
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    8.359           8.309               1.01       
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    8.412           8.295               1.01       
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    8.380           8.297               1.01       
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    8.356           8.323               1.00       
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        6.818           5.561               1.23       
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        5.805           5.570               1.04       
NHWC_C::Layer_NaryEltwise::OCV/CPU               3.834           4.817               0.80       
NHWC_H::Layer_NaryEltwise::OCV/CPU               28.402          3.771               7.53

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Comment on lines 690 to 706
double nstripes = getNumThreads();
parallel_for_(Range(0, nplanes), worker, nstripes);
Contributor

nstripes = getNumThreads();

This should not be used.
Already discussed several months ago - e.g. #23047
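
One possible alternative to tying the stripe count to the thread count is to derive it from the amount of work. The sketch below is only an assumption about what such a rule could look like; it is not taken from this PR or from #23047. `stripesForWork` and the grain size are hypothetical. It returns a `double` because the `nstripes` argument of `parallel_for_` is a `double`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: number of stripes for `totalElems` elements when each
// stripe should cover roughly `grainElems` elements. Small inputs get one
// stripe (effectively serial); the count grows with the workload and does
// not depend on how many threads the machine has.
inline double stripesForWork(std::size_t totalElems, std::size_t grainElems)
{
    if (grainElems == 0) grainElems = 1;  // guard against division by zero
    const std::size_t n = (totalElems + grainElems - 1) / grainElems;  // ceil division
    return static_cast<double>(std::max<std::size_t>(1, n));
}
```

With a grain of 4096 elements, an input of 1000 elements yields one stripe and an input of 8193 elements yields three. The result could then be passed as the third argument of `parallel_for_` in place of `getNumThreads()`.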

Member Author

Thank you for the review, but take it easy: this PR is still a draft. I still remember our discussion.

Member Author

Changed. Performance results are also updated.

@fengyuentau fengyuentau added this to the 4.11.0 milestone Jun 3, 2024
@fengyuentau fengyuentau marked this pull request as ready for review June 6, 2024 10:07
@fengyuentau fengyuentau requested a review from dkurt June 7, 2024 04:34
@asmorkalov
Contributor

My results on a Jetson TK1 (ARMv7 + NEON):

ubuntu@jetson1:~/Projects/perf-dnn$ python3 ../opencv/modules/ts/misc/summary.py ./4.x-1.xml ./patched-1.xml | grep NaryEltwise
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU                                                                                                          65.891   43.371      1.52   
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU                                                                                                       79.287   81.868      0.97   
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU                                                                                                      187.457   187.657     1.00   
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU                                                                                                     88.643   96.376      0.92   
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU                                                                                                   88.694   96.035      0.92   
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU                                                                                                      88.716   90.298      0.98   
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU                                                                                                       84.722   83.976      1.01   
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU                                                                                                      92.757   81.105      1.14   
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU                                                                                                       84.285   84.010      1.00   
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU                                                                                                       78.594   78.574      1.00   
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU                                                                                                      3407.037 3475.724     0.98   
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU                                                                                                  189.651   189.454     1.00   
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU                                                                                                   87.859   87.771      1.00   
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU                                                                                                   87.915   88.053      1.00   
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU                                                                                                   84.077   84.063      1.00   
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU                                                                                                   85.160   84.625      1.01   
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU                                                                                                       86.368   79.089      1.09   
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU                                                                                                       89.897   78.993      1.14   
NHWC_C::Layer_NaryEltwise::OCV/CPU                                                                                                              77.220   71.425      1.08   
NHWC_H::Layer_NaryEltwise::OCV/CPU                                                                                                              67.494   42.832      1.58

@asmorkalov
Contributor

asmorkalov commented Jun 11, 2024

My results for Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz (no AVX2):

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU                                                                                                          24.193   17.846      1.36   
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU                                                                                                       24.026   23.313      1.03   
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU                                                                                                       27.370   23.279      1.18   
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU                                                                                                     35.025   23.254      1.51   
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU                                                                                                   32.455   23.260      1.40   
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU                                                                                                      32.509   23.321      1.39   
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU                                                                                                       23.997   23.262      1.03   
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU                                                                                                      24.038   23.270      1.03   
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU                                                                                                       23.977   23.269      1.03   
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU                                                                                                       23.927   23.279      1.03   
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU                                                                                                      320.598   98.029      3.27   
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU                                                                                                   24.507   24.488      1.00   
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU                                                                                                   24.484   24.477      1.00   
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU                                                                                                   24.500   24.471      1.00   
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU                                                                                                   24.486   24.482      1.00   
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU                                                                                                   24.472   24.476      1.00   
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU                                                                                                       23.953   23.281      1.03   
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU                                                                                                       23.992   23.274      1.03   
NHWC_C::Layer_NaryEltwise::OCV/CPU                                                                                                              18.260   18.489      0.99   
NHWC_H::Layer_NaryEltwise::OCV/CPU                                                                                                              24.182   17.829      1.36

@fengyuentau
Member Author

Thank you @asmorkalov for adding more performance results :)

@fengyuentau
Member Author

Any review comments?

@fengyuentau fengyuentau changed the title from "dnn: parallelize nary elementwise forward implementation" to "dnn: parallelize nary elementwise forward implementation & enable related conformance tests" Jun 14, 2024
@asmorkalov
Contributor

The patch leads to a significant degradation of OpenCL pipelines, e.g.:

VIT_B_32::DNNTestNetwork::OCV/CPU 	149.576 	191.409 	0.78
VIT_B_32::DNNTestNetwork::OCV/OCL 	104.428 	445.013 	0.23
VIT_B_32::DNNTestNetwork::OCV/OCL_FP16 	102.505 	442.994 	0.23 

I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some inference optimization.
I am looking into whether it is really caused by the PR.

@fengyuentau
Member Author

> The patch leads to a significant degradation of OpenCL pipelines, e.g.:
>
> VIT_B_32::DNNTestNetwork::OCV/CPU 	149.576 	191.409 	0.78
> VIT_B_32::DNNTestNetwork::OCV/OCL 	104.428 	445.013 	0.23
> VIT_B_32::DNNTestNetwork::OCV/OCL_FP16 	102.505 	442.994 	0.23
>
> I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some inference optimization. I am looking into whether it is really caused by the PR.

Ok, I will take a look at the problem.

@fengyuentau
Member Author

fengyuentau commented Jun 24, 2024

@asmorkalov The performance "degradation" is due to a very out-of-date code base (>450 commits behind 4.x). I have updated the code base. Performance tests (on Intel UHD 770) look fine on my side. Feel free to retest on your side.


On the positive side, those commits brought a large performance boost (OCL is ~4x faster and CPU is ~1.3x faster). Maybe I can add the OCL backend for this layer later :)

@vpisarev vpisarev self-requested a review June 27, 2024 21:29
@asmorkalov
Contributor

asmorkalov commented Jun 28, 2024

perf-dnn.zip
The OpenCL-related degradation has disappeared. Perf numbers for the updated PR on Core i5-2500:

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU 	24.142 	17.999 	1.34
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU 	23.860 	23.265 	1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU 	27.383 	23.282 	1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU 	39.056 	23.292 	1.68
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU 	32.489 	23.290 	1.39
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU 	32.435 	23.257 	1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU 	23.966 	23.269 	1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU 	23.992 	23.276 	1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU 	23.951 	23.273 	1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU 	23.862 	23.272 	1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU 	320.265 	97.879 	3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU 	24.491 	24.487 	1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU 	24.463 	24.464 	1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU 	24.472 	24.465 	1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU 	24.460 	24.453 	1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU 	24.463 	24.530 	1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU 	23.870 	23.271 	1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU 	23.964 	23.764 	1.01
NHWC_C::Layer_NaryEltwise::OCV/CPU 	18.083 	18.458 	0.98
NHWC_H::Layer_NaryEltwise::OCV/CPU 	24.140 	17.857 	1.35 

@asmorkalov
Contributor

I also tried a Xiaomi Mi 10 phone. The results are volatile (maybe due to power management), but I do not see a significant performance gain except for NCHW_C_sum and NCHW_NCHW_pow.
perf-dnn-xiaomi-mi10.zip

@fengyuentau
Member Author

> The results are volatile (maybe due to power management), but I do not see a significant performance gain

It is tuned to use multi-threading only if the input is large enough. Traditional convolutional nets do not have such large inputs for elementwise layers.

@asmorkalov asmorkalov merged commit a7fd944 into opencv:4.x Jul 3, 2024
@fengyuentau fengyuentau mentioned this pull request Jul 12, 2024
6 tasks
@asmorkalov asmorkalov added the "port/backport done" label (for maintainers; PR authors can ignore it) Jul 12, 2024
asmorkalov pushed a commit that referenced this pull request Jul 15, 2024
…_thread

dnn: merge #25630 to 5.x #25900

Sync changes from #25630 to 5.x.

@asmorkalov asmorkalov mentioned this pull request Jul 16, 2024
@fengyuentau fengyuentau deleted the nary-multi-thread branch July 30, 2024 15:06