Haosonn
Contributor

@Haosonn Haosonn commented Dec 25, 2023

This PR improves the Vulkan backend for NaryEltwiseLayer in the DNN module by:

  • adding a basic framework for the Vulkan backend in NaryEltwiseLayer
  • adding a compute shader for binary forwarding (modeled on what the native OpenCV backend does, including broadcasting and element-wise operations)
  • fixing a typo:
    • wrong info output in context.cpp
Currently, our implementation (like every layer that supports the Vulkan backend) runs quite slowly on discrete GPUs, mainly because of the I/O cost of the copyToHost function. We plan to fix that by:

  • finding the best VkMemoryProperty for various discrete GPUs

  • avoiding copyToHost in intermediate layers during forwarding (i.e., keeping data in GPU memory)
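
The first item can be pictured as a search over the memory types a device reports. The sketch below is one plausible policy, not what this PR implements: the constants are local stand-ins that mirror the values of Vulkan's `VkMemoryPropertyFlagBits` so the code compiles without the Vulkan headers, and `findMemoryType` and `pickReadbackType` are hypothetical names. It assumes that a buffer read back by the CPU benefits from host-cached memory when the device offers it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-ins mirroring the values of Vulkan's VkMemoryPropertyFlagBits.
enum : uint32_t {
    DEVICE_LOCAL  = 0x1,
    HOST_VISIBLE  = 0x2,
    HOST_COHERENT = 0x4,
    HOST_CACHED   = 0x8,
};

// Returns the index of the first memory type that is allowed by `typeBits`
// (bit i set means type i is usable for the resource) and has every flag in
// `required`, or -1 if none qualifies.
int findMemoryType(const std::vector<uint32_t>& typeFlags, uint32_t typeBits, uint32_t required)
{
    for (size_t i = 0; i < typeFlags.size() && i < 32; ++i) {
        const bool allowed = ((typeBits >> i) & 1u) != 0;
        if (allowed && (typeFlags[i] & required) == required)
            return static_cast<int>(i);
    }
    return -1;
}

// Hypothetical policy for a buffer the CPU reads back: try host-cached memory
// first, then fall back to plain host-coherent memory.
int pickReadbackType(const std::vector<uint32_t>& typeFlags, uint32_t typeBits)
{
    const uint32_t preferences[] = {
        HOST_VISIBLE | HOST_CACHED,
        HOST_VISIBLE | HOST_COHERENT,
    };
    for (uint32_t want : preferences) {
        const int idx = findMemoryType(typeFlags, typeBits, want);
        if (idx >= 0)
            return idx;
    }
    return -1;
}
```

With the real API, `typeFlags` would come from `vkGetPhysicalDeviceMemoryProperties` and `typeBits` from `vkGetBufferMemoryRequirements`. Which preference order is actually best on a given discrete GPU is the open question the item above refers to.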

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@zihaomu
Member

zihaomu commented Dec 26, 2023

Hi @Haosonn, thanks for your contribution!

> Currently, our implementation (like every layer that supports the Vulkan backend) runs quite slowly on discrete GPUs, mainly because of the I/O cost of the copyToHost function.

Yes. In my previous Vulkan patch, I focused only on integrated graphics. Our Vulkan backend still needs a lot of optimization. In my opinion, the first priority is supporting more layers, so that we can reduce the number of copyToHost calls. Optimization for discrete GPUs can be done at a lower priority, for two reasons: 1. we already have the CUDA backend for discrete GPUs; 2. fast execution on discrete GPUs needs a full VkImage pipeline, which is more complicated than VkBuffer.

> avoiding copyToHost in intermediate layers during forwarding (i.e., keeping data in GPU memory)

That is hard to do, because we cannot predict whether the layer after NaryEltwiseLayer is supported by Vulkan. Some frameworks use faster transfer strategies: MNN's Vulkan backend, for example, has two implementations, VkBuffer and VkImage, and the VkImage one is much faster for GPU-CPU data transfer.
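
The link between layer coverage and copyToHost calls can be illustrated with a small counting model. This is a hypothetical sketch, not OpenCV's scheduler: it assumes a linear chain of layers, that data could stay on the GPU between two consecutive Vulkan layers, and that a device-to-host copy is needed whenever a Vulkan layer feeds a CPU layer or produces the network output. `countCopyToHost` is an illustrative name.

```cpp
#include <cstddef>
#include <vector>

// onVulkan[i] is true when layer i of a linear chain runs on the Vulkan
// backend. Counts device-to-host copies under the assumptions stated above.
size_t countCopyToHost(const std::vector<bool>& onVulkan)
{
    size_t copies = 0;
    for (size_t i = 0; i < onVulkan.size(); ++i) {
        if (!onVulkan[i])
            continue;  // a CPU layer produces host data already
        const bool isLast = (i + 1 == onVulkan.size());
        if (isLast || !onVulkan[i + 1])
            ++copies;  // output must be read back for a CPU consumer
    }
    return copies;
}
```

Under this model a chain that alternates Vulkan and CPU layers needs one copy per Vulkan layer, while a chain fully supported by Vulkan needs a single copy at the end, which is the motivation for supporting more layers first.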

@asmorkalov asmorkalov added this to the 4.10.0 milestone Jan 9, 2024
Member

@fengyuentau fengyuentau left a comment

@zihaomu Please review this PR as well.

@Haosonn Haosonn force-pushed the pre-pr-2 branch 3 times, most recently from 4ae98b5 to 836f0d1 Compare January 15, 2024 03:52
@fengyuentau
Member

Several tests failed:

  • objdetect:
     [RUN      ] Objdetect_face_detection.regression
    
  • video:
    [  FAILED  ] NanoTrack.accuracy_NanoTrack_V1
    [  FAILED  ] NanoTrack.accuracy_NanoTrack_V2
    

Also see https://pullrequest.opencv.org/buildbot/builders/precommit_linux64/builds/105934/steps/test_objdetect/logs/stdio, which looks like memory issues.

@fengyuentau fengyuentau requested a review from vpisarev January 19, 2024 07:23
@asmorkalov
Contributor

@Haosonn @fengyuentau please rebase and fix conflicts.

@asmorkalov
Contributor

@zihaomu @fengyuentau Could you take a look again?

Member

@fengyuentau fengyuentau left a comment

LGTM 👍 Thanks for the contribution!

Member

@zihaomu zihaomu left a comment

Thanks for your contribution! 👍

@asmorkalov asmorkalov merged commit 87f7492 into opencv:4.x Jan 29, 2024
JStech pushed a commit to JStech/opencv that referenced this pull request Jan 29, 2024
Vulkan backend for NaryEltwiseLayer in DNN module opencv#24768

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake

Co-authored-by: IskXCr <[email protected]>
@opencv-alalek
Contributor

This patch causes FP16 test failures: #24954

This was referenced Feb 3, 2024
@opencv-alalek
Contributor

I see performance degradation for this test case with 1/2/4 threads (no threading in implementation anyway) on 12700K:

| Name of Test | base | patch | x-factor |
|---|---|---|---|
| NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU | 157.908 | 169.737 | 0.93 |

To reviewers: PRs with optimization or other non-trivial implementation changes should have attached performance reports.
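
For readers unfamiliar with the x-factor column, the reported value is consistent with dividing the base time by the patched time, so a value below 1 means the patched build is slower. A minimal sketch of that arithmetic follows; `xFactor` and `slowdownPercent` are hypothetical helper names, not part of OpenCV's tooling.

```cpp
#include <cassert>

// Ratio of the base time to the patched time. Values below 1 indicate that
// the patched build takes longer than the base build.
double xFactor(double baseTime, double patchTime)
{
    assert(patchTime > 0.0);
    return baseTime / patchTime;
}

// Relative increase in run time of the patched build, in percent.
double slowdownPercent(double baseTime, double patchTime)
{
    assert(baseTime > 0.0);
    return 100.0 * (patchTime - baseTime) / baseTime;
}
```

For the row above, 157.908 / 169.737 is roughly 0.93, which corresponds to the patched build taking about 7.5% longer.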

@fengyuentau
Member

> I see performance degradation for this test case with 1/2/4 threads (no threading in implementation anyway) on 12700K:
>
> | Name of Test | base | patch | x-factor |
> |---|---|---|---|
> | NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU | 157.908 | 169.737 | 0.93 |
>
> To reviewers: PRs with optimization or other non-trivial implementation changes should have attached performance reports.

Pow is not supported yet in Vulkan backend. So I guess something else happened?

@opencv-alalek
Contributor

The regression is on CPU, not Vulkan.

@fengyuentau
Member

It looks weird to me that this patch made very limited changes to the CPU implementation and yet affected CPU performance, specifically for Pow only. Let me investigate it.

@fengyuentau
Member

fengyuentau commented Feb 9, 2024

Update: Oh, I see, use --perf_min_samples=100. I thought it was some kind of environment variable.


@opencv-alalek Do you know how to force opencv_perf_* to run 100 samples? I found that they run anywhere from 10 to 100 samples, which may lead to some mistakes.

@dkurt
Member

dkurt commented Feb 9, 2024

@fengyuentau, there is TEST_CYCLE_N, but it is marked as deprecated (it still works for individual tests):

TEST_CYCLE_N(100)
{
…
}

Or you may use --perf_min_samples=100 --perf_force_samples=100:

--perf_min_samples (value:10)
    minimal required numer of samples
--perf_force_samples (value:100)
    force set maximum number of samples for all tests

Sorry, I missed that you had already found --perf_min_samples.
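
To show why a fixed sample count matters when comparing two builds, here is a minimal timing sketch. It is a hypothetical harness, not OpenCV's perf framework: `median` and `medianTimeMs` are illustrative names, and the median is chosen here only as a simple robust statistic.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Median of a non-empty set of samples (takes a copy so it can sort).
double median(std::vector<double> v)
{
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Runs `body` exactly `samples` times and returns the median wall-clock time
// in milliseconds. Using the same fixed count for both the base and patched
// builds keeps the two measurements comparable.
double medianTimeMs(const std::function<void()>& body, size_t samples)
{
    assert(samples > 0);
    std::vector<double> times;
    times.reserve(samples);
    for (size_t i = 0; i < samples; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        body();
        const auto t1 = std::chrono::steady_clock::now();
        times.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return median(times);
}
```

If one run stops after 10 samples and another after 100, the two summaries are computed over different amounts of data, which is the kind of mismatch the options above are meant to rule out.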

@Haosonn Haosonn deleted the pre-pr-2 branch March 20, 2025 14:43