Conversation

@cv3d (Contributor) commented May 29, 2023

Pull Request Readiness Checklist

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • [N/A] There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@cv3d (Contributor, Author) commented May 29, 2023

@alalek This needs to be compiled together with https://github.com/cv3d/opencv/tree/feat/cuda_moments

Is there anything I should do in such a case?

@cv3d cv3d changed the title Introduce cuda::moments() WIP: Introduce cuda::moments() May 29, 2023
@cv3d cv3d marked this pull request as draft May 29, 2023 05:30
@cv3d cv3d changed the title WIP: Introduce cuda::moments() Introduce cuda::moments() May 29, 2023
@cv3d cv3d force-pushed the feat/cuda_moments branch from ea0c148 to 2a7b8a8 on May 29, 2023 07:30
@cv3d cv3d marked this pull request as ready for review May 29, 2023 07:30
@cv3d cv3d force-pushed the feat/cuda_moments branch 2 times, most recently from 46a54b8 to 247f307 on May 29, 2023 09:16
@cv3d (Contributor, Author) commented May 29, 2023

Hi @asmorkalov @cudawarped, can you please help?

The code speed is not that great, despite trying shared memory in the first commit. Actually, it is faster without shared memory. What do you think?

Thanks~

@cv3d cv3d marked this pull request as draft May 29, 2023 09:34
@cv3d cv3d force-pushed the feat/cuda_moments branch from 21fa628 to aa3ca5b on May 29, 2023 13:03
@cudawarped (Contributor) commented May 30, 2023

> The code speed is not that great, despite trying shared memory in the first commit. Actually, it is faster without shared memory.

I am not sure I fully understand the shared memory implementation, but it throws

```
unknown file: error: C++ exception with description "OpenCV(4.7.0-dev) D:\repos\opencv\contrib\modules\cudaimgproc\src\cuda\moments.cu:183: error: (-217:Gpu API call) an illegal memory access was encountered in function 'cv::cuda::device::imgproc::Moments'
```

on my system. I would expect shared memory to be quicker when the shapes are large and/or mainly contained in a single block.

Is there any way you can calculate the centroid online inside ComputeSpatialMoments (without using the results moments[m10], moments[m00], moments[m01]), to avoid having to launch the separate single-thread ComputeCenteroid kernel?

What sort of performance are you looking for? On my system the kernel processing time for smaller shapes is ~18 microseconds and for larger shapes ~55 microseconds, although I suspect that with shared memory there would be much less difference.

In order of ease and possible performance boost, I would first look at:

  1. Improving memory coalescing of the reads. Currently you are reading 16 bytes per warp when the memory transaction line size is usually 128 bytes. When operating on uint8 it is better to have blockDim.x == 32 and read one float per thread, with each thread processing 4 bytes (see the sketch after this list).
  2. Fixing the shared memory implementation.
  3. Using float instead of double. If float accuracy is good enough, use it instead of double, which is slower. This may have limited effect, as the atomic operations probably have the most overhead, but in theory coalesced writes to float should be better.
  4. Removing the intermediate ComputeCenteroid kernel if possible. I understand this may not be as efficient, but it is just something I would look at if I had time.
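A minimal sketch of the coalesced-read idea from point 1, assuming 4-byte-aligned rows and showing only m00/m10/m01 for brevity; the kernel name and signature are illustrative, not this PR's actual code:

```cuda
// Each thread loads one 32-bit word (four packed uchar pixels), so a
// 32-thread warp covers a full 128-byte transaction instead of 16 bytes.
__global__ void MomentsCoalesced(const unsigned char* img, size_t step,
                                 int rows, int wordsPerRow, double* moments)
{
    const int xWord = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (xWord >= wordsPerRow || y >= rows) return;

    // One coalesced 4-byte load; assumes img and step are 4-byte aligned.
    const unsigned int word =
        reinterpret_cast<const unsigned int*>(img + y * step)[xWord];

    double s00 = 0.0, s10 = 0.0, s01 = 0.0;
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        const double p = (word >> (8 * i)) & 0xFFu;  // unpack pixel i
        const int x = 4 * xWord + i;
        s00 += p;
        s10 += p * x;
        s01 += p * y;
    }
    // atomicAdd on double requires compute capability 6.0 or newer.
    atomicAdd(&moments[0], s00);  // m00
    atomicAdd(&moments[1], s10);  // m10
    atomicAdd(&moments[2], s01);  // m01
}
```

The unpacking costs a few extra register operations per pixel, which is usually a good trade against the 8x improvement in effective load width.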

@cudawarped (Contributor) commented Jun 1, 2023

@cv3d So I had a quick look at a shared memory implementation of ComputeSpatialMoments. As I suspected, it is quicker as long as the shape is not too small relative to the size of the image. Using atomics on a small number of pixels is faster than writing to shared memory and performing reductions on that memory.

The implementation and tests are in 5eb4848 if you want to take a look.

Hopefully the code demonstrates that you can benefit from using shared memory. That said, there could be bugs, and I suspect there may be shared memory bank conflicts which could be avoided.

I only looked at the first calculation, where, as you would expect, for large images (1920x1920) the CUDA routine is much (100x) faster than the CPU version, but for small images (128x128) the CPU version is faster.

I also looked at reading int instead of uchar to improve memory throughput. There appears to be a slight improvement when the image is large enough to saturate the GPU, but nothing dramatic, as I suspect this kernel is compute bound. If you did use this approach, you would have to take care when the image width is not a multiple of 4.

There is definitely room for improvement.
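To make the trade-off concrete, here is a much-simplified sketch of the shared-memory idea (the actual implementation in 5eb4848 uses proper per-block reductions; all names below are illustrative): block-local accumulators absorb the per-pixel traffic, leaving only one global atomicAdd per block and moment:

```cuda
// Shared-memory variant: per-pixel atomics go to fast on-chip memory,
// and each block issues just three global atomics at the end.
// (atomicAdd on double needs compute capability 6.0+.)
__global__ void SpatialMomentsShared(const unsigned char* img, size_t step,
                                     int rows, int cols, double* moments)
{
    __shared__ double sm[3];  // block-local m00, m10, m01
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid < 3) sm[tid] = 0.0;
    __syncthreads();

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cols && y < rows) {
        const double p = img[y * step + x];
        atomicAdd(&sm[0], p);      // shared-memory atomics are far
        atomicAdd(&sm[1], p * x);  // cheaper than global ones
        atomicAdd(&sm[2], p * y);
    }
    __syncthreads();

    if (tid < 3) atomicAdd(&moments[tid], sm[tid]);  // one per block/moment
}
```

This also shows why small shapes lose: for a handful of set pixels, the fixed per-block synchronization overhead outweighs the saved global atomics.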

```cuda
}

__global__ void ComputeCenteroid(const double* moments, double2* centroid) {
    centroid->x = moments[m10] / moments[m00];
```
A contributor commented on the lines above:

It may be faster to calculate this every time inside ComputeCenteralMoments rather than launching a separate kernel.
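One hedged way to realize this, assuming the PR's m00/m10/m01 index names (their numeric values here are guesses), is a device helper called directly from the central-moments kernel; the helper name is illustrative:

```cuda
// Assumed indices into the spatial-moments array; the PR defines its own.
enum MomentIdx { m00 = 0, m10 = 1, m01 = 2 };

// Each thread derives the centroid from the finished spatial moments.
// Two divisions per thread are far cheaper than launching a separate
// single-thread ComputeCenteroid kernel.
__device__ __forceinline__ double2 CentroidOf(const double* moments)
{
    return make_double2(moments[m10] / moments[m00],
                        moments[m01] / moments[m00]);
}
```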

@cv3d (Contributor, Author) commented Jun 15, 2023

@cudawarped Your implementation is just amazing. Would you mind finishing your work and making a PR to replace this one? I really do not mind closing this one. All that matters to me is getting such functionality into CUDA, and I will be extremely happy if it is done very well.

@cudawarped (Contributor) commented:

> @cudawarped Your implementation is just amazing. Would you mind finishing your work and making a PR to replace this one? I really do not mind closing this one. All that matters to me is getting such functionality into CUDA, and I will be extremely happy if it is done very well.

@cv3d Of course, anything for the person who gave us CUDA Python bindings! I'll take a look at this next week. In the meantime, can you take a look at my initial findings and questions below and add any comments or findings of your own.

From my initial investigation on an RTX 3070 Ti (Ampere) it appears that:

  1. Using shared memory is significantly faster than atomics.
  2. Reading floats instead of uchar is only faster when the GPU is fully saturated (large images) and the shape is small relative to the size of the image (shape w/h < 1/4 of the image w/h); when the image is larger, the kernels appear to be compute rather than memory bound.
  3. Double precision is necessary to attain "reasonable" accuracy when calculating higher-order and central moments for larger images. That said, it can be 10x slower than single precision (on Ampere, double-precision throughput is 1/64 of single-precision), making it only slightly faster than the CPU version for "large" uchar images (GPU saturated) and slower for "small" uchar images (a templated sketch follows this list).
  4. The CPU moments calculation is significantly (up to 5x) faster with uchar data than with binary data in the current CPU implementation.
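As a hedged illustration of the float-vs-double trade-off in point 3, the accumulator type could be a template parameter so callers choose speed or accuracy; everything here (names, minimal m00-only kernel) is illustrative rather than the PR's code:

```cuda
// T = float: fast on consumer GPUs; T = double: accurate, but 1/64 of
// single-precision throughput on Ampere (and atomicAdd(double*) needs
// compute capability 6.0+).
template <typename T>
__global__ void AccumulateM00(const unsigned char* img, size_t step,
                              int rows, int cols, T* m00)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cols && y < rows)
        atomicAdd(m00, static_cast<T>(img[y * step + x]));
}
```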

Given the trade-offs above, under what circumstances would you use this function on the GPU?

  • Would the result be used on the host or the device? If it's the host, then to me it only makes sense if it is used after a series of device-side functions where the data is already on the GPU.
  • Would it be used on large or small images? Small images are slower, as they do not fully utilize all the SMs on the device; if no other GPU work is being performed, this would be wasteful. I'm asking because I can imagine passing smaller image ROIs containing shapes from a larger image.
  • Will the shapes in general represent a large or small portion of the image? Again, if this was mainly used on small image ROIs, the shape would represent a large portion of the image, which would make reading floats instead of uchar (point 2) redundant.
  • Is the creation of this function just a way to speed up the calculation of image moments for binary data? If so, it would probably be better to invest time in optimizing the CPU version, or in converting binary to uchar on the host.

@asmorkalov (Contributor) commented:

@cv3d @cudawarped The OpenCV team is working on release preparation right now. Please hurry up with the PR if you want it to be included in 4.8.0.

@cv3d (Contributor, Author) commented Jun 15, 2023

@cudawarped Thanks a lot for your willingness and kind words.

I do have an NVIDIA GeForce RTX 3070 GPU, and my limited experiments led to conclusions similar to yours (except that I did not notice observable differences between float and uchar input). My initial shared memory version was working (somehow it seems I broke it without noticing) but was not efficient, so I felt the atomic operations were better; yours is a clear winner. I think the float version is reasonable enough, and making it the default might be a good decision, since speed is what we usually want to achieve with the GPU. If one wants precision, then maybe specify the type as an additional argument, or just use the CPU version? By the way, it seems you are disabling the last two implementations; I find the full reduction using float and coalesced reads the most efficient, but its cumulative error is also a bit higher.

For my use case, I have 1080 x 1440 input images, and the moments are computed from ROIs (sometimes large ones, depending on distance etc.). While the result is for the host, the moments call comes after a series of GPU calls, where downloading the image(s) and calling the CPU version is very costly. Indeed, I was surprised how slow the CPU version is for binary input, but the biggest motivation was to get rid of the need to download the image(s) just to compute their moments. That being said, I think this function will be useful in use cases other than mine.

Hope this answers your questions; if I missed a point, please let me know.

Thanks a lot~

@cudawarped (Contributor) commented:

> except that I did not notice observable differences between float and uchar input

> By the way, it seems you are disabling the last two implementations; I find the full reduction using float and coalesced reads the most efficient, but its cumulative error is also a bit higher.

My post above was probably confusing. These are the same thing: reading 4 uchars as one float to increase memory throughput by avoiding cache misses through coalesced reads.

The problem with the coalesced version is that it only works for ROIs which start on a 4-byte boundary; otherwise the non-coalesced version would need to be called, or I'd need to modify it to work "efficiently" in that case (see the alignment check sketched below). In my experiments the coalesced version is only faster when the number of pixels representing the shape inside the ROI is small compared to the size of the ROI. If this is not a realistic scenario, I can't see a reason to work further on the coalesced version.
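As a hedged sketch of such a dispatch test (the function name is hypothetical, and the PR may handle this differently), the packed 32-bit loads are only safe when the ROI's data pointer, row stride, and width are all multiples of 4:

```cpp
#include <cstdint>
#include <opencv2/core/cuda.hpp>

// Decide at launch time whether the coalesced kernel can be used;
// otherwise fall back to the scalar per-uchar kernel.
static bool canUseCoalescedReads(const cv::cuda::GpuMat& roi)
{
    return reinterpret_cast<std::uintptr_t>(roi.data) % 4 == 0 &&
           roi.step % 4 == 0 &&
           roi.cols % 4 == 0;
}
```

An unaligned ROI would then simply take the non-coalesced path rather than failing.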

@cv3d (Contributor, Author) commented Jun 16, 2023

Yes, I naively misunderstood it as merely casting the uchar input image pixels into float. I guess most users will have a high shape-to-ROI-size ratio, so maybe the coalesced version is not necessary. In my use case, shapes can be long, thin, and rotated, so I really benefit from the coalesced version, but I cannot be so greedy as to ask for it if it means complicating things.

Thanks a lot @cudawarped, I highly appreciate your help~

@cudawarped (Contributor) commented:

@cv3d Last question. If the result is 99.99% of the time going to be used on the CPU, is there any reason not to calculate the central and normalized moments on the CPU using the cv::Moments constructor (< 1 us of CPU time) after performing the heavy lifting (calculating the spatial moments) on the GPU?
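For context, a minimal sketch of that split on the host side, assuming the ten spatial moments have already been computed on the GPU and downloaded in m00..m03 order (the helper name and array layout are assumptions):

```cpp
#include <opencv2/core.hpp>

// Finish the moments on the CPU: this cv::Moments constructor derives all
// central (mu*) and normalized (nu*) moments from the spatial ones.
cv::Moments momentsFromSpatial(const double m[10])
{
    return cv::Moments(m[0], m[1], m[2],        // m00, m10, m01
                       m[3], m[4], m[5],        // m20, m11, m02
                       m[6], m[7], m[8], m[9]); // m30, m21, m12, m03
}
```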

@cv3d (Contributor, Author) commented Jun 22, 2023

@cudawarped Sorry for the late reply.

For me, no, there is no reason to do it all on the GPU side. Using the cv::Moments constructor is fine by me.
If we stumble on any reason to do it on the GPU, we can have a follow-up PR :)

@cudawarped (Contributor) commented:

@cv3d I should have a version of this for you to test by tomorrow.

@cudawarped cudawarped mentioned this pull request Jun 29, 2023
@cv3d (Contributor, Author) commented Jul 6, 2023

Closed in favor of #3516

@cv3d cv3d closed this Jul 6, 2023