Conversation

@cv3d (Contributor) commented May 29, 2023

Pull Request Readiness Checklist

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • [N/A] There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@cv3d (Contributor, Author) commented May 29, 2023

@alalek This needs to be compiled together with https://github.com/cv3d/opencv/tree/feat/cuda_moments

Is there anything I should do in such a case?

@cv3d cv3d changed the title Introduce cuda::moments() WIP: Introduce cuda::moments() May 29, 2023
@cv3d cv3d marked this pull request as draft May 29, 2023 05:30
@cv3d cv3d changed the title WIP: Introduce cuda::moments() Introduce cuda::moments() May 29, 2023
@cv3d cv3d force-pushed the feat/cuda_moments branch from ea0c148 to 2a7b8a8 on May 29, 2023 07:30
@cv3d cv3d marked this pull request as ready for review May 29, 2023 07:30
@cv3d cv3d force-pushed the feat/cuda_moments branch 2 times, most recently from 46a54b8 to 247f307 on May 29, 2023 09:16
@cv3d (Contributor, Author) commented May 29, 2023

Hi @asmorkalov @cudawarped, can you please help?

The code speed is not that great, despite trying shared memory in the first commit. Actually, it is faster without shared memory. What do you think?

Thanks~

@cv3d cv3d marked this pull request as draft May 29, 2023 09:34
@cv3d cv3d force-pushed the feat/cuda_moments branch from 21fa628 to aa3ca5b on May 29, 2023 13:03
@cudawarped (Contributor) commented May 30, 2023

> The code speed is not that great, despite trying shared memory in the first commit. Actually, it is faster without shared memory.

I am not sure I fully understand the shared memory implementation, but it throws

```
unknown file: error: C++ exception with description "OpenCV(4.7.0-dev) D:\repos\opencv\contrib\modules\cudaimgproc\src\cuda\moments.cu:183: error: (-217:Gpu API call) an illegal memory access was encountered in function 'cv::cuda::device::imgproc::Moments'
```

on my system. I would expect shared memory to be quicker when the shapes are large and/or mainly contained in a single block.

Is there any way you can calculate the centroid online inside ComputeSpatialMoments (without using the results moments[m10], moments[m00], moments[m01]), to avoid having to launch the separate single-thread ComputeCenteroid kernel?

What sort of performance are you looking for? On my system the kernel processing time for smaller shapes is ~18 microseconds and for larger shapes ~55 microseconds, although I suspect that with shared memory there would be much less difference.

In order of ease and possible performance boost, I would first look at:

  1. Improving memory coalescing of the reads. Currently you are reading 16 bytes per warp when the memory transaction line size is usually 128 bytes. When operating on uint8 it is better to have blockDim.x == 32 and read one float per thread, with each thread processing 4 bytes (see the sketch after this list).
  2. Fixing the shared memory implementation.
  3. Using float instead of double. If float accuracy is good enough, use it instead of double, which is slower. This may have limited effect, as the atomic operations probably have the most overhead, but in theory coalesced writes to float should be better.
  4. Removing the intermediate ComputeCenteroid kernel if possible. I understand this may not be as efficient, but it is just something I would look at if I had time.
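A minimal sketch of the coalesced-read idea from point 1, assuming 4-byte-aligned rows and showing only m00/m10/m01 for brevity; the kernel name and signature are illustrative, not this PR's actual code:

```cuda
// Each thread loads one 32-bit word (four packed uchar pixels), so a
// 32-thread warp covers a full 128-byte transaction instead of 16 bytes.
__global__ void MomentsCoalesced(const unsigned char* img, size_t step,
                                 int rows, int wordsPerRow, double* moments)
{
    const int xWord = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (xWord >= wordsPerRow || y >= rows) return;

    // One coalesced 4-byte load; assumes img and step are 4-byte aligned.
    const unsigned int word =
        reinterpret_cast<const unsigned int*>(img + y * step)[xWord];

    double s00 = 0.0, s10 = 0.0, s01 = 0.0;
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        const double p = (word >> (8 * i)) & 0xFFu;  // unpack pixel i
        const int x = 4 * xWord + i;
        s00 += p;
        s10 += p * x;
        s01 += p * y;
    }
    // atomicAdd on double requires compute capability 6.0 or newer.
    atomicAdd(&moments[0], s00);  // m00
    atomicAdd(&moments[1], s10);  // m10
    atomicAdd(&moments[2], s01);  // m01
}
```

The unpacking costs a few extra register operations per pixel, which is usually a good trade against the 8x improvement in effective load width.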

@cudawarped (Contributor) commented Jun 1, 2023

@cv3d So I had a quick look at a shared memory implementation of ComputeSpatialMoments. As I suspected, it is quicker as long as the shape is not too small relative to the size of the image. Using atomics on a small number of pixels is faster than writing to shared memory and performing reductions on that memory.

The implementation and tests are in 5eb4848 if you want to take a look.

Hopefully the code demonstrates that you can benefit from using shared memory. That said, there could be bugs, and I suspect there may be shared memory bank conflicts which could be avoided.

I only looked at the first calculation, where, as you would expect, for large images (1920x1920) the CUDA routine is much (100x) faster than the CPU version, but for small images (128x128) the CPU version is faster.

I also looked at reading int instead of uchar to improve memory throughput. There appears to be a slight improvement when the image is large enough to saturate the GPU, but nothing dramatic, as I suspect this kernel is compute bound. If you did use this approach, you would have to take care when the image width is not a multiple of 4.

There is definitely room for improvement.
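To make the trade-off concrete, here is a much-simplified sketch of the shared-memory idea (the actual implementation in 5eb4848 uses proper per-block reductions; all names below are illustrative): block-local accumulators absorb the per-pixel traffic, leaving only one global atomicAdd per block and moment:

```cuda
// Shared-memory variant: per-pixel atomics go to fast on-chip memory,
// and each block issues just three global atomics at the end.
// (atomicAdd on double needs compute capability 6.0+.)
__global__ void SpatialMomentsShared(const unsigned char* img, size_t step,
                                     int rows, int cols, double* moments)
{
    __shared__ double sm[3];  // block-local m00, m10, m01
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid < 3) sm[tid] = 0.0;
    __syncthreads();

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cols && y < rows) {
        const double p = img[y * step + x];
        atomicAdd(&sm[0], p);      // shared-memory atomics are far
        atomicAdd(&sm[1], p * x);  // cheaper than global ones
        atomicAdd(&sm[2], p * y);
    }
    __syncthreads();

    if (tid < 3) atomicAdd(&moments[tid], sm[tid]);  // one per block/moment
}
```

This also shows why small shapes lose: for a handful of set pixels, the fixed per-block synchronization overhead outweighs the saved global atomics.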

```cuda
}

__global__ void ComputeCenteroid(const double* moments, double2* centroid) {
    centroid->x = moments[m10] / moments[m00];
```
A contributor commented on the lines above:

It may be faster to calculate this every time inside ComputeCenteralMoments rather than launching a separate kernel.
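One hedged way to realize this, assuming the PR's m00/m10/m01 index names (their numeric values here are guesses), is a device helper called directly from the central-moments kernel; the helper name is illustrative:

```cuda
// Assumed indices into the spatial-moments array; the PR defines its own.
enum MomentIdx { m00 = 0, m10 = 1, m01 = 2 };

// Each thread derives the centroid from the finished spatial moments.
// Two divisions per thread are far cheaper than launching a separate
// single-thread ComputeCenteroid kernel.
__device__ __forceinline__ double2 CentroidOf(const double* moments)
{
    return make_double2(moments[m10] / moments[m00],
                        moments[m01] / moments[m00]);
}
```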

@cv3d (Contributor, Author) commented Jun 15, 2023

@cudawarped Your implementation is just amazing. Would you mind finishing your work and making a PR to replace this one? I really do not mind closing this one. All that matters to me is getting such functionality into CUDA, and I will be extremely happy if it is done very well.

@cudawarped (Contributor) commented:

> @cudawarped Your implementation is just amazing. Would you mind finishing your work and making a PR to replace this one? I really do not mind closing this one. All that matters to me is getting such functionality into CUDA, and I will be extremely happy if it is done very well.

@cv3d Of course, anything for the person who gave us CUDA Python bindings! I'll take a look at this next week. In the meantime, can you take a look at my initial findings and questions below and add any comments or findings of your own.

From my initial investigation on an RTX 3070 Ti (Ampere) it appears that:

  1. Using shared memory is significantly faster than atomics.
  2. Reading floats instead of uchar is only faster when the GPU is fully saturated (large images) and the shape is small relative to the size of the image (shape w/h < 1/4 of the image w/h); when the image is larger, the kernels appear to be compute rather than memory bound.
  3. Double precision is necessary to attain "reasonable" accuracy when calculating higher-order and central moments for larger images. That said, it can be 10x slower than single precision (on Ampere, double-precision throughput is 1/64 of single-precision), making it only slightly faster than the CPU version for "large" uchar images (GPU saturated) and slower for "small" uchar images (a templated sketch follows this list).
  4. The CPU moments calculation is significantly (up to 5x) faster with uchar data than with binary data in the current CPU implementation.
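As a hedged illustration of the float-vs-double trade-off in point 3, the accumulator type could be a template parameter so callers choose speed or accuracy; everything here (names, minimal m00-only kernel) is illustrative rather than the PR's code:

```cuda
// T = float: fast on consumer GPUs; T = double: accurate, but 1/64 of
// single-precision throughput on Ampere (and atomicAdd(double*) needs
// compute capability 6.0+).
template <typename T>
__global__ void AccumulateM00(const unsigned char* img, size_t step,
                              int rows, int cols, T* m00)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cols && y < rows)
        atomicAdd(m00, static_cast<T>(img[y * step + x]));
}
```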

Given the trade-offs above, under what circumstances would you use this function on the GPU?

  • Would the result be used on the host or the device? If it's the host, then to me it only makes sense if it is used after a series of device-side functions where the data is already on the GPU.
  • Would it be used on large or small images? Small images are slower, as they do not fully utilize all the SMs on the device; if no other GPU work is being performed, this would be wasteful. I'm asking because I can imagine passing smaller image ROIs containing shapes from a larger image.
  • Will the shapes in general represent a large or small portion of the image? Again, if this was mainly used on small image ROIs, the shape would represent a large portion of the image, which would make reading floats instead of uchar (point 2) redundant.
  • Is the creation of this function just a way to speed up the calculation of image moments for binary data? If so, it would probably be better to invest time in optimizing the CPU version, or in converting binary to uchar on the host.

@asmorkalov (Contributor) commented:

@cv3d @cudawarped The OpenCV team is working on release preparation right now. Please hurry up with the PR if you want it to be included in 4.8.0.

@cv3d (Contributor, Author) commented Jun 15, 2023

@cudawarped Thanks a lot for your willingness and kind words.

I do have an NVIDIA GeForce RTX 3070 GPU, and my limited experiments led to conclusions similar to yours (except that I did not notice observable differences between float and uchar input). My initial shared memory version was working (somehow it seems I broke it without noticing) but was not efficient, so I felt the atomic operations were better; yours is a clear winner. I think the float version is reasonable enough, and making it the default might be a good decision, since speed is what we usually want to achieve with the GPU. If one wants precision, then maybe specify the type as an additional argument, or just use the CPU version? By the way, it seems you are disabling the last two implementations; I find the full reduction using float and coalesced reads the most efficient, but its cumulative error is also a bit higher.

For my use case, I have 1080 x 1440 input images, and the moments are computed from ROIs (sometimes large ones, depending on distance etc.). While the result is for the host, the moments call comes after a series of GPU calls, where downloading the image(s) and calling the CPU version is very costly. Indeed, I was surprised how slow the CPU version is for binary input, but the biggest motivation was to get rid of the need to download the image(s) just to compute their moments. That being said, I think this function will be useful in use cases other than mine.

Hope this answers your questions; if I missed a point, please let me know.

Thanks a lot~

@cudawarped (Contributor) commented:

> except that I did not notice observable differences between float and uchar input

> By the way, it seems you are disabling the last two implementations; I find the full reduction using float and coalesced reads the most efficient, but its cumulative error is also a bit higher.

My post above was probably confusing. These are the same thing: reading 4 uchars as one float to increase memory throughput by avoiding cache misses through coalesced reads.

The problem with the coalesced version is that it only works for ROIs which start on a 4-byte boundary; otherwise the non-coalesced version would need to be called, or I'd need to modify it to work "efficiently" in that case (see the alignment check sketched below). In my experiments the coalesced version is only faster when the number of pixels representing the shape inside the ROI is small compared to the size of the ROI. If this is not a realistic scenario, I can't see a reason to work further on the coalesced version.
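As a hedged sketch of such a dispatch test (the function name is hypothetical, and the PR may handle this differently), the packed 32-bit loads are only safe when the ROI's data pointer, row stride, and width are all multiples of 4:

```cpp
#include <cstdint>
#include <opencv2/core/cuda.hpp>

// Decide at launch time whether the coalesced kernel can be used;
// otherwise fall back to the scalar per-uchar kernel.
static bool canUseCoalescedReads(const cv::cuda::GpuMat& roi)
{
    return reinterpret_cast<std::uintptr_t>(roi.data) % 4 == 0 &&
           roi.step % 4 == 0 &&
           roi.cols % 4 == 0;
}
```

An unaligned ROI would then simply take the non-coalesced path rather than failing.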

@cv3d (Contributor, Author) commented Jun 16, 2023

Yes, I naively misunderstood it as merely casting the uchar input image pixels into float. I guess most users will have a high shape-to-ROI-size ratio, so maybe the coalesced version is not necessary. In my use case, shapes can be long, thin, and rotated, so I really benefit from the coalesced version, but I cannot be so greedy as to ask for it if it means complicating things.

Thanks a lot @cudawarped, I highly appreciate your help~

@cudawarped (Contributor) commented:

@cv3d Last question. If the result is 99.99% of the time going to be used on the CPU, is there any reason not to calculate the central and normalized moments on the CPU using the cv::Moments constructor (< 1 us of CPU time) after performing the heavy lifting (calculating the spatial moments) on the GPU?
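For context, a minimal sketch of that split on the host side, assuming the ten spatial moments have already been computed on the GPU and downloaded in m00..m03 order (the helper name and array layout are assumptions):

```cpp
#include <opencv2/core.hpp>

// Finish the moments on the CPU: this cv::Moments constructor derives all
// central (mu*) and normalized (nu*) moments from the spatial ones.
cv::Moments momentsFromSpatial(const double m[10])
{
    return cv::Moments(m[0], m[1], m[2],        // m00, m10, m01
                       m[3], m[4], m[5],        // m20, m11, m02
                       m[6], m[7], m[8], m[9]); // m30, m21, m12, m03
}
```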

@cv3d (Contributor, Author) commented Jun 22, 2023

@cudawarped Sorry for the late reply.

For me, no, there is no reason to do it all on the GPU side. Using the cv::Moments constructor is fine by me.
If we stumble on any reason to do it on the GPU, we can have a follow-up PR :)

@cudawarped (Contributor) commented:

@cv3d I should have a version of this for you to test by tomorrow.

@cudawarped cudawarped mentioned this pull request Jun 29, 2023
@cv3d (Contributor, Author) commented Jul 6, 2023

Closed in favor of #3516

@cv3d cv3d closed this Jul 6, 2023