Introduce cuda::moments() #3500
Conversation
@alalek This needs to be compiled against https://github.com/cv3d/opencv/tree/feat/cuda_moments. Is there anything I should do in this case?
Force-pushed 46a54b8 to 247f307
Hi @asmorkalov @cudawarped, can you please help? The code is not that fast, despite trying shared memory in the first commit; it is actually faster without shared memory. What do you think? Thanks~
I am not sure I fully understand the shared memory implementation, but it throws on my system. I would expect shared memory to be quicker when the shapes are large and/or mainly contained in a single block. What sort of performance are you looking for? On my system the kernel processing time for the smaller shapes is ~18 microseconds and for the larger shapes ~55 microseconds, although I suspect that with shared memory there should be a lot less difference. There are several things I would look at, in order of ease and possible performance boost.
@cv3d So I had a quick look at a shared memory implementation of `moments`. The implementation and tests are in 5eb4848 if you want to take a look. Hopefully the code demonstrates that you can benefit from using shared memory. That said, there could be bugs, and I suspect there may be shared memory bank conflicts which can be avoided. I only looked at the first calculation, where, as you would expect, for large images (1920x1920) the CUDA routine is much (100x) faster than the CPU version, but for small images (128x128) the CPU version is faster. I also looked at reading 4 uchars at a time as a single 32-bit value. There is definitely room for improvement.
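A minimal sketch of the shared-memory pattern under discussion (not the code in 5eb4848; the kernel name and the restriction to zeroth- and first-order raw moments are illustrative): each block accumulates partial moments in shared memory, and only one atomic per moment per block touches global memory.

```cpp
// Sketch only: per-block accumulation of m00, m10, m01 in shared memory,
// folded into the global result with one atomicAdd per moment per block.
__global__ void momentsSharedSketch(const unsigned char* img, size_t step,
                                    int rows, int cols,
                                    float* m /* m00, m10, m01 */)
{
    __shared__ float sm[3];
    const int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid < 3) sm[tid] = 0.f;
    __syncthreads();

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cols && y < rows) {
        const float p = img[y * step + x];
        atomicAdd(&sm[0], p);     // m00
        atomicAdd(&sm[1], p * x); // m10
        atomicAdd(&sm[2], p * y); // m01
    }
    __syncthreads();

    if (tid < 3) atomicAdd(&m[tid], sm[tid]); // one global atomic per moment
}
```

Shared-memory atomics keep the heavy contention on-chip; global traffic drops from one atomic per pixel to one per moment per block, which is where the win over a plain atomic implementation comes from.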
```cpp
}

__global__ void ComputeCenteroid(const double* moments, double2* centroid) {
    centroid->x = moments[m10] / moments[m00];
```
It may be faster to calculate this every time inside ComputeCenteralMoments rather than launching a separate kernel.
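A sketch of the fusion being suggested, assuming m00/m10/m01 are index constants into the raw-moments array as in the snippet above (the kernel name and mu indices are hypothetical): every thread recomputes the centroid locally, which costs two divisions but saves a kernel launch and a global-memory round trip.

```cpp
// Illustrative indices into the raw-moments and central-moments arrays.
enum RawIdx { m00, m10, m01 };
enum MuIdx  { mu20, mu02 };

// Hypothetical fused kernel: derive the centroid from the raw moments in
// every thread instead of launching a separate ComputeCenteroid kernel.
// Note: atomicAdd on double requires compute capability 6.0 or newer.
__global__ void centralMomentsFusedSketch(const unsigned char* img, size_t step,
                                          int rows, int cols,
                                          const double* moments, double* mu)
{
    const double cx = moments[m10] / moments[m00]; // recomputed per thread
    const double cy = moments[m01] / moments[m00];

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= cols || y >= rows) return;

    const double p = img[y * step + x];
    atomicAdd(&mu[mu20], p * (x - cx) * (x - cx)); // second-order examples
    atomicAdd(&mu[mu02], p * (y - cy) * (y - cy));
}
```

The division is redundant work per thread, but removing the tiny extra launch and its global round trip for a handful of doubles usually wins.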
@cudawarped Your implementation is just amazing. Would you mind finishing your work and making a PR instead of this one? I really do not mind closing this one. All that matters to me is to get such functionality in CUDA, and I will be extremely happy if that is done very well.
@cv3d Of course, anything for the person that gave us the CUDA Python bindings! I'll take a look at this next week. In the meantime, can you take a look at my initial findings and question below and add any comments or findings of your own? From my initial investigation on an RTX 3070 Ti (Ampere), it appears that there are several speed/precision trade-offs between the implementations.
Given the trade-offs above, under what circumstances would you use this function on the GPU?
@cv3d @cudawarped The OpenCV team is working on release preparation right now. Please hurry up with the PR if you want it to be included in 4.8.0.
@cudawarped Thanks a lot for your willingness and kind words. I have an NVIDIA GeForce RTX 3070 GPU, and my limited experiments led to similar conclusions to yours (except that I did not notice observable differences between float and uchar input). My initial shared memory version was working (somehow it seems I broke it without noticing) but was not efficient, so I felt the atomic operations were better, but yours is a clear winner.

I think the float version is reasonable enough, and making it the default might be a good decision, since speed is what we usually want to achieve on the GPU. If one wants precision, then maybe specify the type as an additional argument, or just use the CPU version? By the way, it seems you are disabling the last two implementations; I find the full reduction using float and coalesced reads the most efficient, but its cumulative error is also a bit higher.

For my use case, I have 1080 x 1440 input images, and the moments are computed from ROIs (sometimes large ones, depending on distance etc.). While the result is for the host, the moments call comes after a series of GPU calls, so downloading the image(s) and calling the CPU version is very costly. Indeed, I was surprised how slow the CPU version is for binary input, but the biggest motivation was to get rid of the need to download the image(s) just to compute their moments. That said, I think this function will be useful for other use cases besides mine. Hope this answers your questions, but if I missed a point, please let me know. Thanks a lot~
My post above was probably confusing. These are the same thing: reading 4 uchars as a float to increase the memory throughput by avoiding cache misses with coalesced reads. The problem with the coalesced version is that it will only work for ROIs which start on a 4-byte boundary; otherwise the non-coalesced version would need to be called, or I'd need to try and modify it to work "efficiently" in this case. In my experiments the coalesced version is only faster when the number of pixels representing the shape inside the ROI is small compared to the size of the ROI. If this is not a realistic scenario, then I can't see a reason for working further on the coalesced version.
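A sketch of the coalesced read in question, assuming the ROI pointer and step are 4-byte aligned (the boundary restriction mentioned above); the kernel name is illustrative, only m00 is accumulated, and tail handling for widths not divisible by 4 is omitted. A uchar4 is used here for the single 32-bit load rather than reinterpreting as float, which keeps the arithmetic in exact integers.

```cpp
// Sketch only: one 32-bit load fetches four uchar pixels at once, so each
// warp pulls 128 contiguous bytes per load instruction instead of 32.
__global__ void m00CoalescedSketch(const unsigned char* img, size_t step,
                                   int rows, int cols4 /* cols / 4 */,
                                   unsigned int* m00)
{
    const int x4 = blockIdx.x * blockDim.x + threadIdx.x;
    const int y  = blockIdx.y * blockDim.y + threadIdx.y;
    if (x4 >= cols4 || y >= rows) return;

    // Requires img + y * step to be 4-byte aligned, hence the ROI restriction.
    const uchar4 p = reinterpret_cast<const uchar4*>(img + y * step)[x4];
    atomicAdd(m00, (unsigned int)p.x + p.y + p.z + p.w);
}
```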
Yes, I naively misunderstood it as merely casting the uchar input pixels into float. I guess most users will have a high ratio of shape size to ROI size, so maybe the coalesced version is not necessary. In my use case, shapes can be long, thin, and rotated, so I really benefit from the coalesced version, but I cannot be so greedy as to ask for it if it means complicating things. Thanks a lot @cudawarped, I highly appreciate your help~
@cv3d Last question. If the result is 99.99% of the time going to be used on the CPU, is there any reason for not calculating the central and normalized moments on the CPU?
@cudawarped Sorry for being late. For me, no, there is no reason to do it all on the GPU side.
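For reference, a sketch of the host-side completion discussed here, assuming the device returns the ten raw spatial moments in the usual m00…m03 order; the helper name is hypothetical. The cv::Moments constructor derives the central (mu) and normalized (nu) fields on the CPU from the raw moments.

```cpp
#include <opencv2/core.hpp> // cv::Moments

// Hypothetical glue: hand the ten device-computed raw moments
// (m00, m10, m01, m20, m11, m02, m30, m21, m12, m03) to cv::Moments,
// whose constructor fills in the central and normalized moments.
cv::Moments completeOnHost(const double m[10])
{
    return cv::Moments(m[0], m[1], m[2], m[3], m[4],
                       m[5], m[6], m[7], m[8], m[9]);
}
```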
@cv3d I should have a version of this for you to test by tomorrow.
Closed in favor of #3516
Pull Request Readiness Checklist
Patch to opencv_extra has the same branch name.