
Conversation

@fengyuentau
Member

Preliminary revision of the OpenCL backend.

force_builders=Linux OpenCL,Win64 OpenCL

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau fengyuentau added this to the 4.10.0 milestone Jan 10, 2024
@fengyuentau fengyuentau requested a review from dkurt January 10, 2024 08:50
@fengyuentau
Member Author

fengyuentau commented Jan 10, 2024

By the way, @dkurt, do you know when use_half can be true? See the following code:

bool forward_ocl(InputArrayOfArrays inps, OutputArrayOfArrays outs, InputArrayOfArrays internals)
{
    std::vector<UMat> inputs;
    std::vector<UMat> outputs;
    bool use_half = (inps.depth() == CV_16S);
    inps.getUMatVector(inputs);
    outs.getUMatVector(outputs);

I tested both on an i7-12700K and an M1 with WITH_OPENCL=ON, and use_half is always false regardless of whether the target is OCV/OCL or OCV/OCL_FP16. Also, inputs[0].depth() and inps.depth() are always CV_32F.
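For reference, the depth check above can be decoded with a small standalone sketch (this is an illustration, not OpenCV code). The constants below mirror OpenCV's type-depth enum, where CV_16S = 3 and CV_32F = 5; the DNN OpenCL path has historically stored FP16 blobs with the CV_16S depth code, which is why the logs later in this thread print depth()=5 on the FP32 path and depth()=3 on the FP16 path:

```python
# Mirror of OpenCV's type-depth enum values (illustration only).
CV_8U, CV_8S, CV_16U, CV_16S, CV_32S, CV_32F, CV_64F = range(7)

def uses_half(depth: int) -> bool:
    # Same test as in forward_ocl: FP16 blobs are tagged CV_16S,
    # so "is this half precision?" is a plain depth comparison.
    return depth == CV_16S

print(uses_half(CV_32F))  # OCV/OCL path, depth()=5 -> False
print(uses_half(CV_16S))  # OCV/OCL_FP16 path, depth()=3 -> True
```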

@opencv-alalek
Contributor

> and use_half is always false

This test has a call with use_half = true:

DNNTestNetwork.AlexNet/1, where GetParam() = OCV/OCL_FP16

@fengyuentau
Member Author

> and use_half is always false
>
> This test has a call with use_half = true:
>
> DNNTestNetwork.AlexNet/1, where GetParam() = OCV/OCL_FP16

(screenshot of test output omitted)

Not in my environment. Anything I missed?

@opencv-alalek
Contributor

Ensure that correct OpenCL device is selected (e.g. using OPENCV_OPENCL_DEVICE=":GPU:1"):

[ INFO:[email protected]] global ocl.cpp:1185 haveOpenCL Initialize OpenCL runtime...
[ INFO:[email protected]] global ocl.cpp:1191 haveOpenCL OpenCL: found 3 platforms
[ INFO:[email protected]] global ocl.cpp:983 getInitializedExecutionContext OpenCL: initializing thread execution context
[ INFO:[email protected]] global ocl.cpp:993 getInitializedExecutionContext OpenCL: creating new execution context...
[ INFO:[email protected]] global ocl.cpp:1011 getInitializedExecutionContext OpenCL: device=Intel(R) UHD Graphics 770
CTEST_FULL_OUTPUT
OpenCV version: 4.9.0-dev
OpenCV VCS version: 4.9.0-25-g8a950db4e9-dirty
Build type: Debug
Compiler: /usr/lib64/ccache/c++  (ver 13.2.1)
[ INFO:[email protected]] global registry_parallel.impl.hpp:96 ParallelBackendRegistry core(parallel): Enabled backends(3, sorted by priority): ONETBB(1000); TBB(990); OPENMP(980)
Parallel framework: pthreads (nthreads=2)
CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2 *AVX512-SKX?
Intel(R) IPP version: ippIP AVX2 (l9) 2021.10.0 (-) Sep 18 2023
Intel(R) IPP features code: 0x8000
OpenCL Platforms: 
    AMD Accelerated Parallel Processing
        dGPU: gfx1031 (OpenCL 2.0 )
    Intel(R) OpenCL Graphics
        iGPU: Intel(R) UHD Graphics 770 (OpenCL 3.0 NEO )
    Portable Computing Language
        CPU: cpu-12th Gen Intel(R) Core(TM) i7-12700K (OpenCL 3.0 PoCL HSTR: cpu-x86_64-redhat-linux-gnu-alderlake)
Current OpenCL device: 
    Type = iGPU
    Name = Intel(R) UHD Graphics 770
    Version = OpenCL 3.0 NEO 
    Driver version = 23.35.27191.9
    Address bits = 64
    Compute units = 32
    Max work group size = 512
    Local memory size = 64 KB
    Max memory allocation size = 3 GB 1023 MB 1016 KB
    Double support = No
    Half support = Yes
    Host unified memory = Yes
    Device extensions:
        cl_khr_byte_addressable_store
        cl_khr_device_uuid
        cl_khr_fp16
        cl_khr_global_int32_base_atomics
        cl_khr_global_int32_extended_atomics
        cl_khr_icd
        cl_khr_local_int32_base_atomics
        cl_khr_local_int32_extended_atomics
        cl_intel_command_queue_families
        cl_intel_subgroups
        cl_intel_required_subgroup_size
        cl_intel_subgroups_short
        cl_khr_spir
        cl_intel_accelerator
        cl_intel_driver_diagnostics
        cl_khr_priority_hints
        cl_khr_throttle_hints
        cl_khr_create_command_queue
        cl_intel_subgroups_char
        cl_intel_subgroups_long
        cl_khr_il_program
        cl_intel_mem_force_host_memory
        cl_khr_subgroup_extended_types
        cl_khr_subgroup_non_uniform_vote
        cl_khr_subgroup_ballot
        cl_khr_subgroup_non_uniform_arithmetic
        cl_khr_subgroup_shuffle
        cl_khr_subgroup_shuffle_relative
        cl_khr_subgroup_clustered_reduce
        cl_intel_device_attribute_query
        cl_khr_suggested_local_work_size
        cl_intel_split_work_group_barrier
        cl_intel_spirv_media_block_io
        cl_intel_spirv_subgroups
        cl_khr_spirv_linkonce_odr
        cl_khr_spirv_no_integer_wrap_decoration
        cl_intel_unified_shared_memory
        cl_khr_mipmap_image
        cl_khr_mipmap_image_writes
        cl_ext_float_atomics
        cl_khr_external_memory
        cl_intel_planar_yuv
        cl_intel_packed_yuv
        cl_khr_int64_base_atomics
        cl_khr_int64_extended_atomics
        cl_khr_image2d_from_buffer
        cl_khr_depth_images
        cl_khr_3d_image_writes
        cl_intel_media_block_io
        cl_intel_subgroup_local_block_io
        cl_khr_integer_dot_product
        cl_khr_gl_sharing
        cl_khr_gl_depth_images
        cl_khr_gl_event
        cl_khr_gl_msaa_sharing
        cl_intel_va_api_media_sharing
        cl_intel_sharing_format_query
        cl_khr_pci_bus_info
    Has AMD Blas = No
    Has AMD Fft = No
    Preferred vector width char = 16
    Preferred vector width short = 8
    Preferred vector width int = 4
    Preferred vector width long = 1
    Preferred vector width float = 1
    Preferred vector width double = 0
    Preferred vector width half = 8

Also ensure to use Intel compute runtime: https://github.com/intel/compute-runtime/releases
Alternative guidelines: https://dgpu-docs.intel.com/driver/installation.html

P.S. Avoid using screenshots with text information.
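As a side note, the OPENCV_OPENCL_DEVICE spec used above appears to follow a colon-separated "<platform>:<device type>:<device name or index>" layout, so ":GPU:1" means "any platform, GPU type, device 1". The parser below is a hypothetical sketch for illustration only; OpenCV's real parsing lives in ocl.cpp:

```python
def parse_opencl_device_spec(spec: str):
    """Split an OPENCV_OPENCL_DEVICE-style spec into its three fields.
    Hypothetical helper for illustration; not OpenCV's actual parser."""
    parts = spec.split(":")
    platform = parts[0] if len(parts) > 0 else ""
    dev_type = parts[1] if len(parts) > 1 else ""
    dev_name = parts[2] if len(parts) > 2 else ""
    return platform, dev_type, dev_name

print(parse_opencl_device_spec(":GPU:1"))        # ('', 'GPU', '1')
print(parse_opencl_device_spec("Intel:GPU:0"))   # ('Intel', 'GPU', '0')
```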

@fengyuentau
Member Author

fengyuentau commented Jan 10, 2024

It does have use_half=true on the default CI builder (Linux OpenCL):

[ RUN      ] DNNTestNetwork.AlexNet/0, where GetParam() = OCV/OCL
[ INFO:[email protected]] global ocl.cpp:5369 __init_buffer_pools OpenCL: Initializing buffer pool for context@0 with max capacity: poolSize=134217728 poolSizeHostPtr=134217728
[ INFO:[email protected]] global ocl.cpp:409 OpenCLBinaryCacheConfigurator Successfully initialized OpenCL cache directory: /build/.cache/opencv_opencl_cache_x64/
[ INFO:[email protected]] global ocl.cpp:433 prepareCacheDirectoryForContext Preparing OpenCL cache configuration for context: Intel_R__Corporation--Intel_R__UHD_Graphics_730__0x4682_--22_28_23726_1
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
[ INFO:[email protected]] global ts.cpp:857 testTearDown Memory_usage (OpenCL): 266418708 (base=0  current=266418708)
[       OK ] DNNTestNetwork.AlexNet/0 (787 ms)
[ RUN      ] DNNTestNetwork.AlexNet/1, where GetParam() = OCV/OCL_FP16
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3

Let me try to enable this in my environment tomorrow. Thanks for the instructions.

    cv::gemm(biasOneMat, newbias, 1, tmpTop, 1, tmpTop, 0);
    convertFp16(tmpTop, top);
} else {
    UMat biasOnesMat = UMat::ones(M_, 1, CV_32F);
Member

Is it correct to use ones for FP32 too? By the way, can you remind me why ones were used for FP16?

Member Author

> Is it correct to use ones for FP32 too?

If I am not mistaken, FP16 data is cast back to FP32 before calling cv::gemm(). This is done for FP16, and it should also work for FP32.
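To illustrate what that FP16 round-trip costs in precision, here is a standalone Python sketch using the struct module's IEEE 754 half-precision format code; it is analogous in spirit to converting a UMat down to FP16 and back, not a use of OpenCV's convertFp16 itself:

```python
import struct

def to_fp16_and_back(x: float) -> float:
    # Pack to IEEE 754 half precision ('e' format) and unpack again,
    # rounding x to the nearest representable half-precision value.
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16_and_back(1.0))  # 1.0 is exactly representable in FP16
print(to_fp16_and_back(0.1))  # 0.0999755859375 (rounded: only 10 fraction bits)
```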

> Can you remind me why ones were used for FP16?

To use cv::gemm() for bias addition, under the assumption that the bias has shape [N] or [1, N]:

Y = alpha * A * B + beta * C
=> with alpha = beta = 1: Y = A * B + C
=> with A = ones<M, 1>, B = bias<1, N>: Y = bias<M, N> + C<M, N>

It does not work if the bias has shape [M, 1] or [M, N], but OCL4DNNInnerProduct is currently only used by the InnerProduct layer in fully_connected_layer.cpp.
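The ones trick above can be sketched in plain Python (illustrative only; shapes and values are made up): multiplying A = ones<M, 1> by B = bias<1, N> replicates the bias row M times, so the single gemm call adds the bias to every row of C.

```python
# Y = A * B + C with alpha = beta = 1, A = ones<M, 1>, B = bias<1, N>.
M, N = 3, 2
bias = [[10.0, 20.0]]                        # B: shape [1, N]
C = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # C: shape [M, N]

ones = [[1.0] for _ in range(M)]             # A: ones<M, 1>
# A * B replicates the bias row across all M rows...
AB = [[ones[i][0] * bias[0][j] for j in range(N)] for i in range(M)]
# ...and adding C completes the bias addition in one gemm-shaped step.
Y = [[AB[i][j] + C[i][j] for j in range(N)] for i in range(M)]
print(Y)  # [[11.0, 22.0], [13.0, 24.0], [15.0, 26.0]]
```

As the comment above notes, this relies on the bias being a single row; a [M, 1] or [M, N] bias would need a different construction.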


I will open another pull request adding an OpenCL backend implementation for the Gemm layer in gemm_layer.cpp.

@fengyuentau
Member Author

fengyuentau commented Jan 11, 2024

> Ensure that correct OpenCL device is selected (e.g. using OPENCV_OPENCL_DEVICE=":GPU:1"): […]
>
> Also ensure to use Intel compute runtime: https://github.com/intel/compute-runtime/releases Alternative guidelines: https://dgpu-docs.intel.com/driver/installation.html
>
> P.S. Avoid using screenshots with text information.

I followed these steps and installed all these packages a while ago; it worked with the integrated GPU in the i7-12700K previously. It stopped working after I installed a GTX 1080Ti with CUDA 12 in the system.

I followed the same steps and re-installed everything, but it still does not work. It also looks like OCL_FP16 is not working correctly with OpenCL 3.0 CUDA: OpenCL 3.0 CUDA (and Apple OpenCL) still does not support cl_khr_fp16.

@fengyuentau
Member Author

@opencv-alalek Problem solved. The iGPU was automatically disabled by the BIOS once a discrete GPU (an NVIDIA GTX 1080Ti in my case) was installed. The solution is to enable the iGPU in the BIOS.

@asmorkalov asmorkalov merged commit 97c418a into opencv:4.x Jan 12, 2024
@fengyuentau fengyuentau deleted the ocl_innerproduct branch January 12, 2024 12:12
This was referenced Jan 19, 2024
