
Conversation

@fengyuentau
Member

Preliminary revision of the OpenCL backend.

force_builders=Linux OpenCL,Win64 OpenCL

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau fengyuentau added this to the 4.10.0 milestone Jan 10, 2024
@fengyuentau fengyuentau requested a review from dkurt January 10, 2024 08:50
@fengyuentau
Member Author

fengyuentau commented Jan 10, 2024

By the way, @dkurt, do you know when use_half can be true? See the following code:

bool forward_ocl(InputArrayOfArrays inps, OutputArrayOfArrays outs, InputArrayOfArrays internals)
{
    std::vector<UMat> inputs;
    std::vector<UMat> outputs;
    bool use_half = (inps.depth() == CV_16S);
    inps.getUMatVector(inputs);
    outs.getUMatVector(outputs);

I tested both on an i7-12700K and an M1 with WITH_OPENCL=ON, and use_half is always false regardless of whether the target is OCV/OCL or OCV/OCL_FP16. Also, inputs[0].depth() and inps.depth() are always CV_32F.
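For reference, the depth check above can be decoded with a small standalone sketch (this is an illustration, not OpenCV code). The constants below mirror OpenCV's type-depth enum, where CV_16S = 3 and CV_32F = 5; the DNN OpenCL path has historically stored FP16 blobs with the CV_16S depth code, which is why the logs later in this thread print depth()=5 on the FP32 path and depth()=3 on the FP16 path:

```python
# Mirror of OpenCV's type-depth enum values (illustration only).
CV_8U, CV_8S, CV_16U, CV_16S, CV_32S, CV_32F, CV_64F = range(7)

def uses_half(depth: int) -> bool:
    # Same test as in forward_ocl: FP16 blobs are tagged CV_16S,
    # so "is this half precision?" is a plain depth comparison.
    return depth == CV_16S

print(uses_half(CV_32F))  # OCV/OCL path, depth()=5 -> False
print(uses_half(CV_16S))  # OCV/OCL_FP16 path, depth()=3 -> True
```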

@opencv-alalek
Contributor

> and use_half is always false

This test has a call with use_half = true:

DNNTestNetwork.AlexNet/1, where GetParam() = OCV/OCL_FP16

@fengyuentau
Member Author

> and use_half is always false
>
> This test has a call with use_half = true:
>
> DNNTestNetwork.AlexNet/1, where GetParam() = OCV/OCL_FP16

(screenshot of test output omitted)

Not in my environment. Anything I missed?

@opencv-alalek
Contributor

Ensure that correct OpenCL device is selected (e.g. using OPENCV_OPENCL_DEVICE=":GPU:1"):

[ INFO:[email protected]] global ocl.cpp:1185 haveOpenCL Initialize OpenCL runtime...
[ INFO:[email protected]] global ocl.cpp:1191 haveOpenCL OpenCL: found 3 platforms
[ INFO:[email protected]] global ocl.cpp:983 getInitializedExecutionContext OpenCL: initializing thread execution context
[ INFO:[email protected]] global ocl.cpp:993 getInitializedExecutionContext OpenCL: creating new execution context...
[ INFO:[email protected]] global ocl.cpp:1011 getInitializedExecutionContext OpenCL: device=Intel(R) UHD Graphics 770
CTEST_FULL_OUTPUT
OpenCV version: 4.9.0-dev
OpenCV VCS version: 4.9.0-25-g8a950db4e9-dirty
Build type: Debug
Compiler: /usr/lib64/ccache/c++  (ver 13.2.1)
[ INFO:[email protected]] global registry_parallel.impl.hpp:96 ParallelBackendRegistry core(parallel): Enabled backends(3, sorted by priority): ONETBB(1000); TBB(990); OPENMP(980)
Parallel framework: pthreads (nthreads=2)
CPU features: SSE SSE2 SSE3 *SSE4.1 *SSE4.2 *FP16 *AVX *AVX2 *AVX512-SKX?
Intel(R) IPP version: ippIP AVX2 (l9) 2021.10.0 (-) Sep 18 2023
Intel(R) IPP features code: 0x8000
OpenCL Platforms: 
    AMD Accelerated Parallel Processing
        dGPU: gfx1031 (OpenCL 2.0 )
    Intel(R) OpenCL Graphics
        iGPU: Intel(R) UHD Graphics 770 (OpenCL 3.0 NEO )
    Portable Computing Language
        CPU: cpu-12th Gen Intel(R) Core(TM) i7-12700K (OpenCL 3.0 PoCL HSTR: cpu-x86_64-redhat-linux-gnu-alderlake)
Current OpenCL device: 
    Type = iGPU
    Name = Intel(R) UHD Graphics 770
    Version = OpenCL 3.0 NEO 
    Driver version = 23.35.27191.9
    Address bits = 64
    Compute units = 32
    Max work group size = 512
    Local memory size = 64 KB
    Max memory allocation size = 3 GB 1023 MB 1016 KB
    Double support = No
    Half support = Yes
    Host unified memory = Yes
    Device extensions:
        cl_khr_byte_addressable_store
        cl_khr_device_uuid
        cl_khr_fp16
        cl_khr_global_int32_base_atomics
        cl_khr_global_int32_extended_atomics
        cl_khr_icd
        cl_khr_local_int32_base_atomics
        cl_khr_local_int32_extended_atomics
        cl_intel_command_queue_families
        cl_intel_subgroups
        cl_intel_required_subgroup_size
        cl_intel_subgroups_short
        cl_khr_spir
        cl_intel_accelerator
        cl_intel_driver_diagnostics
        cl_khr_priority_hints
        cl_khr_throttle_hints
        cl_khr_create_command_queue
        cl_intel_subgroups_char
        cl_intel_subgroups_long
        cl_khr_il_program
        cl_intel_mem_force_host_memory
        cl_khr_subgroup_extended_types
        cl_khr_subgroup_non_uniform_vote
        cl_khr_subgroup_ballot
        cl_khr_subgroup_non_uniform_arithmetic
        cl_khr_subgroup_shuffle
        cl_khr_subgroup_shuffle_relative
        cl_khr_subgroup_clustered_reduce
        cl_intel_device_attribute_query
        cl_khr_suggested_local_work_size
        cl_intel_split_work_group_barrier
        cl_intel_spirv_media_block_io
        cl_intel_spirv_subgroups
        cl_khr_spirv_linkonce_odr
        cl_khr_spirv_no_integer_wrap_decoration
        cl_intel_unified_shared_memory
        cl_khr_mipmap_image
        cl_khr_mipmap_image_writes
        cl_ext_float_atomics
        cl_khr_external_memory
        cl_intel_planar_yuv
        cl_intel_packed_yuv
        cl_khr_int64_base_atomics
        cl_khr_int64_extended_atomics
        cl_khr_image2d_from_buffer
        cl_khr_depth_images
        cl_khr_3d_image_writes
        cl_intel_media_block_io
        cl_intel_subgroup_local_block_io
        cl_khr_integer_dot_product
        cl_khr_gl_sharing
        cl_khr_gl_depth_images
        cl_khr_gl_event
        cl_khr_gl_msaa_sharing
        cl_intel_va_api_media_sharing
        cl_intel_sharing_format_query
        cl_khr_pci_bus_info
    Has AMD Blas = No
    Has AMD Fft = No
    Preferred vector width char = 16
    Preferred vector width short = 8
    Preferred vector width int = 4
    Preferred vector width long = 1
    Preferred vector width float = 1
    Preferred vector width double = 0
    Preferred vector width half = 8

Also ensure to use Intel compute runtime: https://github.com/intel/compute-runtime/releases
Alternative guidelines: https://dgpu-docs.intel.com/driver/installation.html

P.S. Avoid using screenshots with text information.
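As a side note, the OPENCV_OPENCL_DEVICE spec used above appears to follow a colon-separated "<platform>:<device type>:<device name or index>" layout, so ":GPU:1" means "any platform, GPU type, device 1". The parser below is a hypothetical sketch for illustration only; OpenCV's real parsing lives in ocl.cpp:

```python
def parse_opencl_device_spec(spec: str):
    """Split an OPENCV_OPENCL_DEVICE-style spec into its three fields.
    Hypothetical helper for illustration; not OpenCV's actual parser."""
    parts = spec.split(":")
    platform = parts[0] if len(parts) > 0 else ""
    dev_type = parts[1] if len(parts) > 1 else ""
    dev_name = parts[2] if len(parts) > 2 else ""
    return platform, dev_type, dev_name

print(parse_opencl_device_spec(":GPU:1"))        # ('', 'GPU', '1')
print(parse_opencl_device_spec("Intel:GPU:0"))   # ('Intel', 'GPU', '0')
```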

@fengyuentau
Member Author

fengyuentau commented Jan 10, 2024

It does have use_half=true on the default CI builder (Linux OpenCL):

[ RUN      ] DNNTestNetwork.AlexNet/0, where GetParam() = OCV/OCL
[ INFO:[email protected]] global ocl.cpp:5369 __init_buffer_pools OpenCL: Initializing buffer pool for context@0 with max capacity: poolSize=134217728 poolSizeHostPtr=134217728
[ INFO:[email protected]] global ocl.cpp:409 OpenCLBinaryCacheConfigurator Successfully initialized OpenCL cache directory: /build/.cache/opencv_opencl_cache_x64/
[ INFO:[email protected]] global ocl.cpp:433 prepareCacheDirectoryForContext Preparing OpenCL cache configuration for context: Intel_R__Corporation--Intel_R__UHD_Graphics_730__0x4682_--22_28_23726_1
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
use_half=0, A.depth()=5, inputs_arr.depth()=5
[ INFO:[email protected]] global ts.cpp:857 testTearDown Memory_usage (OpenCL): 266418708 (base=0  current=266418708)
[       OK ] DNNTestNetwork.AlexNet/0 (787 ms)
[ RUN      ] DNNTestNetwork.AlexNet/1, where GetParam() = OCV/OCL_FP16
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3
use_half=1, A.depth()=3, inputs_arr.depth()=3

Let me try to enable this in my environment tomorrow. Thanks for the instructions.

    cv::gemm(biasOneMat, newbias, 1, tmpTop, 1, tmpTop, 0);
    convertFp16(tmpTop, top);
} else {
    UMat biasOnesMat = UMat::ones(M_, 1, CV_32F);
Member

Is it correct to use ones for FP32 too? By the way, can you remind me why ones were used for FP16?

Member Author

> Is it correct to use ones for FP32 too?

If I am not mistaken, FP16 data is cast back to FP32 before calling cv::gemm(). This is done for FP16, and it should also work for FP32.
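To illustrate what that FP16 round-trip costs in precision, here is a standalone Python sketch using the struct module's IEEE 754 half-precision format code; it is analogous in spirit to converting a UMat down to FP16 and back, not a use of OpenCV's convertFp16 itself:

```python
import struct

def to_fp16_and_back(x: float) -> float:
    # Pack to IEEE 754 half precision ('e' format) and unpack again,
    # rounding x to the nearest representable half-precision value.
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_fp16_and_back(1.0))  # 1.0 is exactly representable in FP16
print(to_fp16_and_back(0.1))  # 0.0999755859375 (rounded: only 10 fraction bits)
```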

> Can you remind me why ones were used for FP16?

To use cv::gemm() for bias addition, under the assumption that the bias has shape [N] or [1, N]:

Y = alpha * A * B + beta * C
=> with alpha = beta = 1: Y = A * B + C
=> with A = ones<M, 1>, B = bias<1, N>: Y = bias<M, N> + C<M, N>

It does not work if the bias has shape [M, 1] or [M, N], but OCL4DNNInnerProduct is currently only used by the InnerProduct layer in fully_connected_layer.cpp.
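The ones trick above can be sketched in plain Python (illustrative only; shapes and values are made up): multiplying A = ones<M, 1> by B = bias<1, N> replicates the bias row M times, so the single gemm call adds the bias to every row of C.

```python
# Y = A * B + C with alpha = beta = 1, A = ones<M, 1>, B = bias<1, N>.
M, N = 3, 2
bias = [[10.0, 20.0]]                        # B: shape [1, N]
C = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # C: shape [M, N]

ones = [[1.0] for _ in range(M)]             # A: ones<M, 1>
# A * B replicates the bias row across all M rows...
AB = [[ones[i][0] * bias[0][j] for j in range(N)] for i in range(M)]
# ...and adding C completes the bias addition in one gemm-shaped step.
Y = [[AB[i][j] + C[i][j] for j in range(N)] for i in range(M)]
print(Y)  # [[11.0, 22.0], [13.0, 24.0], [15.0, 26.0]]
```

As the comment above notes, this relies on the bias being a single row; a [M, 1] or [M, N] bias would need a different construction.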


I will open another pull request adding an OpenCL backend implementation for the Gemm layer in gemm_layer.cpp.

@fengyuentau
Member Author

fengyuentau commented Jan 11, 2024

> Ensure that correct OpenCL device is selected (e.g. using OPENCV_OPENCL_DEVICE=":GPU:1"): […]
>
> Also ensure to use Intel compute runtime: https://github.com/intel/compute-runtime/releases Alternative guidelines: https://dgpu-docs.intel.com/driver/installation.html
>
> P.S. Avoid using screenshots with text information.

I followed these steps and installed all these packages a while ago; it worked with the integrated GPU in the i7-12700K previously. It stopped working after I installed a GTX 1080Ti with CUDA 12 in the system.

I followed the same steps and re-installed everything, but it still does not work. It also looks like OCL_FP16 is not working correctly with OpenCL 3.0 CUDA: OpenCL 3.0 CUDA (and Apple OpenCL) still does not support cl_khr_fp16.

@fengyuentau
Member Author

@opencv-alalek Problem solved. The iGPU was automatically disabled by the BIOS once a discrete GPU (an NVIDIA GTX 1080Ti in my case) was installed. The solution is to enable the iGPU in the BIOS.

@asmorkalov asmorkalov merged commit 97c418a into opencv:4.x Jan 12, 2024
@fengyuentau fengyuentau deleted the ocl_innerproduct branch January 12, 2024 12:12
This was referenced Jan 19, 2024
