Removing the Global Lock in SiftGPUFeatureMatcher for CUDA backend #3561
Key Changes:
Removed Global Serialization Lock for CUDA: `sift_match_gpu_mutexes_` that blocked all CUDA matching operations
Explicit CUDA Initialization in Worker Thread Level using `cudaSetDevice()`
Improved Variable Naming for Clarity: `sift_match_gpu_mutexes_` to `sift_opengl_mutexes_`
Why Can We Remove The Global Mutex?
CUDA Context Lifecycle: The CUDA context lifecycle is managed automatically by the driver layer, allowing multiple threads to safely and independently allocate memory, create streams, and execute kernels on the same device.
Thread-Safe CUDA Operations: Common operations including memory allocation (`cudaMalloc`), data copying (`cudaMemcpy`), and stream creation/switching (`cudaStreamCreate`) are inherently thread-safe at the CUDA Runtime API level.
Per-Thread Default Stream Isolation: Under PTDS mode, each thread gets its own isolated default CUDA stream, eliminating the need for explicit synchronization between threads.
No Static Variable Dependencies: Analysis of SiftGPU/SiftMatchCU code confirmed that the matching operations do not rely on shared static variables that would require serialization protection.
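The locking change this section argues for can be sketched as follows. This is a minimal illustration, not the code from this PR: `Backend`, `GetOpenGLMutex`, and `LockIfOpenGL` are hypothetical names, and the only assumption carried over from the description is that a per-GPU mutex is still taken for the OpenGL backend while the CUDA backend takes none.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical backend tag, used only for this illustration.
enum class Backend { kCuda, kOpenGL };

// One mutex per GPU index, intended for the OpenGL path only
// (in the spirit of the renamed sift_opengl_mutexes_).
static std::mutex g_registry_mutex;
static std::map<int, std::unique_ptr<std::mutex>> g_opengl_mutexes;

// Returns the mutex for a GPU index, creating it on first use.
std::mutex* GetOpenGLMutex(int gpu_index) {
  assert(gpu_index >= 0);
  std::lock_guard<std::mutex> guard(g_registry_mutex);
  auto& slot = g_opengl_mutexes[gpu_index];
  if (!slot) slot = std::make_unique<std::mutex>();
  return slot.get();
}

// Returns a lock that is engaged only for the OpenGL backend.
// For CUDA the returned lock owns nothing, so matching is not serialized.
std::unique_lock<std::mutex> LockIfOpenGL(Backend backend, int gpu_index) {
  if (backend == Backend::kOpenGL) {
    return std::unique_lock<std::mutex>(*GetOpenGLMutex(gpu_index));
  }
  return std::unique_lock<std::mutex>();
}
```

A caller would hold the returned lock for the duration of one matching call. With `Backend::kCuda` the lock is empty, so several threads can match on the same GPU concurrently.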
Why Do We Add `SetBestCudaDevice(gpu_index);` in `FeatureMatcherWorker::Run()`?
Short answer: `cudaSetDevice(gpu_index)` will be called inside and we need it to bind CUDA context explicitly.
Thread-Local Device Context Requirements:
Per-Thread Device Binding: CUDA device selection is thread-local state. Each worker thread must explicitly set its target GPU device before performing any CUDA operations.
Multi-GPU Environment Support: In systems with multiple GPUs, different worker threads may be assigned to different devices. The `cudaSetDevice()` call ensures each thread operates on its designated GPU.
Early Initialization Timing: By setting the device at the beginning of `Run()`, we guarantee that all subsequent CUDA operations (SiftGPU initialization, memory allocation, kernel execution) occur on the correct device.
Context Warm-up: The `cudaFree(0)` call immediately after `cudaSetDevice()` serves as a context warm-up operation, ensuring the CUDA context is fully initialized before the worker begins processing.
🔒 Thread Safety Guarantees
🚀 Benefits
Example Use Case
When running under PTDS (Per-Thread Default Stream), each worker thread's default stream is already isolated. Combined with this lock removal, we achieve fully asynchronous, multi-threaded descriptor generation and matching, maximizing both CPU and GPU throughput.
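As a rough illustration of the worker-side setup described earlier, the sketch below models why each worker thread has to select its device itself. It deliberately does not call the CUDA runtime, so it builds without a GPU: the `fake_cuda` namespace is a stand-in invented for this example, and it only mimics the fact that the current device is per-thread state with a default of 0. `WorkerRun` and `RunWorkers` are likewise hypothetical names, not functions from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-ins for the CUDA runtime (NOT the real API). They model only that
// the "current device" is thread-local and starts at device 0.
namespace fake_cuda {
thread_local int current_device = 0;
inline void SetDevice(int device) { current_device = device; }  // models cudaSetDevice
inline int GetDevice() { return current_device; }               // models cudaGetDevice
}  // namespace fake_cuda

// Hypothetical worker body: bind the device first, then do the work.
int WorkerRun(int gpu_index) {
  assert(gpu_index >= 0);
  fake_cuda::SetDevice(gpu_index);
  // ... matcher creation and matching calls would follow here ...
  return fake_cuda::GetDevice();
}

// Starts one worker per entry and records the device each worker observed.
std::vector<int> RunWorkers(const std::vector<int>& gpu_indices) {
  std::vector<int> observed(gpu_indices.size(), -1);
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < gpu_indices.size(); ++i) {
    threads.emplace_back([&observed, &gpu_indices, i] {
      observed[i] = WorkerRun(gpu_indices[i]);
    });
  }
  for (auto& t : threads) t.join();
  return observed;
}
```

Calling `RunWorkers({0, 1, 1})` starts three workers, two of which share device 1. The launching thread's own device selection is unaffected, which is why the binding has to happen inside each worker and not once in the thread that creates them.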