
Conversation

@yimingc (Contributor) commented Aug 13, 2025

### **Key Changes:**

1. **Removed Global Serialization Lock for CUDA**
    * Eliminated `sift_match_gpu_mutexes_`, which blocked all CUDA matching operations
    * The CUDA version now runs completely lock-free during compute operations
2. **Explicit CUDA Initialization at the Worker Thread Level using `cudaSetDevice()`** (a sketch follows this list)
    * Each worker thread initializes its own CUDA context independently
    * Eliminates the need for complex per-instance initialization logic
3. **Improved Variable Naming for Clarity**
    * Renamed `sift_match_gpu_mutexes_` to `sift_opengl_mutexes_`
    * Updated comments to clarify the OpenGL-specific mutex usage
    * Removed misleading references to "all GPU implementations"
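For illustration, here is a minimal compilable sketch of the new lock scope. All names except `sift_opengl_mutexes_` are hypothetical stand-ins; this is not the verbatim COLMAP code.

```cpp
#include <mutex>
#include <vector>

// Hypothetical stand-ins for the matcher types involved.
struct Matches {};
struct SiftMatcher {
  void Match(Matches* /*out*/) { /* backend-specific matching */ }
};

// One mutex per GPU, created once at startup (renamed from
// sift_match_gpu_mutexes_ to reflect its OpenGL-only purpose).
std::vector<std::mutex> sift_opengl_mutexes_(4);

void RunMatching(SiftMatcher& matcher, bool use_opengl, int gpu_index,
                 Matches* matches) {
  if (use_opengl) {
    // OpenGL contexts cannot be used concurrently, so this backend
    // still serializes per GPU.
    std::lock_guard<std::mutex> lock(sift_opengl_mutexes_[gpu_index]);
    matcher.Match(matches);
  } else {
    // CUDA backend: the runtime API is thread-safe, so no lock is taken.
    matcher.Match(matches);
  }
}
```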

#### Why Can We Remove The Global Mutex?

* **CUDA Context Lifecycle**: The CUDA context lifecycle is managed automatically by the driver layer, allowing multiple threads to safely and independently allocate memory, create streams, and execute kernels on the same device. [link](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization)

* **Thread-Safe CUDA Operations**: Common operations, including memory allocation (`cudaMalloc`), data copying (`cudaMemcpy`), and stream creation/switching (`cudaStreamCreate`), are inherently thread-safe at the CUDA Runtime API level. [link](https://forums.developer.nvidia.com/t/cudahostregister-on-multiple-threads/296497?utm_source=chatgpt.com)

* **Per-Thread Default Stream Isolation**: Under PTDS mode, each thread gets its own isolated default CUDA stream, eliminating the need for explicit synchronization between threads (see the standalone demo after this list).

* **No Static Variable Dependencies**: Analysis of the SiftGPU/SiftMatchCU code confirmed that the matching operations do not rely on shared static variables that would require serialization protection.
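The PTDS point is easy to demonstrate in isolation. The following standalone CUDA sketch (a hypothetical demo, not COLMAP code) launches kernels from several CPU threads with no mutex; compiled with `nvcc --default-stream per-thread`, each thread's stream-0 work lands on its own per-thread default stream instead of a shared, implicitly synchronizing one.

```cpp
// Build: nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// (equivalent to defining CUDA_API_PER_THREAD_DEFAULT_STREAM before
// including cuda_runtime.h). Error checking omitted for brevity.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void Scale(float* x, int n, float s) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
}

void Worker() {
  const int n = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));            // thread-safe runtime call
  Scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);  // per-thread default stream
  cudaStreamSynchronize(0);  // waits only on this thread's own stream
  cudaFree(d);
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(Worker);
  for (auto& t : threads) t.join();  // no locks anywhere; kernels overlap
  return 0;
}
```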

#### **Why Do We Add `SetBestCudaDevice(gpu_index);` in `FeatureMatcherWorker::Run()`?**

Short answer: `SetBestCudaDevice` calls `cudaSetDevice(gpu_index)` internally, and we need that call to bind the CUDA context explicitly.

**Thread-Local Device Context Requirements:**

* **Per-Thread Device Binding**: CUDA device selection is thread-local state. Each worker thread must explicitly set its target GPU device before performing any CUDA operations.

* **Multi-GPU Environment Support**: In systems with multiple GPUs, different worker threads may be assigned to different devices. The `cudaSetDevice()` call ensures each thread operates on its designated GPU.

* **Early Initialization Timing**: By setting the device at the beginning of `Run()`, we guarantee that all subsequent CUDA operations (SiftGPU initialization, memory allocation, kernel execution) occur on the correct device.

* **Context Warm-up**: The `cudaFree(0)` call immediately after `cudaSetDevice()` serves as a context warm-up, ensuring the CUDA context is fully initialized before the worker begins processing. A condensed sketch follows this list.
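A condensed sketch of that init sequence (a hypothetical simplification; COLMAP's actual `SetBestCudaDevice` and `FeatureMatcherWorker::Run()` do more than this):

```cpp
#include <cuda_runtime.h>

// Hypothetical condensation of the sequence described above.
void SetBestCudaDevice(int gpu_index) {
  cudaSetDevice(gpu_index);  // device selection is thread-local state
  cudaFree(nullptr);         // warm-up: forces full context creation now
}

void FeatureMatcherWorkerRun(int gpu_index) {
  // Bind this worker thread to its GPU before any other CUDA call, so
  // SiftGPU init, allocations, and kernels all target the right device.
  SetBestCudaDevice(gpu_index);
  // ... construct the matcher and process the match queue ...
}
```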

### 🔒 **Thread Safety Guarantees**

* **Initialization Phase**: Protected by per-GPU mutexes during lazy setup (sketched below)
* **Compute Phase**: Lock-free parallel execution with dedicated CUDA streams
* **Instance Isolation**: Each thread operates on its own independent matcher instance
* **Stream Isolation**: PTDS ensures each thread's CUDA operations are automatically isolated
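As a hypothetical illustration of the first three points (all names invented for the sketch), initialization is the only place a lock is held; every matcher instance is thread-private, and matching itself never touches a mutex:

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical matcher; construction allocates GPU resources.
struct CudaSiftMatcher {
  explicit CudaSiftMatcher(int /*gpu_index*/) { /* lazy GPU setup */ }
  void Match() { /* lock-free kernel launches */ }
};

std::vector<std::mutex> init_mutexes(8);  // one per GPU, sized at startup

void WorkerLoop(int gpu_index) {
  std::unique_ptr<CudaSiftMatcher> matcher;  // instance isolation
  {
    // Initialization phase: serialized per GPU during lazy setup only.
    std::lock_guard<std::mutex> lock(init_mutexes[gpu_index]);
    matcher = std::make_unique<CudaSiftMatcher>(gpu_index);
  }
  // Compute phase: no lock; PTDS keeps each thread's stream isolated.
  matcher->Match();
}
```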

### 🚀 **Benefits**

1. **Eliminates Serialization Bottleneck**: Multiple threads can submit kernels concurrently
2. **Maximizes GPU Utilization**: True parallelism with PTDS integration
3. **Reduces CPU Overhead**: No lock contention in compute-heavy operations
4. **Maintains Safety**: Thread-safe initialization with zero runtime locking cost
5. **Cleaner Architecture**: Separation of initialization and compute concerns

### Example Use Case

When running under PTDS (Per-Thread Default Stream), each worker thread’s “default stream” is already isolated. Combined with this lock removal, we achieve fully asynchronous, multi-threaded descriptor generation and matching, maximizing both CPU and GPU throughput.
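Concretely, a hypothetical driver for this use case spawns one worker thread per visible GPU; compiled with `--default-stream per-thread`, their launches proceed concurrently with no cross-thread locking:

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void MatchWorker(int gpu_index) {
  cudaSetDevice(gpu_index);  // thread-local binding to this worker's GPU
  cudaFree(nullptr);         // context warm-up
  // ... run descriptor generation and matching on this device ...
}

int main() {
  int num_gpus = 0;
  cudaGetDeviceCount(&num_gpus);
  std::vector<std::thread> workers;
  for (int i = 0; i < num_gpus; ++i) workers.emplace_back(MatchWorker, i);
  for (auto& w : workers) w.join();
  return 0;
}
```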

@ahojnnes (Contributor) commented:

Thanks @yimingc for upstreaming these changes.

@ahojnnes enabled auto-merge (squash) August 14, 2025 08:10
@ahojnnes merged commit fc0afc1 into colmap:main Aug 14, 2025
13 checks passed
tavislocus pushed a commit to tavislocus/colmap_6dof that referenced this pull request Aug 19, 2025