BlockRadixRankMatch produces invalid results when warp size does not divide block size #552

As the title suggests, `BlockRadixRankMatch` may produce invalid results when the device warp size does not evenly divide the block size. This seems to be because the algorithm uses warp-level instructions that do not take the actual launch bounds into account. Specifically, this call to the `match.any` emulation also returns set bits for lanes that do not participate in the warp:

https://github.com/NVIDIA/cub/blob/5571258c6451340e212ba2576eab28fd63cd0fcf/cub/block/block_radix_rank.cuh#L701
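To illustrate the suspected mechanism, here is a minimal host-side model of a ballot-based `match.any` emulation. It is a sketch and not the CUB source: the names `ballot`, `match_any_emulated`, `match_any_masked` and `popcount32` are hypothetical. It assumes that absent lanes always contribute 0 to a ballot and that a lane whose bit is clear complements the ballot result. Under those assumptions the complement also sets the bits of absent lanes, and intersecting with the active-lane mask removes them.

```c++
#include <cstdint>

// Host-side model only; this is NOT the CUB implementation.
constexpr unsigned kWarpSize = 32;

// Models a warp ballot: bit i of the result is set if lane i is active
// and `bit` is set in its digit. Absent lanes always contribute 0.
inline std::uint32_t ballot(const std::uint32_t (&digits)[kWarpSize],
                            std::uint32_t active, unsigned bit) {
    std::uint32_t result = 0;
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        const bool is_active = ((active >> lane) & 1u) != 0;
        if (is_active && ((digits[lane] >> bit) & 1u)) {
            result |= (1u << lane);
        }
    }
    return result;
}

// Assumed emulation scheme: one ballot per digit bit. A lane whose bit is
// clear complements the ballot, which also sets the bits of absent lanes.
inline std::uint32_t match_any_emulated(const std::uint32_t (&digits)[kWarpSize],
                                        std::uint32_t active, unsigned lane,
                                        unsigned radix_bits) {
    std::uint32_t peers = 0xffffffffu;
    for (unsigned bit = 0; bit < radix_bits; ++bit) {
        std::uint32_t votes = ballot(digits, active, bit);
        if (((digits[lane] >> bit) & 1u) == 0) {
            votes = ~votes;
        }
        peers &= votes;  // drop lanes that differ in this bit
    }
    return peers;
}

// Possible mitigation in this model: keep only lanes that actually exist.
inline std::uint32_t match_any_masked(const std::uint32_t (&digits)[kWarpSize],
                                      std::uint32_t active, unsigned lane,
                                      unsigned radix_bits) {
    return match_any_emulated(digits, active, lane, radix_bits) & active;
}

// Counts set bits, i.e. the number of peers a lane believes it has.
inline int popcount32(std::uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}
```

With two active lanes holding digits 0 and 1, the model gives lane 0 a peer mask with 31 set bits, which would be consistent with the rank of 31 reported below; the masked variant leaves only lane 0 itself.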

This code reproduces the bug for me, on both Titan V and RTX 3090:
```c++
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_rank.cuh>
#include <cub/block/radix_rank_sort_operations.cuh>

#include <vector>
#include <iostream>

template<unsigned block_size, unsigned items_per_thread>
__global__ __launch_bounds__(block_size) void kernel(const unsigned* keys, int* ranks) {
    constexpr unsigned items_per_block = block_size * items_per_thread;
    const unsigned tid = threadIdx.x;
    const unsigned block_offset = blockIdx.x * items_per_block;

    unsigned thread_keys[items_per_thread];
    cub::LoadDirectWarpStriped(tid, keys + block_offset, thread_keys);

    cub::BFEDigitExtractor<unsigned> digit_extractor(0, 5);
    int thread_ranks[items_per_thread];

    using Ranker = cub::BlockRadixRankMatch<block_size, 5, false>;
    __shared__ typename Ranker::TempStorage storage;

    Ranker ranker(storage);
    ranker.RankKeys(thread_keys, thread_ranks, digit_extractor);

    cub::StoreDirectWarpStriped(tid, ranks + block_offset, thread_ranks);
}

int main() {
    constexpr unsigned size = 2; // Not a multiple of the warp size.
    std::vector<unsigned> keys = {0, 1};

    unsigned* d_keys;
    cudaMalloc(&d_keys, size * sizeof(unsigned));

    int* d_ranks;
    cudaMalloc(&d_ranks, size * sizeof(int));

    cudaMemcpy(d_keys, keys.data(), size * sizeof(unsigned), cudaMemcpyHostToDevice);

    (kernel<size, 1>)<<<1, size>>>(d_keys, d_ranks);

    cudaDeviceSynchronize();

    std::vector<int> ranks(size);
    cudaMemcpy(ranks.data(), d_ranks, size * sizeof(int), cudaMemcpyDeviceToHost);

    for (unsigned i = 0; i < size; ++i) {
        std::cout << "[" << i << "] " << keys[i] << " expected=" << i << " actual=" << ranks[i] << std::endl;
    }

    cudaFree(d_keys);
    cudaFree(d_ranks);
}
```
Output:
```
[0] 0 expected=0 actual=0
[1] 1 expected=1 actual=31
```
(Note that since this involves undefined data, I'm not sure whether the above always reproduces it.)
