This repository was archived by the owner on Mar 21, 2024. It is now read-only.

BlockRadixRankMatch produces invalid results when warp size does not divide block size #552

@Snektron

As the title suggests, BlockRadixRankMatch may produce invalid results when the device warp size does not divide the block size exactly. This seems to be because the algorithm uses warp-level instructions that do not take the actual launch bounds into account. Specifically, this call to the match.any emulation also returns set bits for lanes that do not participate in the warp:

uint32_t peer_mask = MatchAny<RADIX_BITS>(digit);
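
To illustrate how such a peer mask can pick up non-participating lanes, here is a minimal host-side sketch. It assumes the emulation builds the mask from one warp ballot per digit bit and complements the ballot for lanes whose bit is clear; that is an assumption about the implementation, not something stated above. All names (`ballot`, `match_any_emulated`, `active_lane_mask`) are hypothetical and only model the behavior on the CPU.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a 32-lane warp; not CUB code.
constexpr int kWarpThreads = 32;

// Models a warp ballot: only lanes that actually exist in the block
// (lane < active_lanes) can contribute a set bit.
std::uint32_t ballot(const std::vector<std::uint32_t>& digits,
                     int active_lanes, int bit) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < active_lanes; ++lane) {
        if ((digits[lane] >> bit) & 1u) {
            result |= (1u << lane);
        }
    }
    return result;
}

// Assumed emulation scheme: for each digit bit, take the ballot of lanes
// that have the bit set, complement it if the calling lane has the bit
// clear, and intersect the per-bit masks.
std::uint32_t match_any_emulated(const std::vector<std::uint32_t>& digits,
                                 int active_lanes, int lane, int radix_bits) {
    std::uint32_t peers = 0xffffffffu;
    for (int bit = 0; bit < radix_bits; ++bit) {
        std::uint32_t mask = ballot(digits, active_lanes, bit);
        const bool my_bit = ((digits[lane] >> bit) & 1u) != 0;
        if (!my_bit) {
            // The complement also sets the bits of lanes that never voted.
            mask = ~mask;
        }
        peers &= mask;
    }
    return peers;
}

// Mask with one bit per lane that exists in a (partial) warp.
std::uint32_t active_lane_mask(int active_lanes) {
    assert(active_lanes > 0 && active_lanes <= kWarpThreads);
    return active_lanes == kWarpThreads ? 0xffffffffu
                                        : ((1u << active_lanes) - 1u);
}

// Number of lanes a mask claims as peers.
int peer_count(std::uint32_t mask) {
    return static_cast<int>(std::bitset<32>(mask).count());
}
```

Under these assumptions, a block of two threads with digits 0 and 1 gives lane 0 the mask `0xfffffffd`, which claims 31 peers. That would be consistent with a rank of 31 for the key with digit 1, although this link is an inference. Intersecting the result with `active_lane_mask(2)` leaves only lane 0, which is one possible way to discard the phantom lanes.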

This code reproduces the bug for me, on both Titan V and RTX 3090:

#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_rank.cuh>
#include <cub/block/radix_rank_sort_operations.cuh>

#include <vector>
#include <iostream> // std::cout and std::endl are declared here, not in <ostream>

template<unsigned block_size, unsigned items_per_thread>
__global__ __launch_bounds__(block_size) void kernel(const unsigned* keys, int* ranks) {
    constexpr unsigned items_per_block = block_size * items_per_thread;
    const unsigned tid = threadIdx.x;
    const unsigned block_offset = blockIdx.x * items_per_block;

    unsigned thread_keys[items_per_thread];
    cub::LoadDirectWarpStriped(tid, keys + block_offset, thread_keys);

    cub::BFEDigitExtractor<unsigned> digit_extractor(0, 5);
    int thread_ranks[items_per_thread];

    using Ranker = cub::BlockRadixRankMatch<block_size, 5, false>;
    __shared__ typename Ranker::TempStorage storage;

    Ranker ranker(storage);
    ranker.RankKeys(thread_keys, thread_ranks, digit_extractor);

    cub::StoreDirectWarpStriped(tid, ranks + block_offset, thread_ranks);
}

int main() {
    constexpr unsigned size = 2; // Not a multiple of the warp size.
    std::vector<unsigned> keys = {0, 1};

    unsigned* d_keys;
    cudaMalloc(&d_keys, size * sizeof(unsigned));

    int* d_ranks;
    cudaMalloc(&d_ranks, size * sizeof(int));

    cudaMemcpy(d_keys, keys.data(), size * sizeof(unsigned), cudaMemcpyHostToDevice);

    (kernel<size, 1>)<<<1, size>>>(d_keys, d_ranks);

    cudaDeviceSynchronize();

    std::vector<int> ranks(size);
    cudaMemcpy(ranks.data(), d_ranks, size * sizeof(int), cudaMemcpyDeviceToHost);

    for (unsigned i = 0; i < size; ++i) { // unsigned index avoids a signed/unsigned comparison
        std::cout << "[" << i << "] " << keys[i] << " expected=" << i << " actual=" << ranks[i] << std::endl;
    }

    cudaFree(d_keys);
    cudaFree(d_ranks);
}

Output:

[0] 0 expected=0 actual=0
[1] 1 expected=1 actual=31

(Note that since this involves undefined data, I'm not sure whether the above always reproduces the bug.)

Labels: P0: must have (absolutely necessary; critical issue, major blocker, etc.); type: bug: functional (does not work as intended)

Status: Done