This repository was archived by the owner on Mar 21, 2024. It is now read-only.
BlockRadixRankMatch produces invalid results when warp size does not divide block size #552
Closed
Labels: P0: must have (Absolutely necessary. Critical issue, major blocker, etc.); type: bug: functional (Does not work as intended.)
Description
As the title suggests, when the device warp size does not divide the block size exactly, BlockRadixRankMatch may produce invalid results. This seems to be because the algorithm uses warp-level instructions that do not take the actual launch bounds into account. Specifically, this call to the match.any emulation also returns set bits for lanes that do not participate in the warp:
cub/cub/block/block_radix_rank.cuh, line 701 at commit 5571258:
uint32_t peer_mask = MatchAny<RADIX_BITS>(digit);
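To illustrate the suspected mechanism, here is a small host-side model in plain C++ (not the CUB code). It assumes the match.any emulation is built from per-bit ballots in which lanes outside the block vote 0, and that a lane whose digit bit is 0 takes the complement of the ballot. All names (`valid_lane_mask`, `ballot`, `match_any_emulated`, `popcount32`, `kDigits`) are hypothetical. Under those assumptions, the two-thread case from this report gives lane 0 a peer mask with 31 bits set, which would be consistent with the `actual=31` rank reported for key 1; masking with the lanes that really exist brings it back to one peer.

```cpp
#include <cstdint>

constexpr unsigned kWarpThreads = 32;  // assumed warp size

// Lanes of warp `warp_id` that actually belong to a block of `block_threads` threads.
constexpr std::uint32_t valid_lane_mask(unsigned block_threads, unsigned warp_id) {
    unsigned first = warp_id * kWarpThreads;
    if (first >= block_threads) return 0u;
    unsigned live = block_threads - first;
    return live >= kWarpThreads ? 0xffffffffu : ((1u << live) - 1u);
}

// Host model of a warp ballot: lanes outside `active` contribute a 0 vote (assumption).
constexpr std::uint32_t ballot(const unsigned* digits, std::uint32_t active, unsigned bit) {
    std::uint32_t votes = 0;
    for (unsigned lane = 0; lane < kWarpThreads; ++lane) {
        if (((active >> lane) & 1u) && ((digits[lane] >> bit) & 1u)) votes |= 1u << lane;
    }
    return votes;
}

// Ballot-based match-any model: keep the lanes that agree with `lane` on every digit bit.
// Taking ~votes for a 0 bit is what lets nonexistent lanes slip into the result.
constexpr std::uint32_t match_any_emulated(const unsigned* digits, std::uint32_t active,
                                           unsigned lane, unsigned radix_bits) {
    std::uint32_t peers = 0xffffffffu;
    for (unsigned bit = 0; bit < radix_bits; ++bit) {
        std::uint32_t votes = ballot(digits, active, bit);
        peers &= ((digits[lane] >> bit) & 1u) ? votes : ~votes;
    }
    return peers;
}

constexpr unsigned popcount32(std::uint32_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1u; ++n; }
    return n;
}

// Two live lanes with digits 0 and 1, as in the reproduction; other entries are unused.
constexpr unsigned kDigits[kWarpThreads] = {0, 1};
constexpr std::uint32_t kLane0Peers = match_any_emulated(kDigits, 0x3u, 0, 5);

static_assert(popcount32(kLane0Peers) == 31, "unmasked: 30 phantom peers plus lane 0");
static_assert((kLane0Peers & valid_lane_mask(2, 0)) == 0x1u, "masked: only lane 0 remains");
```

In this model, ANDing the peer mask with `valid_lane_mask(block_threads, warp_id)` before counting peers removes the phantom lanes. Whether that is the right fix for BlockRadixRankMatch itself is not established here.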
This code reproduces the bug for me, on both Titan V and RTX 3090:
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_rank.cuh>
#include <cub/block/radix_rank_sort_operations.cuh>
#include <vector>
#include <iostream>

template<unsigned block_size, unsigned items_per_thread>
__global__ __launch_bounds__(block_size) void kernel(const unsigned* keys, int* ranks) {
    constexpr unsigned items_per_block = block_size * items_per_thread;
    const unsigned tid = threadIdx.x;
    const unsigned block_offset = blockIdx.x * items_per_block;

    // Load this thread's keys.
    unsigned thread_keys[items_per_thread];
    cub::LoadDirectWarpStriped(tid, keys + block_offset, thread_keys);

    // Rank the keys on their lowest 5 bits.
    cub::BFEDigitExtractor<unsigned> digit_extractor(0, 5);
    int thread_ranks[items_per_thread];
    using Ranker = cub::BlockRadixRankMatch<block_size, 5, false>;
    __shared__ typename Ranker::TempStorage storage;
    Ranker ranker(storage);
    ranker.RankKeys(thread_keys, thread_ranks, digit_extractor);

    cub::StoreDirectWarpStriped(tid, ranks + block_offset, thread_ranks);
}

int main() {
    constexpr unsigned size = 2; // Not a multiple of the warp size.
    std::vector<unsigned> keys = {0, 1};

    unsigned* d_keys;
    cudaMalloc(&d_keys, size * sizeof(unsigned));
    int* d_ranks;
    cudaMalloc(&d_ranks, size * sizeof(int));
    cudaMemcpy(d_keys, keys.data(), size * sizeof(unsigned), cudaMemcpyHostToDevice);

    // One block of `size` threads, one item per thread.
    (kernel<size, 1>)<<<1, size>>>(d_keys, d_ranks);
    cudaDeviceSynchronize();

    std::vector<int> ranks(size);
    cudaMemcpy(ranks.data(), d_ranks, size * sizeof(int), cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < size; ++i) {
        std::cout << "[" << i << "] " << keys[i] << " expected=" << i << " actual=" << ranks[i] << std::endl;
    }

    cudaFree(d_keys);
    cudaFree(d_ranks);
}
Output:
[0] 0 expected=0 actual=0
[1] 1 expected=1 actual=31
(Note that since this involves undefined data, I'm not sure the code above always reproduces the bug.)