Optimize DeviceSegmentedRadixSort

While developing the [new segmented sort](https://github.com/NVIDIA/cub/pull/357), I extracted an `AgentSegmentedRadixSort` class. It is mostly based on the existing `DeviceSegmentedRadixSort` implementation, with two differences:
1) The `while (current_bit < end_bit)` loop is moved from the host to the device side.
2) If the segment data fits into shared memory, `BlockRadixSort` is used.
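
To make the two changes concrete, here is a minimal host-only C++ sketch of the control flow. It is not the CUB code: `sort_segment`, `radix_pass`, `segmented_sort`, `RADIX_BITS`, and `SMALL_SEGMENT_ITEMS` are hypothetical names and values chosen for illustration, `std::sort` stands in for `BlockRadixSort`, and the threshold stands in for the shared-memory capacity. The point is only that the per-segment routine owns the whole `current_bit` loop instead of the caller issuing one pass at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical values, for illustration only.
constexpr int RADIX_BITS = 8;                    // bits consumed per pass
constexpr int END_BIT = 32;                      // sort all bits of a 32-bit key
constexpr std::size_t SMALL_SEGMENT_ITEMS = 16;  // stand-in for "fits into shared memory"

// One stable counting-sort pass over keys[begin, end) on the digit
// starting at current_bit; results go to the same range of `out`.
void radix_pass(const std::vector<std::uint32_t>& in,
                std::vector<std::uint32_t>& out,
                std::size_t begin, std::size_t end,
                int current_bit, int pass_bits)
{
  const std::uint32_t mask = (1u << pass_bits) - 1u;
  std::vector<std::size_t> offsets((std::size_t{1} << pass_bits) + 1, 0);
  for (std::size_t i = begin; i < end; ++i)
    ++offsets[((in[i] >> current_bit) & mask) + 1];   // histogram
  for (std::size_t d = 1; d < offsets.size(); ++d)
    offsets[d] += offsets[d - 1];                     // exclusive prefix sum
  for (std::size_t i = begin; i < end; ++i)
    out[begin + offsets[(in[i] >> current_bit) & mask]++] = in[i];  // scatter
}

// Models the work done for one segment.
void sort_segment(std::vector<std::uint32_t>& keys,
                  std::vector<std::uint32_t>& tmp,
                  std::size_t begin, std::size_t end)
{
  assert(begin <= end && end <= keys.size() && tmp.size() == keys.size());

  // Change 2: a small segment is sorted in one shot.
  // std::sort is only a stand-in for BlockRadixSort here.
  if (end - begin <= SMALL_SEGMENT_ITEMS) {
    std::sort(keys.begin() + begin, keys.begin() + end);
    return;
  }

  // Change 1: the loop over digit passes lives inside the per-segment
  // routine rather than in the caller.
  bool result_in_keys = true;
  int current_bit = 0;
  while (current_bit < END_BIT) {
    const int pass_bits = std::min(RADIX_BITS, END_BIT - current_bit);
    if (result_in_keys) radix_pass(keys, tmp, begin, end, current_bit, pass_bits);
    else                radix_pass(tmp, keys, begin, end, current_bit, pass_bits);
    result_in_keys = !result_in_keys;
    current_bit += pass_bits;
  }
  if (!result_in_keys)  // odd number of passes: copy the result back
    std::copy(tmp.begin() + begin, tmp.begin() + end, keys.begin() + begin);
}

// Segment i covers keys[offsets[i], offsets[i + 1]).
void segmented_sort(std::vector<std::uint32_t>& keys,
                    const std::vector<std::size_t>& offsets)
{
  std::vector<std::uint32_t> tmp(keys.size());
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
    sort_segment(keys, tmp, offsets[s], offsets[s + 1]);
}
```

Calling `segmented_sort(keys, offsets)` with `offsets = {0, 4, 44}` sends the first segment through the small path and the second through the pass loop. On the GPU the segments would be processed in parallel rather than in a host loop, which this model does not attempt to show.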

Together, these changes give about a 6x speedup on RTX 3090 and up to 7x on RTX 2080 for segments with up to 5k elements. Unfortunately, large segments are also affected. Because the new code requires a different number of registers, the effect on performance is unpredictable: for some input data types and segment sizes I measured about a 14% improvement, while in a few cases I saw a 40% slowdown. The median speedup factor was around 0.996 (essentially no change), so more research is required.

Once the slowdowns for large segments are addressed, we should use `AgentSegmentedRadixSort` as the `DeviceSegmentedRadixSort` implementation.

Optimize DeviceSegmentedRadixSort #879
