Optimize DeviceSegmentedRadixSort #879

@gevtushenko

Description

During the development of the new segmented sort, I extracted an AgentSegmentedRadixSort class. It's mostly based on the existing DeviceSegmentedRadixSort implementation. The only differences are:

  1. The while (current_bit < end_bit) loop is moved from the host to the device side.
  2. If the segment data fits into shared memory, BlockRadixSort is used.
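
The two differences above can be modeled on the host in plain C++. This is only a sketch of the control flow, not the AgentSegmentedRadixSort code: `kSmallSegmentItems`, `kRadixBits`, `extract_bits`, `radix_pass`, `sort_segment`, and `segmented_sort` are hypothetical names, the fixed-size array stands in for shared memory, and `std::stable_sort` stands in for BlockRadixSort. It assumes 32-bit unsigned keys and 0 <= begin_bit < end_bit <= 32.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical limits: how many keys "fit into shared memory", and digit width per pass.
constexpr std::size_t kSmallSegmentItems = 8;
constexpr int kRadixBits = 4;

// Extract key bits [begin_bit, end_bit); assumes 0 <= begin_bit < end_bit <= 32.
inline std::uint32_t extract_bits(std::uint32_t key, int begin_bit, int end_bit) {
    const int width = end_bit - begin_bit;
    const std::uint32_t shifted = key >> begin_bit;
    return width >= 32 ? shifted : (shifted & ((1u << width) - 1u));
}

// One stable counting-sort pass over the digit at [current_bit, current_bit + bits).
inline void radix_pass(const std::uint32_t* in, std::uint32_t* out, std::size_t n,
                       int current_bit, int bits) {
    const std::size_t buckets = std::size_t{1} << bits;
    std::vector<std::size_t> offsets(buckets + 1, 0);
    for (std::size_t i = 0; i < n; ++i)
        ++offsets[extract_bits(in[i], current_bit, current_bit + bits) + 1];
    for (std::size_t d = 0; d < buckets; ++d)
        offsets[d + 1] += offsets[d];  // exclusive prefix sum of digit counts
    for (std::size_t i = 0; i < n; ++i)
        out[offsets[extract_bits(in[i], current_bit, current_bit + bits)]++] = in[i];
}

// Models the work done for one segment by one "agent".
inline void sort_segment(std::uint32_t* keys, std::size_t n, int begin_bit, int end_bit) {
    if (n <= kSmallSegmentItems) {
        // Small path: the whole segment fits in local storage, so sort it in one step.
        std::array<std::uint32_t, kSmallSegmentItems> local{};
        std::copy(keys, keys + n, local.begin());
        std::stable_sort(local.begin(), local.begin() + n,
                         [=](std::uint32_t a, std::uint32_t b) {
                             return extract_bits(a, begin_bit, end_bit) <
                                    extract_bits(b, begin_bit, end_bit);
                         });
        std::copy(local.begin(), local.begin() + n, keys);
        return;
    }
    // Large path: the loop over digit passes runs inside the per-segment routine
    // instead of being driven by a caller that launches one pass at a time.
    std::vector<std::uint32_t> buffer(n);
    std::uint32_t* in = keys;
    std::uint32_t* out = buffer.data();
    int current_bit = begin_bit;
    while (current_bit < end_bit) {
        const int bits = std::min(kRadixBits, end_bit - current_bit);
        radix_pass(in, out, n, current_bit, bits);
        std::swap(in, out);  // the output of this pass feeds the next one
        current_bit += bits;
    }
    if (in != keys) std::copy(in, in + n, keys);  // odd number of passes
}

// Segment s covers keys[offsets[s] .. offsets[s + 1]).
inline void segmented_sort(std::vector<std::uint32_t>& keys,
                           const std::vector<std::size_t>& offsets,
                           int begin_bit = 0, int end_bit = 32) {
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
        sort_segment(keys.data() + offsets[s], offsets[s + 1] - offsets[s],
                     begin_bit, end_bit);
}
```

Calling `segmented_sort(keys, offsets)` with offsets `{0, 3, 13, 14}` sorts three segments independently: the 3-element and 1-element segments take the small path, and the 10-element segment takes the multi-pass loop. On a GPU the outer loop over segments would correspond to one thread block per segment.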

Together, these changes give about a 6x speedup on RTX3090 and up to 7x on RTX2080 for segments with up to 5k elements. Unfortunately, large segments are also affected. Because the new code requires a different number of registers, the speedup or slowdown is unpredictable. For some input data types and segment sizes, I got about a 14% improvement. In a few cases, I noticed a 40% slowdown. Although the median speedup was around 0.996, more research is required.

Once the slowdowns in large-segment sorting are addressed, we should use AgentSegmentedRadixSort as the DeviceSegmentedRadixSort implementation.

Labels: cub (for all items related to CUB)