Description
During the development of the new segmented sort, I extracted an AgentSegmentedRadixSort class. It is mostly based on the existing DeviceSegmentedRadixSort implementation. The only differences are:
- the while (current_bit < end_bit) loop is moved from the host to the device side;
- if the segment data fits into shared memory, BlockRadixSort is used.
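The two differences above can be illustrated with a host-only C++ model of the control flow. This is a sketch under stated assumptions, not the actual AgentSegmentedRadixSort code: one call to the hypothetical sort_segment_sketch stands in for one thread block working on one segment, kSmallSegmentCapacity is an invented threshold standing in for "fits into shared memory", and std::stable_sort stands in for BlockRadixSort. The part that mirrors the description is that the digit-pass loop lives inside the per-segment routine instead of in the caller.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical values, chosen only for illustration.
constexpr std::size_t kSmallSegmentCapacity = 4096;  // "fits into shared memory"
constexpr int kRadixBits = 8;                        // bits consumed per digit pass

// Extracts bits [begin_bit, end_bit) of a key.
inline std::uint64_t key_bits(std::uint32_t key, int begin_bit, int end_bit) {
  const std::uint64_t mask = (std::uint64_t{1} << (end_bit - begin_bit)) - 1;
  return (static_cast<std::uint64_t>(key) >> begin_bit) & mask;
}

// Models the work of one thread block on one segment [first, last).
void sort_segment_sketch(std::uint32_t* first, std::uint32_t* last,
                         int begin_bit, int end_bit) {
  assert(0 <= begin_bit && begin_bit <= end_bit && end_bit <= 32);
  const std::size_t n = static_cast<std::size_t>(last - first);

  if (n <= kSmallSegmentCapacity) {
    // Small segment: a single in-place sort (stand-in for BlockRadixSort).
    std::stable_sort(first, last, [=](std::uint32_t a, std::uint32_t b) {
      return key_bits(a, begin_bit, end_bit) < key_bits(b, begin_bit, end_bit);
    });
    return;
  }

  // Large segment: the digit-pass loop runs here, inside the per-segment
  // routine, rather than in the caller that launches the work.
  std::vector<std::uint32_t> buffer(n);
  std::uint32_t* in = first;
  std::uint32_t* out = buffer.data();
  int current_bit = begin_bit;
  while (current_bit < end_bit) {
    const int pass_bits = std::min(kRadixBits, end_bit - current_bit);
    const std::size_t buckets = std::size_t{1} << pass_bits;
    std::array<std::size_t, (1u << kRadixBits) + 1> count{};

    // Histogram of the current digit, shifted by one slot for the scan.
    for (std::size_t i = 0; i < n; ++i)
      ++count[((in[i] >> current_bit) & (buckets - 1)) + 1];
    // Exclusive prefix sum: count[d] becomes the first output index of digit d.
    for (std::size_t b = 0; b < buckets; ++b) count[b + 1] += count[b];
    // Stable scatter into the alternate buffer.
    for (std::size_t i = 0; i < n; ++i)
      out[count[(in[i] >> current_bit) & (buckets - 1)]++] = in[i];

    std::swap(in, out);
    current_bit += pass_bits;
  }
  if (in != first) std::copy(in, in + n, first);
}

// Sorts every segment [offsets[s], offsets[s + 1]) of keys independently.
void segmented_sort_sketch(std::vector<std::uint32_t>& keys,
                           const std::vector<std::size_t>& offsets,
                           int begin_bit = 0, int end_bit = 32) {
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
    sort_segment_sketch(keys.data() + offsets[s], keys.data() + offsets[s + 1],
                        begin_bit, end_bit);
}
```

In this model the caller issues one call per segment regardless of the key width. In the host-side-loop variant, the caller would instead iterate over digit passes and touch every segment once per pass.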
The combination of these changes gives about a 6x speedup on RTX3090 and up to 7x on RTX2080 for segments with up to 5k elements. Unfortunately, the large-segment case is also affected. Since the new code requires a different number of registers, the speedup or slowdown is unpredictable. For some input data types and segment sizes, I got about a 14% improvement. In a few cases, I noticed a 40% slowdown. The median speedup was around 0.996, that is, roughly neutral, but more research is required.
Once the slowdowns for large segments are addressed, we should use AgentSegmentedRadixSort as the DeviceSegmentedRadixSort implementation.