Enable fp16 nonnative support for dynamic dispatch, make more ergonomic for static dispatch #200

sterrettm2 · 2025-04-24T21:36:22Z

This patch enables the non-avx512fp16 _Float16 sorting path in the dynamic dispatch logic and integrates it better into the static dispatch logic.
It is vastly faster than scalar (geomean about 96% less time), but a fair bit slower than the dedicated avx512fp16 code (geomean about 76% more time).
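
For readers unfamiliar with the dispatch order described above, here is a minimal sketch of the selection logic. Everything in it (`CpuFeatures`, `Fp16SortPath`, `select_fp16_sort_path`) is a hypothetical name made up for illustration, not the library's actual API, and it assumes the feature flags have already been filled in from CPUID by some other code.

```cpp
// Hypothetical feature flags; a real build would populate these from CPUID.
struct CpuFeatures {
    bool avx512_fp16; // native _Float16 instructions (SPR-class CPUs)
    bool avx512_icl;  // AVX-512 subset assumed sufficient for the non-native path
};

// The three candidate implementations for sorting _Float16 data.
enum class Fp16SortPath { NativeFp16, NonNativeAvx512, Scalar };

// Prefer the dedicated avx512fp16 code, then the non-native AVX-512 path,
// and fall back to scalar only when neither is available.
constexpr Fp16SortPath select_fp16_sort_path(CpuFeatures f)
{
    if (f.avx512_fp16) return Fp16SortPath::NativeFp16;
    if (f.avx512_icl) return Fp16SortPath::NonNativeAvx512;
    return Fp16SortPath::Scalar;
}
```

The ordering mirrors the benchmark results: the native path is fastest, the non-native path sits in the middle, and scalar is the last resort.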

Comparison to scalar

Benchmark                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9264         -0.9269          6368           468          6373           466
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9399         -0.9394         13394           804         13401           813
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9412         -0.9410         29552          1737         29560          1745
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9209         -0.9208         75451          5967         75463          5975
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9396         -0.9396        590828         35676        590792         35680
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9524         -0.9524      14506782        689933      14506540        689943
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9616         -0.9616     159229801       6113740     159217432       6113529
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9827         -0.9827    1739044872      30113868    1738990462      30113349
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9864         -0.9864   19075697953     259558909   19074535929     259512766
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9210         -0.9209          5579           441          5582           442
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9388         -0.9382         13171           806         13176           815
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9377         -0.9372         28020          1746         28025          1761
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9328         -0.9326         74876          5029         74879          5045
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9469         -0.9469        585484         31094        585483         31107
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9510         -0.9510      14291316        700636      14290814        700538
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9608         -0.9608     156622769       6146373     156621600       6145706
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9826         -0.9826    1729001307      30128303    1728922542      30127689
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9210         -0.9210        803746         63496        803743         63504
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9989         -0.9989        621695           656        621626           654
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9131         -0.9130        742937         64595        742908         64605
[scalarsort.*_Float16 vs. simdsort.*_Float16]_pvalue          0.0315          0.0315      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                              -0.9595         -0.9595             0             0             0             0

Comparison to AVX512_FP16

Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.6382         +0.6557           269           441           268           443
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3411         +0.3500           604           810           605           816
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7145         +0.7192          1014          1739          1017          1748
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.7644         +1.7643          2153          5951          2156          5959
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.8787         +1.8783         12278         35344         12282         35351
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7758         +0.7759        385721        684977        385712        684986
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3700         +0.3698       1368168       1874386       1368250       1874260
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.4479         +0.4479       5137910       7438989       5137603       7438745
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.1682         +0.1652      77774693      90857847      77385351      90173137
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.6533         +0.6521           267           441           268           442
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3211         +0.3280           612           809           613           814
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.5179         +0.5210          1141          1732          1143          1738
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.6678         +1.6679          2200          5868          2202          5876
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.5668         +1.5672         12788         32824         12788         32830
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7579         +0.7581        384179        675344        384134        675354
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.2080         +0.2081       1552580       1875503       1552298       1875373
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.4877         +0.4877       5052821       7516986       5052579       7516805
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.4458         +1.4462         25727         62922         25727         62934
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.5392         +0.5407           445           685           445           686
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.4454         +1.4455         25703         62854         25705         62862
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]_pvalue          0.5075          0.5075      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                                +0.7601         +0.7624             0             0             0             0

r-devulap

LGTM. The benchmark numbers on TGL show significant improvement for _Float16 and no regression on any other datatype.

r-devulap · 2025-04-29T20:34:49Z

src/avx512-16bit-qsort.hpp

        }
-        replace_inf_with_nan(arr, arrsize, nan_count, descending);
+        replace_inf_with_nan_fp16(


Why can't we continue to use replace_inf_with_nan?

The only difference between the two functions seems to be 0x7c01 vs. 0xFFFF, and 0xFFFF seems to fail tests.

I got rid of replace_inf_with_nan_fp16 by modifying replace_inf_with_nan to use 0x7c01 instead of 0xFFFF.
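
For context on the two constants: both 0x7c01 and 0xFFFF are NaN encodings in IEEE 754 binary16, but they differ in sign bit and payload. The sketch below only decodes the bit patterns; the helper names are made up for illustration, and it makes no claim about why 0xFFFF fails the tests.

```cpp
#include <cstdint>

// binary16 layout: 1 sign bit, 5 exponent bits (0x7C00), 10 mantissa bits (0x03FF).
constexpr bool is_fp16_nan(std::uint16_t bits)
{
    // NaN: exponent all ones and a non-zero mantissa.
    return (bits & 0x7C00u) == 0x7C00u && (bits & 0x03FFu) != 0;
}

constexpr bool is_fp16_inf(std::uint16_t bits)
{
    // Infinity: exponent all ones and a zero mantissa, either sign.
    return (bits & 0x7FFFu) == 0x7C00u;
}

constexpr bool fp16_sign_bit(std::uint16_t bits)
{
    return (bits & 0x8000u) != 0;
}

constexpr bool is_fp16_quiet_nan(std::uint16_t bits)
{
    // The top mantissa bit (0x0200) marks a quiet NaN.
    return is_fp16_nan(bits) && (bits & 0x0200u) != 0;
}

// 0x7c01: positive NaN with the quiet bit clear (smallest payload above +inf).
static_assert(is_fp16_nan(0x7C01) && !fp16_sign_bit(0x7C01) && !is_fp16_quiet_nan(0x7C01), "");
// 0xFFFF: NaN with the sign bit set and all mantissa bits set.
static_assert(is_fp16_nan(0xFFFF) && fp16_sign_bit(0xFFFF) && is_fp16_quiet_nan(0xFFFF), "");
// 0x7c00 is +inf, one below 0x7c01 when the bits are read as an unsigned integer.
static_assert(is_fp16_inf(0x7C00) && !is_fp16_nan(0x7C00), "");
```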

r-devulap · 2025-04-29T20:38:30Z

.github/workflows/c-cpp.yml

@@ -208,6 +208,9 @@ jobs:

    - name: Run test suite on SPR
      run: sde -spr -- ./builddir/testexe
+    - name: Run ICL fp16 tests
+      # Note: This filters for the _Float16 tests based on the number assigned to it, which could change in the future
+      run: sde -icx -- ./builddir/testexe --gtest_filter="*/simdsort/2*"


The np-multiarray-tgl job does test the float16 portion of the code on a TGL, but this is fine too.

r-devulap

LGTM. Thanks @sterrettm2 !

sterrettm2 and others added 8 commits April 24, 2025 14:28

Enable fp16 sorting without AVX512FP16

e0103be

Fix dispatch logic to use both ICL and SPR fp16

bbb7906

Add ASAN testing for the ICL fp16 code

8989221

formatting

2301a0b

Fix builds by avoiding _Float16

1faa106

Use macros to reduce duplicated code

9396adf

Format code

c710732

Workaround to avoid using an #ifdef block inside a C macro

74734d2

r-devulap reviewed Apr 29, 2025


r-devulap force-pushed the fp16_nonnative branch from 7547df8 to 2c39de4 Compare April 30, 2025 04:24

Get rid of replace_inf_with_nan_fp16

2c39de4

r-devulap approved these changes Apr 30, 2025


r-devulap merged commit 724e92e into intel:main Apr 30, 2025
20 of 25 checks passed
