Enable fp16 nonnative support for dynamic dispatch, make more ergonomic for static dispatch #200


Merged: 9 commits, Apr 30, 2025

Conversation

sterrettm2 (Contributor) commented:

This patch enables the non-avx512fp16 _Float16 sorting to be used by the dynamic dispatch logic, and integrates it better into the static dispatch logic.
It is vastly faster than scalar, but a fair bit slower than the dedicated avx512fp16 code.

Comparison to scalar
Benchmark                                                       Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9264         -0.9269          6368           468          6373           466
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9399         -0.9394         13394           804         13401           813
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9412         -0.9410         29552          1737         29560          1745
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9209         -0.9208         75451          5967         75463          5975
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9396         -0.9396        590828         35676        590792         35680
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9524         -0.9524      14506782        689933      14506540        689943
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9616         -0.9616     159229801       6113740     159217432       6113529
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9827         -0.9827    1739044872      30113868    1738990462      30113349
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9864         -0.9864   19075697953     259558909   19074535929     259512766
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9210         -0.9209          5579           441          5582           442
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9388         -0.9382         13171           806         13176           815
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9377         -0.9372         28020          1746         28025          1761
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9328         -0.9326         74876          5029         74879          5045
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9469         -0.9469        585484         31094        585483         31107
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9510         -0.9510      14291316        700636      14290814        700538
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9608         -0.9608     156622769       6146373     156621600       6145706
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9826         -0.9826    1729001307      30128303    1728922542      30127689
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9210         -0.9210        803746         63496        803743         63504
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9989         -0.9989        621695           656        621626           654
[scalarsort.*_Float16 vs. simdsort.*_Float16]                -0.9131         -0.9130        742937         64595        742908         64605
[scalarsort.*_Float16 vs. simdsort.*_Float16]_pvalue          0.0315          0.0315      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                              -0.9595         -0.9595             0             0             0             0

Comparison to AVX512_FP16
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.6382         +0.6557           269           441           268           443
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3411         +0.3500           604           810           605           816
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7145         +0.7192          1014          1739          1017          1748
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.7644         +1.7643          2153          5951          2156          5959
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.8787         +1.8783         12278         35344         12282         35351
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7758         +0.7759        385721        684977        385712        684986
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3700         +0.3698       1368168       1874386       1368250       1874260
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.4479         +0.4479       5137910       7438989       5137603       7438745
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.1682         +0.1652      77774693      90857847      77385351      90173137
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.6533         +0.6521           267           441           268           442
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.3211         +0.3280           612           809           613           814
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.5179         +0.5210          1141          1732          1143          1738
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.6678         +1.6679          2200          5868          2202          5876
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.5668         +1.5672         12788         32824         12788         32830
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.7579         +0.7581        384179        675344        384134        675354
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.2080         +0.2081       1552580       1875503       1552298       1875373
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.4877         +0.4877       5052821       7516986       5052579       7516805
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.4458         +1.4462         25727         62922         25727         62934
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +0.5392         +0.5407           445           685           445           686
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]                +1.4454         +1.4455         25703         62854         25705         62862
[.*simdsort.*_Float16 vs. .*simdsort.*_Float16]_pvalue          0.5075          0.5075      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                                +0.7601         +0.7624             0             0             0             0

@r-devulap (Contributor) left a comment:
LGTM. The benchmark numbers on TGL show a significant improvement for _Float16 and no regression on any other datatype.

}
replace_inf_with_nan(arr, arrsize, nan_count, descending);
replace_inf_with_nan_fp16(
Contributor:

Why can't we continue to use replace_inf_with_nan?

Contributor:

The only difference between the two functions seems to be 0x7c01 vs. 0xFFFF, and 0xFFFF seems to fail tests.

Contributor:

I got rid of replace_inf_with_nan_fp16 by modifying replace_inf_with_nan to use 0x7c01 instead of 0xFFFF.

@@ -208,6 +208,9 @@ jobs:

- name: Run test suite on SPR
run: sde -spr -- ./builddir/testexe
- name: Run ICL fp16 tests
# Note: This filters for the _Float16 tests based on the number assigned to it, which could change in the future
run: sde -icx -- ./builddir/testexe --gtest_filter="*/simdsort/2*"
Contributor:

The np-multiarray-tgl job does test the float16 portion of the code on a TGL, but this is fine too.

@r-devulap (Contributor) left a comment:

LGTM. Thanks @sterrettm2 !

@r-devulap r-devulap merged commit 724e92e into intel:main Apr 30, 2025
20 of 25 checks passed