Vector algorithms with AVX2 masked stores and AMD processors #5068

The benchmark results in https://github.com/microsoft/STL/pull/5062#issuecomment-2460623213 seem to confirm that #5062 is a pessimization on AMD processors.

AVX2 masked store [timings](https://uops.info/table.html?search=maskmov%20M256%2C&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ICL=on&cb_ADLP=on&cb_ADLE=on&cb_ZENp=on&cb_ZEN4=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on) are bad on recent AMD processors.

In addition to the algorithm currently in review (#5062), we have one that has already been accepted (#4554).

Questions:
 * Should #5062 be closed? Or is it still acceptable when the gain on one vendor is somewhat larger than the loss on the other?
 * Should #4554 be reevaluated on an AMD processor? (The `#define _USE_STD_VECTOR_ALGORITHMS 0` escape hatch can be used to simulate the "before" state.) It is less likely to make things worse, as the vectorization advantage was bigger there.
 * Is it right that we don't do vendor detection using the `cpuid` instruction?
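
For reference, here is a minimal sketch of what vendor detection via `cpuid` could look like. It is not code from the STL; `cpu_vendor` and `is_amd` are hypothetical helper names, and the sketch assumes either the MSVC `__cpuid` intrinsic from `<intrin.h>` or the GCC/Clang `__get_cpuid` helper from `<cpuid.h>`. On non-x86 targets it simply reports an empty vendor string.

```cpp
#include <cstring>
#include <string>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_HAS_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID 1
#else
#define SKETCH_HAS_CPUID 0
#endif

// Hypothetical helper: returns the 12-character vendor string reported by
// CPUID leaf 0, or an empty string when CPUID is not available.
std::string cpu_vendor() {
#if SKETCH_HAS_CPUID
    unsigned int regs[4] = {0, 0, 0, 0}; // EAX, EBX, ECX, EDX
#if defined(_MSC_VER)
    int out[4];
    __cpuid(out, 0);
    for (int i = 0; i < 4; ++i) {
        regs[i] = static_cast<unsigned int>(out[i]);
    }
#else
    if (!__get_cpuid(0, &regs[0], &regs[1], &regs[2], &regs[3])) {
        return {};
    }
#endif
    // The vendor string is stored in EBX, EDX, ECX, in that order.
    char vendor[12];
    std::memcpy(vendor + 0, &regs[1], 4);
    std::memcpy(vendor + 4, &regs[3], 4);
    std::memcpy(vendor + 8, &regs[2], 4);
    return std::string(vendor, 12);
#else
    return {};
#endif
}

// Hypothetical helper: true when the vendor string is AMD's.
bool is_amd() {
    return cpu_vendor() == "AuthenticAMD";
}
```

A dispatcher could call `is_amd()` once and cache the result, then choose between a masked-store path and another path. Whether vendor-based dispatch is desirable at all is exactly the open question above.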

----

Note that we also use masked loads, but I have no concerns about them:
 * They are bad only on AMD processors before Zen 2, see [timings](https://uops.info/table.html?search=maskmov%20M256&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ICL=on&cb_ZENp=on&cb_ZEN2=on&cb_ZEN3=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on).
 * They are used to process tails, not the whole range.
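
To illustrate what "process tails" means, below is a rough sketch, not the actual STL implementation. The function `add_to_all` and its mask table are hypothetical; the loop handles full 8-element blocks with ordinary unaligned loads and stores, and only the final partial block goes through `_mm256_maskload_epi32` and `_mm256_maskstore_epi32`. It assumes compilation with AVX2 enabled (for example `/arch:AVX2` or `-mavx2`) and falls back to a scalar loop otherwise.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical example: adds `delta` to every element of data[0, n).
void add_to_all(std::int32_t* data, std::size_t n, std::int32_t delta) {
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i vdelta = _mm256_set1_epi32(delta);

    // Main loop: full 256-bit blocks, no masking involved.
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        v = _mm256_add_epi32(v, vdelta);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(data + i), v);
    }

    // Tail: 1 to 7 remaining elements, handled with one masked load and store.
    const std::size_t tail = n - i;
    if (tail != 0) {
        // Eight all-ones lanes followed by eight zero lanes; reading 8 lanes
        // starting at offset (8 - tail) yields `tail` active lanes.
        static const std::int32_t mask_src[16] = {
            -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0};
        const __m256i mask =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(mask_src + 8 - tail));

        __m256i v = _mm256_maskload_epi32(reinterpret_cast<const int*>(data + i), mask);
        v = _mm256_add_epi32(v, vdelta);
        _mm256_maskstore_epi32(reinterpret_cast<int*>(data + i), mask, v);
    }
#else
    // Scalar fallback when AVX2 is not enabled at compile time.
    for (; i < n; ++i) {
        data[i] += delta;
    }
#endif
}
```

In this shape, the masked instructions execute at most once per call, so their per-instruction cost matters much less than in an algorithm that issues a masked store on every iteration.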

