Closed
Labels: question (further information is requested), resolved (successfully resolved without a commit)
Description
The benchmark results #5062 (comment) seem to confirm that #5062 is a pessimization for AMD.
AVX2 mask store timings are bad on recent AMDs.
In addition to the algorithm currently in review, we have already accepted one.
Questions:
- Should "Vectorize `remove_copy` for 4 and 8 byte elements" #5062 be closed? Or is it still fine to optimize for one vendor somewhat more than we pessimize the other?
- Should "vectorize `replace` 🎭" #4554 be reevaluated on an AMD? (The `#define _USE_STD_VECTOR_ALGORITHMS 0` escape can be used to simulate the "before" state.) It is less likely that it makes things worse, as the vectorization advantage was bigger there.
- Is it right that we don't do vendor detection using the `cpuid` instruction?
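For reference on the last question, here is a minimal sketch of what vendor detection via `cpuid` leaf 0 could look like. This is not code from the STL; `cpu_vendor`, `is_amd`, and the `SKETCH_HAVE_CPUID` macro are hypothetical names made up for illustration, and the sketch assumes either MSVC's `__cpuid` from `<intrin.h>` or GCC/Clang's `__get_cpuid` from `<cpuid.h>`. On non-x86 targets it just reports "unknown".

```cpp
#include <cstring>
#include <string>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_HAVE_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

// Hypothetical helper (not part of the STL): returns the 12-character
// vendor string from cpuid leaf 0, or "unknown" if cpuid is unavailable.
std::string cpu_vendor() {
#if SKETCH_HAVE_CPUID
    unsigned int regs[4] = {0, 0, 0, 0}; // EAX, EBX, ECX, EDX
#if defined(_MSC_VER)
    int info[4] = {0, 0, 0, 0};
    __cpuid(info, 0);
    for (int i = 0; i < 4; ++i) {
        regs[i] = static_cast<unsigned int>(info[i]);
    }
#else
    if (!__get_cpuid(0, &regs[0], &regs[1], &regs[2], &regs[3])) {
        return "unknown";
    }
#endif
    // The vendor string is stored in EBX, EDX, ECX, in that order.
    char vendor[12];
    std::memcpy(vendor + 0, &regs[1], 4);
    std::memcpy(vendor + 4, &regs[3], 4);
    std::memcpy(vendor + 8, &regs[2], 4);
    return std::string(vendor, sizeof(vendor));
#else
    return "unknown";
#endif
}

// Hypothetical predicate a dispatcher could consult before choosing
// a mask-store based code path.
bool is_amd() {
    return cpu_vendor() == "AuthenticAMD";
}
```

A vendor check like this is coarse: it would not distinguish AMD generations, so a real decision would probably also need family/model information.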
Note that we also use masked loads, but I don't have concerns about them:
- They are bad only on AMDs before Zen 2, see timings
- They are used to process tails, not the whole range
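To illustrate the "tails only" point, here is a rough sketch of the pattern, assuming AVX2 is enabled at compile time (`__AVX2__` defined). It is not the STL's implementation; `sum_i32` is a hypothetical function chosen only because a sum is easy to check. Full vectors use plain unaligned loads, and a single `_mm256_maskload_epi32` handles the leftover elements. Without AVX2 the function falls back to a scalar loop.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical example: sums 32-bit integers, using a masked load
// only for the final partial vector (the tail).
std::int32_t sum_i32(const std::int32_t* data, std::size_t n) {
    std::int32_t total = 0;
    std::size_t i = 0;
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    // Main loop: whole 8-element vectors, ordinary unaligned loads.
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_epi32(
            acc, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)));
    }
    const std::size_t tail = n - i;
    if (tail != 0) {
        // Lane k is active iff k < tail; inactive lanes load as zero.
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i mask =
            _mm256_cmpgt_epi32(_mm256_set1_epi32(static_cast<int>(tail)), lane);
        acc = _mm256_add_epi32(
            acc, _mm256_maskload_epi32(reinterpret_cast<const int*>(data + i), mask));
        i = n;
    }
    // Horizontal sum of the eight accumulator lanes.
    alignas(32) std::int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    for (int k = 0; k < 8; ++k) {
        total += lanes[k];
    }
#endif
    // Scalar fallback (handles everything when AVX2 is not enabled).
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Here the masked load runs at most once per call, so a slow masked load costs little compared with an algorithm that issues a masked store on every iteration.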