Vector algorithms with AVX2 masked stores and AMD processors #5068

@AlexGuteniev

Description

The benchmark results in #5062 (comment) seem to confirm that #5062 is a pessimization on AMD processors.

AVX2 masked store timings are bad on recent AMD processors.
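
For context, a masked store writes only the vector lanes whose mask element has its top bit set and leaves the other destination elements untouched. The sketch below is not the STL's implementation. `store_masked_i32` is a hypothetical helper that uses `_mm256_maskstore_epi32` when the translation unit is compiled with AVX2 enabled, and falls back to an equivalent scalar loop otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (not part of the STL): for each of 8 lanes, writes
// src[i] to dst[i] only if mask[i] has its top bit set; other dst lanes
// keep their previous values.
inline void store_masked_i32(std::int32_t* dst, const std::int32_t* src,
                             const std::int32_t* mask) {
#if defined(__AVX2__)
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    const __m256i m = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(mask));
    // This is the instruction class whose timings the issue is about.
    _mm256_maskstore_epi32(reinterpret_cast<int*>(dst), m, v);
#else
    // Scalar fallback with the same observable effect.
    for (std::size_t i = 0; i < 8; ++i) {
        if (mask[i] < 0) {
            dst[i] = src[i];
        }
    }
#endif
}
```

An algorithm like remove_copy could use such a store to write a variable number of kept elements per iteration, which is why the store cost matters for the whole range rather than just the tail.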

In addition to the algorithm currently in review, we have one that has already been accepted.

Questions:

  • Should Vectorize remove_copy for 4 and 8 byte elements #5062 be closed? Or is it still fine if the gain for one vendor is somewhat larger than the pessimization for the other?
  • Should vectorize replace 🎭 #4554 be reevaluated on an AMD processor? (The #define _USE_STD_VECTOR_ALGORITHMS 0 escape hatch can be used to simulate the "before" state.) It is less likely to make things worse there, as the vectorization advantage was bigger.
  • Is it right that we don't do vendor detection using the cpuid instruction?
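
To make the last question concrete, here is a minimal sketch of what vendor detection via cpuid could look like. It is only an illustration, not something the STL is known to do. The names `vendor_from_regs`, `cpu_vendor`, `is_amd`, and `SKETCH_HAS_CPUID` are hypothetical. CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX, in that order.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_HAS_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_HAS_CPUID 1
#else
#define SKETCH_HAS_CPUID 0
#endif

// Assembles the vendor string from the three registers of CPUID leaf 0.
// Assumes a little-endian host, which holds on x86.
inline std::string vendor_from_regs(std::uint32_t ebx, std::uint32_t edx,
                                    std::uint32_t ecx) {
    char buf[12];
    std::memcpy(buf + 0, &ebx, 4);
    std::memcpy(buf + 4, &edx, 4);
    std::memcpy(buf + 8, &ecx, 4);
    return std::string(buf, sizeof(buf));
}

// Returns the vendor string of the current CPU, or an empty string on
// targets where this sketch has no CPUID access.
inline std::string cpu_vendor() {
#if SKETCH_HAS_CPUID && defined(_MSC_VER)
    int r[4] = {};
    __cpuid(r, 0); // r = {EAX, EBX, ECX, EDX}
    return vendor_from_regs(static_cast<std::uint32_t>(r[1]),
                            static_cast<std::uint32_t>(r[3]),
                            static_cast<std::uint32_t>(r[2]));
#elif SKETCH_HAS_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {
        return {};
    }
    return vendor_from_regs(ebx, edx, ecx);
#else
    return {};
#endif
}

inline bool is_amd(const std::string& vendor) {
    return vendor == "AuthenticAMD";
}
```

A dispatcher could call `is_amd(cpu_vendor())` once and cache the result. Whether keying on the vendor rather than on specific microarchitectures is the right granularity is a separate design question.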

Note that we also use masked loads, but I don't have concerns about them:

  • They are bad only on AMDs before Zen 2, see timings
  • They are used to process tails, not the whole range
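
As an illustration of tail processing with a masked load, the sketch below sums the last few elements of a range without reading past its end. It is not taken from the STL sources. `sum_tail_i32` and the mask table are hypothetical, and the AVX2 path is only compiled when AVX2 is enabled for the translation unit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: sums n (at most 8) elements starting at p,
// touching only those n elements.
inline std::int32_t sum_tail_i32(const std::int32_t* p, std::size_t n) {
    assert(n <= 8);
#if defined(__AVX2__)
    // Reading 8 values starting at table + 8 - n gives n all-ones lanes
    // followed by 8 - n zero lanes.
    static const std::int32_t table[16] = {-1, -1, -1, -1, -1, -1, -1, -1,
                                           0,  0,  0,  0,  0,  0,  0,  0};
    const __m256i mask =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(table + 8 - n));
    // Lanes with a zero mask are not read from memory and come back as 0.
    const __m256i v = _mm256_maskload_epi32(reinterpret_cast<const int*>(p), mask);
    alignas(32) std::int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), v);
    std::int32_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += lanes[i];
    }
    return sum;
#else
    // Scalar fallback with the same result.
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
#endif
}
```

Because this runs once per call rather than once per vector of input, a slow masked load here costs far less than a slow masked store in the main loop.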

    Labels

    question (Further information is requested), resolved (Successfully resolved without a commit)
