Update rocm_agent_enumerator to better handle numerous parallel usages #47
`rocm_agent_enumerator` currently calls `rocminfo` to find what gfx architectures are available on the current system. This is used by, for instance, compilers that want to query what to natively build for if they are not provided with a gfxarch target.

However, `rocminfo` is a very heavyweight method of getting the gfxarch. It queries a large amount of HSA topology information, and opens up `/dev/kfd` for various querying purposes. This can make builds slow, as each query takes a long time just to return the gfxarch.

In addition, it's possible to do a large number of parallel builds (e.g. `make -j`, even when targeting the number of processors on large server systems). `/dev/kfd` has a limited number of concurrent users, meaning that it can quickly exhaust its resources. This can lead to incorrect compilations, because no gfxarch would be returned from `rocminfo`.

`rocm_agent_enumerator` is supposed to have a fallback path when `rocminfo` finds no GPUs. It uses `lspci` to find AMD GPU device numbers, then looks them up in a hard-coded table. However, this table is woefully out of date, and the call to `lspci` is broken anyway. So `rocm_agent_enumerator` would simply fail to return a gfxarch if `rocminfo` failed to return that gfxarch.

This patchset:

- Fixes the call to `lspci` so that it actually works.
- Adds `lspci` to the dependency list so that we don't end up shipping Docker containers that don't include proper tools.
- Calls `lspci` first, and only falls back to the heavyweight `rocminfo` if our PCI ID list falls out of date.
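
As a rough illustration of the `lspci`-first ordering described above, here is a minimal Python sketch. It is not the code from this patchset: the function names, the three-entry table, and the choice of `lspci -n -d 1002:` are assumptions made for the example, and the table is an illustrative subset rather than the list shipped with `rocm_agent_enumerator`.

```python
import subprocess

AMD_VENDOR_ID = "1002"  # PCI vendor ID for AMD/ATI devices

# Illustrative subset only (assumption); the real tool ships its own table.
PCI_ID_TO_GFXARCH = {
    "66a1": "gfx906",
    "738c": "gfx908",
    "740f": "gfx90a",
}


def gfxarchs_from_lspci_output(output):
    """Map `lspci -n` lines such as '03:00.0 0380: 1002:738c (rev 01)'
    to gfxarch names, skipping devices that are not in the table."""
    archs = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3 or ":" not in fields[2]:
            continue
        vendor, device = fields[2].lower().split(":", 1)
        if vendor != AMD_VENDOR_ID:
            continue
        arch = PCI_ID_TO_GFXARCH.get(device)
        if arch is not None and arch not in archs:
            archs.append(arch)
    return archs


def run_lspci():
    """Return numeric lspci output for AMD devices, or '' if lspci is unavailable."""
    try:
        result = subprocess.run(
            ["lspci", "-n", "-d", AMD_VENDOR_ID + ":"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout if result.returncode == 0 else ""


def detect_gfxarchs(lspci_output, slow_fallback):
    """Try the cheap PCI ID lookup first; call the heavyweight fallback
    (e.g. a rocminfo-based query) only when the lookup finds nothing."""
    archs = gfxarchs_from_lspci_output(lspci_output)
    return archs if archs else slow_fallback()
```

A caller would pass `run_lspci()` as the first argument and a `rocminfo`-based function as `slow_fallback`. One limitation of this sketch: on a system that mixes known and unknown GPUs it returns only the known ones and never reaches the fallback.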