Conversation

@jlgreathouse
Contributor

rocm_agent_enumerator currently calls rocminfo to find what gfx architectures are available on the current system. This is used by, for instance, compilers that want to query what to natively build for if they are not provided with a gfxarch target.

However, rocminfo is a very heavyweight method of getting the gfxarch. It queries a large amount of HSA topology information, and opens up /dev/kfd for various querying purposes. This can make builds slow, because each lookup runs the full query just to get the gfxarch.

In addition, it's possible to do a large number of parallel builds (e.g. make -j, even when targeting the number of processors on large server systems). /dev/kfd supports only a limited number of concurrent users, so parallel builds can quickly exhaust it. When that happens, rocminfo returns no gfxarch, which can lead to incorrect compilations.

rocm_agent_enumerator is supposed to have a fallback path when rocminfo finds no GPUs. It uses lspci to find AMD GPU device numbers, then looks them up in a hard-coded table. However, this table is woefully out of date, and the call to lspci is broken anyway. So rocm_agent_enumerator would simply fail to return a gfxarch if rocminfo failed to return that gfxarch.

This patchset:

  1. Fixes the woefully out-of-date PCI ID table. It also fixes the call to lspci so that it actually works.
  2. Adds lspci to the dependency list so that we don't end up shipping Docker containers that don't include proper tools.
  3. Switches the order of device queries so that we call the lower-weight lspci first, and only fall back to the heavyweight rocminfo if our PCI ID list falls out of date.

The PCI ID backup method in rocm_agent_enumerator, where the
tool uses lspci to find all AMD GPU devices in the system and
manually match them to gfx version, is extremely outdated. The
PCI ID list did not include anything after Vega 10, and the
actual call to lspci no longer returned anything due to some
missing conversions.
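
As a rough illustration of the PCI ID path described above, here is a minimal sketch. It is not the code from this patch. The function names and the four table entries are my own illustrative assumptions; the real table should be taken from the amdgpu driver and libhsakmt. It assumes `lspci -n` prints numeric `vendor:device` pairs, with 1002 as the AMD vendor ID.

```python
import re
import subprocess

# Illustrative subset only (assumption): PCI device ID -> gfx target.
# A real table must be checked against the amdgpu driver and libhsakmt.
PCI_ID_TO_GFX = {
    "687f": "gfx900",   # Vega 10
    "66af": "gfx906",   # Vega 20
    "738c": "gfx908",   # Arcturus
    "73bf": "gfx1030",  # Sienna Cichlid
}

# `lspci -n` prints lines such as "03:00.0 0300: 1002:687f (rev c1)".
_AMD_DEVICE = re.compile(r"\b1002:([0-9a-f]{4})\b")


def gfx_from_lspci_output(text):
    """Map every known AMD device ID found in lspci -n output to a gfx name."""
    targets = []
    for line in text.splitlines():
        match = _AMD_DEVICE.search(line.lower())
        if match and match.group(1) in PCI_ID_TO_GFX:
            targets.append(PCI_ID_TO_GFX[match.group(1)])
    return targets


def gfx_from_lspci():
    """Run lspci restricted to vendor 1002; return [] if it is missing or fails."""
    try:
        result = subprocess.run(
            ["lspci", "-n", "-d", "1002:"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return gfx_from_lspci_output(result.stdout)
```

Devices whose IDs are not in the table are skipped, so an empty result can mean either "no AMD GPU" or "table out of date". That ambiguity is why a further fallback is still needed.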

The patch adds all GPUs that might be needed by ROCr up through
Navy Flounder. The PCI ID to gfx matching pulls from the amdgpu
driver and libhsakmt.

When building packages, add in pciutils as a dependency because
rocm_agent_enumerator uses this as a mechanism for looking up
what GPUs exist on the system.

rocminfo is a very heavyweight mechanism for learning a lot of
information about the GPUs that are attached to the system.
It opens up the limited /dev/kfd resource to gather lots of
information about each device, while rocm_agent_enumerator really
only wants the gfx number of AMD devices attached to the system.

To avoid this heavyweight lookup in most cases, this patch switches
the order of tests. Rather than starting with rocminfo and then
falling back to a poorly-maintained PCI ID list, this patch changes
the agent enumerator to start by checking in the PCI ID list (fast
case) and then falling back to rocminfo (slow case) if the PCI ID
list is out of date.
@fxkamd

fxkamd commented Oct 29, 2021

On the most recent kernels we expose the Target version in the sysfs topology to eliminate the lookup in user mode. rocm_agent_enumerator could use that as the first choice when it's available. Look for gfx_target_version in /sys/class/kfd/kfd/topology/nodes/*/properties.
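
A minimal sketch of reading that field, for illustration only. The function names are hypothetical. The decoding rule (major * 10000 + minor * 100 + stepping, with minor and stepping printed in hex, and 0 meaning a node without a GPU) is my assumption about the encoding and is not stated in this thread.

```python
import glob
import os


def decode_gfx_target_version(value):
    """Turn e.g. 90008 into 'gfx908'; return None for 0 (assumed: no GPU on node)."""
    if value == 0:
        return None
    major = value // 10000
    minor = (value // 100) % 100
    stepping = value % 100
    # Assumed convention: major in decimal, minor and stepping in hex.
    return "gfx{:d}{:x}{:x}".format(major, minor, stepping)


def gfx_from_kfd_topology(root="/sys/class/kfd/kfd/topology/nodes"):
    """Collect gfx names from every node's properties file under root."""
    targets = []
    for path in sorted(glob.glob(os.path.join(root, "*", "properties"))):
        try:
            with open(path) as handle:
                for line in handle:
                    fields = line.split()
                    if len(fields) == 2 and fields[0] == "gfx_target_version":
                        name = decode_gfx_target_version(int(fields[1]))
                        if name:
                            targets.append(name)
        except (OSError, ValueError):
            # Unreadable or malformed node: skip it instead of failing.
            continue
    return targets
```

On kernels that do not expose the field, the loop finds nothing and the function returns an empty list, which the caller can treat as "try the next method".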

New versions of amdkfd include the gfx architecture version number
for all GPUs surfaced in the HSA topology. This patch adds this as
the preferred way for rocm_agent_enumerator to check for supported
gfx architecture numbers.

Kernels that are missing this feature will not have the value in
the topology. rocm_agent_enumerator will fall back to checking
against the PCI IDs in this case. If PCI IDs fail, we fall back
to the heavyweight rocminfo method.
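
The fallback order can be sketched roughly as follows. This is an illustration, not the patch itself: `first_successful` and the three stand-in detectors are hypothetical names, and each real detector is assumed to return a list of gfx names (empty when it finds nothing).

```python
def first_successful(detectors):
    """Try (name, callable) pairs in order; return the first non-empty result.

    A detector that raises is treated the same as one that finds nothing,
    so a broken cheap method never blocks the slower ones behind it.
    """
    for name, detect in detectors:
        try:
            targets = detect()
        except Exception:
            targets = []
        if targets:
            return name, targets
    return None, []


if __name__ == "__main__":
    # Stand-ins for the three methods, cheapest first.
    order = [
        ("kfd-topology", lambda: []),        # old kernel: field missing
        ("lspci", lambda: ["gfx906"]),       # PCI ID table knows the device
        ("rocminfo", lambda: ["gfx906"]),    # never reached in this run
    ]
    print(first_successful(order))
```

Because the loop stops at the first non-empty answer, rocminfo (and with it /dev/kfd) is only touched when both cheaper methods come back empty.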
@jlgreathouse
Contributor Author

On the most recent kernels we expose the Target version in the sysfs topology to eliminate the lookup in user mode. rocm_agent_enumerator could use that as the first choice when it's available. Look for gfx_target_version in /sys/class/kfd/kfd/topology/nodes/*/properties.

Glad you caught this -- I didn't know this had made it into KFD. None of my test systems had it when I wrote the other patches.

I just pushed a further patch to this PR that uses the KFD topology as the preferred method for finding the gfx arch. If that is unavailable, it falls back to lspci, and then to rocminfo.

@amd-hsivasun

Unable to import to rocm-systems due to merge conflict
