Use CDI for GPU injection for AMD devices for --gpus #52048
shiv-tyagi wants to merge 1 commit into moby:master
daemon/devices_amd_linux.go
Outdated
    // Try to detect AMD GPU vendor via CDI cache if cdiCache is available
    if cdiCache != nil {
        vendor, err := discoverGPUVendorFromCDI(cdiCache)
One thing about this approach: it only checks whether the cache includes AMD CDI devices at the point where the daemon is reloaded. In contrast to the other projects where we have added this functionality, the cache here is started with AutoRefresh enabled, meaning that the CDI spec directories are watched for changes so that specs for new devices are detected.
With that in mind, the drivers that one wants to register would have to determine the vendor from the cache for every --gpus request, not only once at startup.
Yes, makes sense.
I have updated the logic to always discover the vendor from the CDI registry on the fly, since the registry is auto-refreshed. I have also verified that it works as expected by deleting the CDI files while the daemon is running and confirming that vendor discovery fails in that case.
Thanks for the suggestion.
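To illustrate the point discussed above, here is a minimal sketch of looking up the vendor on every request rather than once at startup. It is not the code from this PR: `vendorSource`, `fakeCache`, `hasVendor`, and `handleGPURequest` are hypothetical names, and `fakeCache` merely stands in for an auto-refreshed CDI cache.

```go
package main

import (
	"errors"
	"fmt"
)

// vendorSource is a hypothetical stand-in for the CDI cache. Only a
// ListVendors method is assumed here.
type vendorSource interface {
	ListVendors() []string
}

// fakeCache simulates a cache whose contents change while the daemon
// runs, as an auto-refreshed cache would when spec files come and go.
type fakeCache struct {
	vendors []string
}

func (f *fakeCache) ListVendors() []string { return f.vendors }

// hasVendor queries the source on every call instead of reusing a
// result captured at startup.
func hasVendor(src vendorSource, want string) bool {
	for _, v := range src.ListVendors() {
		if v == want {
			return true
		}
	}
	return false
}

// handleGPURequest is a hypothetical per-request entry point. It fails
// when no AMD CDI spec is visible at the time of the request.
func handleGPURequest(src vendorSource) error {
	if !hasVendor(src, "amd.com") {
		return errors.New("AMD CDI spec not found")
	}
	return nil
}

func main() {
	cache := &fakeCache{vendors: []string{"amd.com"}}
	fmt.Println("specs present:", handleGPURequest(cache))

	// Simulate deleting the CDI spec files while the daemon is running.
	cache.vendors = nil
	fmt.Println("specs removed:", handleGPURequest(cache))
}
```

Because the lookup happens inside the request path, removing the specs changes the outcome of the next request without a daemon reload, which matches the manual check described in the reply.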
f0f8245 to faba468
daemon/devices_amd_linux.go
Outdated
        }
    }

    func getAMDDeviceDrivers(cdiCache *cdi.Cache) map[string]*deviceDriver {
This always returns either no driver or just a single "amd" driver. Seems there's no need to return the whole map?
    - func getAMDDeviceDrivers(cdiCache *cdi.Cache) map[string]*deviceDriver {
    + func getAMDDeviceDrivers(cdiCache *cdi.Cache) *deviceDriver {

    if vendor != "amd.com" {
        return fmt.Errorf("AMD CDI spec not found")
    }
IIUC, createAMDCDIUpdater is always called when the cdiCache exists. So will this error always pop up on NVIDIA-only systems?
Hey,
RegisterGPUDeviceDrivers() checks getNVIDIADeviceDrivers() first and returns as soon as it finds anything, registering those via registerDeviceDriver(...). Only when NVIDIA drivers are not detected do we fall back to getAMDDeviceDrivers().
After we fall back to AMD, createAMDCDIUpdater() makes sure that even if cdiCache is not nil, the updater only runs when AMD’s CDI specs are present (vendor resolves to "amd.com"). That prevents the AMD path from activating on non-AMD systems.
I didn’t change NVIDIA driver detection (getNVIDIADeviceDrivers()) or any of the NVIDIA logic; that path remains as-is.
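The ordering described in this reply can be sketched roughly as follows. This is a simplified illustration, not the daemon's code: `selectGPUDrivers` is a hypothetical helper, the detection functions are passed in as parameters so the example stays self-contained, and `deviceDriver` is reduced to a single field.

```go
package main

import "fmt"

// deviceDriver is a simplified stand-in for the daemon's driver type.
type deviceDriver struct {
	name string
}

// selectGPUDrivers is a hypothetical helper showing the fallback order:
// NVIDIA detection runs first, and AMD detection is consulted only when
// NVIDIA detection finds nothing.
func selectGPUDrivers(
	detectNVIDIA func() map[string]*deviceDriver,
	detectAMD func() *deviceDriver,
) map[string]*deviceDriver {
	if drivers := detectNVIDIA(); len(drivers) > 0 {
		return drivers // AMD detection is never reached on this path
	}
	if d := detectAMD(); d != nil {
		return map[string]*deviceDriver{"amd": d}
	}
	return nil
}

func main() {
	nvidiaPresent := func() map[string]*deviceDriver {
		return map[string]*deviceDriver{"nvidia": {name: "nvidia"}}
	}
	nvidiaAbsent := func() map[string]*deviceDriver { return nil }
	amdPresent := func() *deviceDriver { return &deviceDriver{name: "amd"} }

	for name := range selectGPUDrivers(nvidiaPresent, amdPresent) {
		fmt.Println("NVIDIA detected, registered:", name)
	}
	for name := range selectGPUDrivers(nvidiaAbsent, amdPresent) {
		fmt.Println("NVIDIA absent, registered:", name)
	}
}
```

Under this assumed structure, an NVIDIA-only system never enters the AMD path, so the "AMD CDI spec not found" error would not be produced there.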
daemon/devices.go
Outdated
    type vendorLister interface {
        ListVendors() []string
    }

    func discoverGPUVendor(l vendorLister) (string, error) {
        if l == nil {
            return "", fmt.Errorf("vendor lister not available")
        }
Why not just pass the vendor list directly?
Perhaps it could also use a name like:
    - type vendorLister interface {
    -     ListVendors() []string
    - }
    - func discoverGPUVendor(l vendorLister) (string, error) {
    -     if l == nil {
    -         return "", fmt.Errorf("vendor lister not available")
    -     }
    + func getFirstAvailableVendor(vendorList []string) (string, error) {
Makes sense. I have applied this suggestion. Thanks.
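For illustration, a function with the suggested signature might look like the sketch below. The body is an assumption, since the thread only shows the signature: the `knownGPUVendors` priority list and the error text are hypothetical, and the actual selection rule in the PR may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// knownGPUVendors is an assumed priority list, used here only for
// illustration.
var knownGPUVendors = []string{"nvidia.com", "amd.com"}

// getFirstAvailableVendor returns the first known GPU vendor that
// appears in vendorList. Taking a plain slice means callers (and unit
// tests) need neither an interface nor a nil check on a lister.
func getFirstAvailableVendor(vendorList []string) (string, error) {
	present := make(map[string]struct{}, len(vendorList))
	for _, v := range vendorList {
		present[v] = struct{}{}
	}
	for _, known := range knownGPUVendors {
		if _, ok := present[known]; ok {
			return known, nil
		}
	}
	return "", errors.New("no known GPU vendor found in CDI vendor list")
}

func main() {
	vendor, err := getFirstAvailableVendor([]string{"example.com", "amd.com"})
	fmt.Println(vendor, err)

	_, err = getFirstAvailableVendor(nil)
	fmt.Println(err)
}
```

With this shape, a caller holding a CDI cache would pass in the result of its ListVendors() call, and a test can pass a literal slice.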
Signed-off-by: Shiv Tyagi <[email protected]>
faba468 to 5ec493c
Closes #49824
This PR enhances the --gpus option for AMD GPUs by using CDI (Container Device Interface) specs for device injection when available. It falls back to the existing vendor runtime-based injection if AMD CDI specs are not detected on the machine.

Related PR: containerd/containerd#12839 (similar implementation for containerd/ctr)

- What I did
Added support for CDI-based GPU device injection through the --gpus option for AMD devices.

- How I did it
Created a composite device driver, similar to NVIDIA's, which discovers whether AMD's CDI specs are present on the system during registration and registers itself with the appropriate updaters to handle the device request.

- How to verify it
1. Run make binary.
2. Start a dockerd instance via ./bundles/binary/dockerd.
3. Run amd-ctk cdi generate to install the CDI specs on the host.
4. Run docker run --rm --gpus all rocm/rocm-terminal rocm-smi.
5. […] dockerd process.
6. Run docker run --rm --runtime="amd" --gpus all rocm/rocm-terminal rocm-smi.
7. […] AMD_VISIBLE_DEVICES set when CDI specs are not there to verify that the fallback is working correctly.

I have also added unit tests for the vendor discovery function.