
Use CDI for GPU injection for AMD devices for --gpus #52048

Open
shiv-tyagi wants to merge 1 commit into moby:master from shiv-tyagi:vendor-detection


@shiv-tyagi shiv-tyagi commented Feb 16, 2026

Closes #49824

This PR enhances the functionality of the --gpus option for AMD GPUs by utilizing CDI (Container Device Interface) specs for device injection when available. It falls back to the existing vendor runtime-based injection if AMD CDI specs are not detected on the machine.

Related PR: containerd/containerd#12839 (Similar implementation for containerd/ctr)

- What I did
Added support for CDI-based GPU device injection through --gpus option for AMD devices.

- How I did it
Created a composite device driver, similar to NVIDIA's, that checks during registration whether AMD's CDI specs are present on the system and registers itself with the appropriate updaters to handle the device request.
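
To make the "composite driver with updaters" idea concrete, here is a minimal sketch of one way such a driver could pick between CDI injection and the vendor-runtime fallback. It is not the code from this PR: `updater`, `compositeDriver`, `cdiUpdater`, and `runtimeUpdater` are hypothetical names, and the updaters only report which path was taken instead of editing a container spec.

```go
package main

import (
	"errors"
	"fmt"
)

// updater is a hypothetical stand-in for a function that edits a container
// spec. Here it only reports which injection path handled the request.
type updater func(request string) (path string, err error)

// compositeDriver (hypothetical) tries its updaters in order and returns the
// result of the first one that succeeds.
type compositeDriver struct {
	updaters []updater
}

func (d compositeDriver) handle(request string) (string, error) {
	lastErr := errors.New("no updater registered")
	for _, u := range d.updaters {
		path, err := u(request)
		if err == nil {
			return path, nil
		}
		lastErr = err
	}
	return "", lastErr
}

// cdiUpdater succeeds only while hasSpecs reports that AMD CDI specs exist.
func cdiUpdater(hasSpecs func() bool) updater {
	return func(string) (string, error) {
		if !hasSpecs() {
			return "", errors.New("AMD CDI spec not found")
		}
		return "cdi", nil
	}
}

// runtimeUpdater models the vendor-runtime fallback, which always applies.
func runtimeUpdater(string) (string, error) {
	return "runtime", nil
}

func main() {
	specsPresent := true
	d := compositeDriver{updaters: []updater{
		cdiUpdater(func() bool { return specsPresent }),
		runtimeUpdater,
	}}

	path, _ := d.handle("all")
	fmt.Println(path) // cdi

	specsPresent = false // simulate deleting the CDI specs
	path, _ = d.handle("all")
	fmt.Println(path) // runtime
}
```

Ordering the updaters this way keeps the CDI path preferred whenever specs exist, while the runtime path remains available without any extra configuration.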

- How to verify it

  1. Built the binaries using make binary.
  2. Started the newly built dockerd instance via ./bundles/binary/dockerd.
  3. Used amd-ctk cdi generate to install the CDI specs on the host.
  4. Test Injection: Ran docker run --rm --gpus all rocm/rocm-terminal rocm-smi.
    • Verified that CDI-based GPU injection works as expected.
  5. Test (Fallback Path):
    • Deleted the CDI specs and restarted the dockerd process.
    • Retried using the runtime flag: docker run --rm --runtime="amd" --gpus all rocm/rocm-terminal rocm-smi.
    • Verified that the vendor runtime still works as a fallback when CDI specs are absent.
    • Checked that the AMD_VISIBLE_DEVICES environment variable is set inside the container when CDI specs are absent, confirming that the fallback path is in use.

I have also added unit tests for the vendor discovery function.

- Human readable description for the release notes

The `--gpus` option now supports CDI-based injection for AMD GPUs.

- A picture of a cute animal (not mandatory but encouraged)

@github-actions github-actions bot added the area/daemon Core Engine label Feb 16, 2026

// Try to detect AMD GPU vendor via CDI cache if cdiCache is available
if cdiCache != nil {
vendor, err := discoverGPUVendorFromCDI(cdiCache)
Contributor

One thing about this approach: it only checks whether the cache includes AMD CDI devices at the point where the daemon is reloaded. In contrast to the other projects where we have added this functionality, the cache here is started with AutoRefresh enabled, meaning that the CDI spec directories are watched for changes so that specs for new devices are detected.

With that in mind, the drivers that one wants to register would have to determine the vendor from the cache for every --gpus request and not only once at startup.

Author

Yes, makes sense.

I have updated the logic to always discover the vendor from the CDI registry on the fly, since the registry is auto-refreshed. I also verified that this works as expected by deleting the CDI files while the daemon was running and confirming that vendor discovery then fails.

Thanks for the suggestion.
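
The difference between resolving the vendor once at startup and resolving it per request can be illustrated with a small sketch. The `vendorLister` interface is the one quoted later in this review; `fakeCache` and `hasVendor` are hypothetical names used only to simulate an auto-refreshing cache whose contents change while the daemon runs.

```go
package main

import "fmt"

// vendorLister mirrors the interface quoted in the review thread.
type vendorLister interface {
	ListVendors() []string
}

// fakeCache is a hypothetical stand-in for an auto-refreshing CDI cache:
// its vendor list can change after the daemon has started.
type fakeCache struct {
	vendors []string
}

func (c *fakeCache) ListVendors() []string { return c.vendors }

// hasVendor queries the lister at call time, so it reflects the cache's
// current contents rather than a snapshot.
func hasVendor(l vendorLister, want string) bool {
	for _, v := range l.ListVendors() {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	cache := &fakeCache{vendors: []string{"amd.com"}}

	// Resolved once, as a startup-time check would do.
	atStartup := hasVendor(cache, "amd.com")

	// Simulate the CDI spec files being deleted while the daemon runs.
	cache.vendors = nil

	// The startup snapshot is now stale; the per-request check is not.
	fmt.Println(atStartup, hasVendor(cache, "amd.com")) // true false
}
```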

@vvoland vvoland added kind/enhancement area/cdi labels Feb 16, 2026
@vvoland vvoland added this to the 29.3.0 milestone Feb 16, 2026
}
}

func getAMDDeviceDrivers(cdiCache *cdi.Cache) map[string]*deviceDriver {
Contributor

This always returns either no driver or just a single "amd" driver. Seems there's no need to return the whole map?

Suggested change
- func getAMDDeviceDrivers(cdiCache *cdi.Cache) map[string]*deviceDriver {
+ func getAMDDeviceDrivers(cdiCache *cdi.Cache) *deviceDriver {

Author

Agreed. Done. Thanks.

Comment on lines +44 to +46
if vendor != "amd.com" {
return fmt.Errorf("AMD CDI spec not found")
}
Contributor

IIUC, createAMDCDIUpdater is always called when the cdiCache exists. So will this error always pop up on NVIDIA-only systems?

Author

@shiv-tyagi shiv-tyagi Feb 27, 2026

Hey,

RegisterGPUDeviceDrivers() checks getNVIDIADeviceDrivers() first and returns as soon as it finds anything, registering those via registerDeviceDriver(...). Only when NVIDIA drivers are not detected do we fall back to getAMDDeviceDrivers().
After we fall back to AMD, createAMDCDIUpdater() makes sure that even if cdiCache is not nil, the updater only runs when AMD’s CDI specs are present (vendor resolves to "amd.com"). That prevents the AMD path from activating on non-AMD systems.
I didn’t change NVIDIA driver detection (getNVIDIADeviceDrivers()) or any of the NVIDIA logic; that path remains as-is.
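
The registration order described in this reply can be sketched as follows. This is a simplified illustration, not the daemon's code: `selectGPUDrivers` is a hypothetical helper, `deviceDriver` is reduced to a name, the two detection functions are passed in as parameters so the example is self-contained, and the "nvidia" map key is an assumption made for the example.

```go
package main

import "fmt"

// deviceDriver is a reduced, hypothetical version of the daemon's type.
type deviceDriver struct {
	name string
}

// selectGPUDrivers (hypothetical) returns the NVIDIA drivers if any are
// detected, and consults the AMD path only when none are.
func selectGPUDrivers(
	nvidia func() map[string]*deviceDriver,
	amd func() *deviceDriver,
) map[string]*deviceDriver {
	if drivers := nvidia(); len(drivers) > 0 {
		return drivers
	}
	if d := amd(); d != nil {
		return map[string]*deviceDriver{"amd": d}
	}
	return nil
}

func main() {
	noNVIDIA := func() map[string]*deviceDriver { return nil }
	withNVIDIA := func() map[string]*deviceDriver {
		return map[string]*deviceDriver{"nvidia": {name: "nvidia"}}
	}
	amd := func() *deviceDriver { return &deviceDriver{name: "amd"} }

	for name := range selectGPUDrivers(withNVIDIA, amd) {
		fmt.Println(name) // nvidia
	}
	for name := range selectGPUDrivers(noNVIDIA, amd) {
		fmt.Println(name) // amd
	}
}
```

Because the AMD function is never called when NVIDIA drivers are found, an AMD-specific error cannot surface on a system where the NVIDIA path already matched.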

Comment on lines +42 to +50
type vendorLister interface {
ListVendors() []string
}

func discoverGPUVendor(l vendorLister) (string, error) {
if l == nil {
return "", fmt.Errorf("vendor lister not available")
}

Contributor

Why not just pass the vendor list directly?

Perhaps it could also use a name like:

Suggested change
- type vendorLister interface {
- ListVendors() []string
- }
- func discoverGPUVendor(l vendorLister) (string, error) {
- if l == nil {
- return "", fmt.Errorf("vendor lister not available")
- }
+ func getFirstAvailableVendor(vendorList []string) (string, error) {

Author

Makes sense. I have applied this suggestion. Thanks.
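
For illustration, a function with the suggested signature might look like the sketch below. Only the signature comes from the review; the body, the `knownGPUVendors` set, and the error text are assumptions and may differ from what was merged.

```go
package main

import (
	"errors"
	"fmt"
)

// knownGPUVendors is an assumed set of CDI vendor names treated as GPU
// vendors; the actual list used by the daemon may differ.
var knownGPUVendors = map[string]bool{
	"nvidia.com": true,
	"amd.com":    true,
}

// getFirstAvailableVendor returns the first entry of vendorList that is a
// known GPU vendor. Taking a plain slice avoids the nil-interface check and
// lets tests pass literal lists without a fake lister.
func getFirstAvailableVendor(vendorList []string) (string, error) {
	for _, v := range vendorList {
		if knownGPUVendors[v] {
			return v, nil
		}
	}
	return "", errors.New("no known GPU vendor found in CDI specs")
}

func main() {
	vendor, err := getFirstAvailableVendor([]string{"example.com", "amd.com"})
	fmt.Println(vendor, err) // amd.com <nil>

	_, err = getFirstAvailableVendor(nil)
	fmt.Println(err) // no known GPU vendor found in CDI specs
}
```

A caller holding a CDI cache would pass the result of its ListVendors() method, which keeps the discovery logic independent of the cache type.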


Labels

area/cdi area/daemon Core Engine area/testing kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Development

Successfully merging this pull request may close these issues.

re-implement --gpus flag using CDI (was "AMD GPU support")

3 participants