ROCm: Add gfx950 (MI355X/CDNA4) to is_cdna() #4051
danielhanchen merged 1 commit into unslothai:main
Conversation
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942), but was missing from `is_cdna()`, causing all Triton kernels to use `num_warps=32` (2048 threads) instead of 16 (1024 threads), resulting in an OutOfResources crash. Tested on: 8× AMD Instinct MI355X (gfx950), ROCm 7.1
Summary of Changes: Hello @GoldenGrapeGentleman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical compatibility issue for AMD Instinct MI355X (CDNA4) GPUs within the Triton framework by extending `is_cdna()` to recognize the gfx950 architecture.
Code Review
This pull request adds support for AMD's MI355X (gfx950 / CDNA4) GPUs by including 'gfx950' in the list of CDNA architectures. This is a crucial fix that resolves an 'OutOfResources' error, enabling Triton kernels to use the correct number of warps on this hardware. The change is straightforward, well-justified, and supported by thorough testing results provided in the description. The implementation is correct and follows the existing pattern. Excellent work!
Oh thank you - also thanks for the other PRs! Will review

Oh thanks

I was still working on the other PR!
Great Effort! |
Summary
Add AMD Instinct MI355X (gfx950 / CDNA4) to `is_cdna()` so Triton kernels use the correct `num_warps`.

Problem

`is_cdna()` only listed gfx940/941/942 (MI300 series). MI355X (gfx950, CDNA4) has the same 1024-thread workgroup limit and 64-thread wavefront size, but was missing from the list. This caused all Triton kernels to use `num_warps=32` (2048 threads) instead of 16 (1024 threads), triggering an OutOfResources error. This blocked all training on MI355X.
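The arithmetic behind the failure can be sketched in a few lines. This is an illustrative example, not Unsloth's actual code: the function name `pick_num_warps` and the constants are assumptions chosen to mirror the numbers above (64-thread CDNA wavefront, 1024-thread workgroup limit).

```python
# Hypothetical sketch of why num_warps matters on CDNA GPUs.
# Constants mirror the PR description: CDNA wavefronts are 64 threads
# wide, and gfx942/gfx950 cap workgroups at 1024 threads.
WAVEFRONT_SIZE = 64
MAX_WORKGROUP_THREADS = 1024

def pick_num_warps(is_cdna: bool) -> int:
    """Choose num_warps so num_warps * WAVEFRONT_SIZE stays within
    the CDNA workgroup limit; non-CDNA targets keep the default 32."""
    num_warps = 16 if is_cdna else 32
    if is_cdna:
        # 16 warps * 64 threads = 1024 threads, exactly at the limit.
        assert num_warps * WAVEFRONT_SIZE <= MAX_WORKGROUP_THREADS
    return num_warps

# Before this PR, gfx950 was not recognized as CDNA, so it fell through
# to the default path: 32 warps * 64 threads = 2048 threads > 1024,
# which is what produced the OutOfResources crash.
```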
Change
```diff
 def is_cdna():
     return is_hip() and triton.runtime.driver.active.get_current_target().arch in (
         "gfx940",
         "gfx941",
         "gfx942",
+        "gfx950",  # CDNA4 (MI350/MI355X)
     )
```
Tested on 8× AMD Instinct MI355X (gfx950), ROCm 7.1
Note
Full MI355X support also requires PR #4021 (ROCm GPT-OSS MXFP4→BF16 routing) by @danielhanchen. I closed my combined change (PR #4050); this PR is the additional piece needed for CDNA4 Triton kernel compatibility.