perf(ROCm): add is_rdna() detection and optimize CE loss for RDNA GPUs #4123
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request improves the performance of Triton kernels on AMD RDNA consumer GPUs through architecture-specific tuning. It adjusts kernel launch parameters, specifically the number of warps, to better suit the dual-issue SIMD32 Compute Unit design of RDNA architectures. This results in speedups for operations such as RMS LayerNorm and Cross Entropy Loss on a wider range of AMD hardware.
Code Review
This pull request introduces performance optimizations for AMD RDNA consumer GPUs by detecting the RDNA architecture and tuning Triton kernel parameters. A new is_rdna() function is added, and num_warps is adjusted in calculate_settings and cross_entropy_loss based on RDNA's specific microarchitecture, leading to significant performance gains as shown in the benchmarks. The changes are well-structured and justified. I have one suggestion to improve the implementation of the new is_rdna function for better performance and maintainability.
unsloth/kernels/utils.py
Outdated
return is_hip() and triton.runtime.driver.active.get_current_target().arch in (
    "gfx1100",  # RDNA3 (RX 7900 XTX/XT, PRO W7900/W7800)
    "gfx1101",  # RDNA3 (RX 7800 XT, RX 7700 XT)
    "gfx1102",  # RDNA3 (RX 7600 XT/7600)
    "gfx1150",  # RDNA3.5 (Strix Point APU)
    "gfx1151",  # RDNA3.5 (Strix Halo APU)
    "gfx1200",  # RDNA4 (RX 9070 XT)
    "gfx1201",  # RDNA4 (RX 9070)
)
For improved performance, consider using a set for the architecture check instead of a tuple. Membership testing against a set is more efficient (O(1) on average) than a tuple (O(n)). Since this function is cached, the set will only be created once.
Suggested change
- return is_hip() and triton.runtime.driver.active.get_current_target().arch in (
-     "gfx1100",  # RDNA3 (RX 7900 XTX/XT, PRO W7900/W7800)
-     "gfx1101",  # RDNA3 (RX 7800 XT, RX 7700 XT)
-     "gfx1102",  # RDNA3 (RX 7600 XT/7600)
-     "gfx1150",  # RDNA3.5 (Strix Point APU)
-     "gfx1151",  # RDNA3.5 (Strix Halo APU)
-     "gfx1200",  # RDNA4 (RX 9070 XT)
-     "gfx1201",  # RDNA4 (RX 9070)
- )
+ return is_hip() and triton.runtime.driver.active.get_current_target().arch in {
+     "gfx1100",  # RDNA3 (RX 7900 XTX/XT, PRO W7900/W7800)
+     "gfx1101",  # RDNA3 (RX 7800 XT, RX 7700 XT)
+     "gfx1102",  # RDNA3 (RX 7600 XT/7600)
+     "gfx1150",  # RDNA3.5 (Strix Point APU)
+     "gfx1151",  # RDNA3.5 (Strix Halo APU)
+     "gfx1200",  # RDNA4 (RX 9070 XT)
+     "gfx1201",  # RDNA4 (RX 9070)
+ }
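The membership check discussed in this review can be sketched without Triton by separating the architecture lookup from the set test. In the minimal sketch below, `RDNA_ARCHS` and `arch_is_rdna` are hypothetical names that do not appear in the PR, and the arch string is passed in as an argument, whereas the PR reads it from `triton.runtime.driver.active.get_current_target().arch`.

```python
from functools import lru_cache

# Hypothetical module-level constant; the arch names are the ones listed in the PR.
RDNA_ARCHS = frozenset(
    {
        "gfx1100",  # RDNA3 (RX 7900 XTX/XT, PRO W7900/W7800)
        "gfx1101",  # RDNA3 (RX 7800 XT, RX 7700 XT)
        "gfx1102",  # RDNA3 (RX 7600 XT/7600)
        "gfx1150",  # RDNA3.5 (Strix Point APU)
        "gfx1151",  # RDNA3.5 (Strix Halo APU)
        "gfx1200",  # RDNA4 (RX 9070 XT)
        "gfx1201",  # RDNA4 (RX 9070)
    }
)


@lru_cache(maxsize=None)
def arch_is_rdna(arch: str) -> bool:
    """Return True if `arch` is one of the RDNA targets listed above.

    Hypothetical helper: takes the arch string as input instead of
    querying the Triton driver, so it can run on any machine.
    """
    return arch in RDNA_ARCHS
```

With only seven entries, the lookup cost of a tuple versus a set is likely negligible in practice, especially when the result is cached; the set form mainly documents that order does not matter.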
bdfcdf8 to c1dc4bb
c1dc4bb to 98abd93
Rebased: Removed duplicate
Dependency: This PR depends on #4109 which provides
Changes: Only
Use 16 warps for RDNA in the chunked cross-entropy forward kernel
(large vocab > 65536), matching the existing CDNA optimization.
Benchmarked on W7900 (gfx1100) with actual unsloth kernels (5 trials, median):
- Chunked CE forward (BS=65536): 16 warps = 2.4-2.6x faster than 32
- All other kernels (LayerNorm, RoPE, SwiGLU): default heuristic is
already optimal for RDNA; no modification needed.
Depends on: unslothai#4109 (provides is_rdna() detection)
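The warp selection described in this commit message can be sketched as a pure function. `chunked_ce_num_warps` is a hypothetical helper used only for illustration; the PR description gives the expression as `16 if is_cdna() or is_rdna() else 32`, calling the detection functions directly rather than taking flags as arguments.

```python
def chunked_ce_num_warps(on_cdna: bool, on_rdna: bool) -> int:
    """Pick num_warps for the chunked cross-entropy forward kernel.

    Hypothetical helper mirroring `16 if is_cdna() or is_rdna() else 32`:
    AMD CDNA and RDNA targets get 16 warps, everything else keeps 32.
    """
    return 16 if (on_cdna or on_rdna) else 32


# Example: an RDNA3 card such as the W7900 (gfx1100) would take the 16-warp path.
rdna_warps = chunked_ce_num_warps(on_cdna=False, on_rdna=True)
```

Passing the flags in keeps the decision testable on machines without an AMD GPU.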
danielhanchen left a comment
Thank you! This works great!
…nslothai#4123)" This reverts commit 4d3e7d7.
Summary
Apply targeted Triton kernel tuning for the chunked cross-entropy forward path on AMD RDNA consumer/workstation GPUs (RDNA3/RDNA4).
Changes
- unsloth/kernels/cross_entropy_loss.py: num_warps=16 for RDNA (32 if not is_cdna() else 16 → 16 if is_cdna() or is_rdna() else 32)
Benchmark Results
Hardware: AMD Radeon PRO W7900 (gfx1100, RDNA3, 48GB)
Method: 5 independent trials × 300 iterations each, median reported
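The PR does not show its benchmark harness, so the following is only a sketch of what "5 trials × 300 iterations, median reported" could look like. `median_trial_time` is a hypothetical name, and the sketch uses the standard library only; timing real GPU kernels would additionally require a device synchronization (for example `torch.cuda.synchronize()`) before each timer read, which is omitted here.

```python
import statistics
import time
from typing import Callable


def median_trial_time(fn: Callable[[], object], trials: int = 5, iters: int = 300) -> float:
    """Return the median, over `trials`, of the mean seconds per call.

    Each trial times `iters` back-to-back calls of `fn` and records the
    average time per call; the median across trials is returned.
    """
    per_trial = []
    for _ in range(trials):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        elapsed = time.perf_counter() - start
        per_trial.append(elapsed / iters)
    return statistics.median(per_trial)
```

Reporting the median rather than the mean makes the result less sensitive to a single slow trial, such as one that includes warm-up or compilation.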
Chunked CE Forward (large vocab, BS=65536)
Other kernels — no modification needed
Testing
- is_rdna() returns True on W7900 (gfx1100)
- is_rdna() returns False on NVIDIA GPUs and CDNA GPUs
cc @danielhanchen