[CUDA] Faster rms norm for small dimension #2838

Merged: awni merged 2 commits into ml-explore:main from awni:faster_rms_norm on Nov 26, 2025
Conversation

awni (Member) commented Nov 26, 2025

Benchmark for RMS norm and its VJP with total size 1024*1024*8, varying the last dimension D (the one that is normalized over).
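For readers who want the setup spelled out, here is a minimal pure-Python sketch of what is being measured. It assumes the usual RMS norm definition (scale each row by the reciprocal root mean square, with `eps` added inside the square root); the names `rms_norm` and `num_rows` and the default `eps` are illustrative and not taken from the PR. Since the total size is fixed, a smaller D means more rows to normalize independently.

```python
import math

TOTAL = 1024 * 1024 * 8  # total size from the benchmark description


def rms_norm(row, weight, eps=1e-5):
    """Reference RMS norm for one row (assumed definition, not the CUDA kernel)."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)]


def num_rows(d):
    """Number of rows normalized independently when the last dimension is d."""
    return TOTAL // d


for d in (64, 512, 8192):
    print(d, num_rows(d))
# prints:
# 64 131072
# 512 16384
# 8192 1024
```

In the tables, the Pre column roughly doubles each time D halves below 512, which tracks this row count.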

Forward

| D | Pre (ms) | Post (ms) |
|---|---|---|
| 64 | 4.333 | 0.567 |
| 128 | 2.296 | 0.560 |
| 256 | 1.260 | 0.551 |
| 512 | 0.767 | 0.604 |
| 1024 | 0.718 | 0.607 |
| 2048 | 0.736 | 0.614 |
| 4096 | 0.772 | 0.625 |
| 8192 | 0.984 | 0.691 |

VJP

| D | Pre (ms) | Post (ms) |
|---|---|---|
| 64 | 12.24 | 3.865 |
| 128 | 6.532 | 2.844 |
| 256 | 3.321 | 2.279 |
| 512 | 2.452 | 2.141 |
| 1024 | 2.269 | 1.974 |
| 2048 | 2.277 | 1.982 |
| 4096 | 2.448 | 2.131 |
| 8192 | 2.884 | 2.362 |
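To read the two tables as speedups, a short sketch like the following divides Pre by Post for each D. The timings are copied from the tables above; the dictionary layout and the `speedups` helper are just for illustration.

```python
# (pre_ms, post_ms) per last-dimension size D, copied from the tables above
forward = {
    64: (4.333, 0.567), 128: (2.296, 0.560), 256: (1.260, 0.551),
    512: (0.767, 0.604), 1024: (0.718, 0.607), 2048: (0.736, 0.614),
    4096: (0.772, 0.625), 8192: (0.984, 0.691),
}
vjp = {
    64: (12.24, 3.865), 128: (6.532, 2.844), 256: (3.321, 2.279),
    512: (2.452, 2.141), 1024: (2.269, 1.974), 2048: (2.277, 1.982),
    4096: (2.448, 2.131), 8192: (2.884, 2.362),
}


def speedups(table):
    """Return {D: pre / post} for one benchmark table."""
    return {d: pre / post for d, (pre, post) in table.items()}


for name, table in (("forward", forward), ("vjp", vjp)):
    for d, s in speedups(table).items():
        print(f"{name} D={d}: {s:.2f}x")
```

The largest gains are at D=64 (about 7.6x forward and 3.2x for the VJP), and every size shows some improvement.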

awni (Member, Author) commented Nov 26, 2025

The improvement for pretraining 0.6B is ok but not as much as I was hoping:

On B200:

Pre: toks_per_sec: 96970.51
Post: toks_per_sec: 105429.75
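As a quick check on the size of that end-to-end gain, the relative change in throughput can be computed from the two figures above (variable names are illustrative):

```python
pre_toks_per_sec = 96970.51    # before the change, from the comment above
post_toks_per_sec = 105429.75  # after the change

# Relative throughput improvement, as a fraction of the baseline
gain = (post_toks_per_sec - pre_toks_per_sec) / pre_toks_per_sec
print(f"throughput gain: {gain:.1%}")  # prints "throughput gain: 8.7%"
```

So the kernel-level speedups translate to roughly a 9% improvement in tokens per second for this run.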

awni requested a review from zcbenz on Nov 26, 2025, 16:13
awni changed the title from "[WIP][CUDA] Faster rms norm for small dimension" to "[CUDA] Faster rms norm for small dimension" on Nov 26, 2025
zcbenz (Collaborator) left a comment

Very nice improvement!

awni merged commit dd79d3c into ml-explore:main on Nov 26, 2025
12 checks passed
awni deleted the faster_rms_norm branch on Dec 3, 2025, 15:10