Merged
Conversation
awni
reviewed
Dec 16, 2025
awni (Member): This is looking awesome! I left a few more comments. Could you address, rebase and then we can run the tests and try and get it merged?
This PR adds a new operation, mx.qqmm. The current structure is probably neither optimal nor final.

General comment

- The main use case is qqmm(quantized weights, bf16 activations).
- For nvfp4 we need to transpose and quantize again along a different dimension.
- For mxfp8, the recommended recipe is to quantize with 1D blocks and keep two views of the weights (normal and transposed).
- Therefore, mx.qqmm supports either (non-quantised x, non-quantised w) or (non-quantised x, quantised w, scales_w).

Details
- scales are repacked on every call for both weights and activations. In the future, we probably want to handle this in fp_quantize.
- Batched qqmm is currently not supported; inputs must be 2D. For now it is implemented this way because:
  - CUBLASLT_BATCH_MODE_STRIDED is not supported for scales.
  - CUBLASLT_BATCH_MODE_POINTER_ARRAY is not supported for arrays with block scaling.
- qqmm is always executed in TN layout (transpose = True). There are several reasons for this, but mainly we always quantize along the reduction dimension, which currently ends up being the last dimension. I am happy to change this if you think it is useful to support all layouts, for mxfp8 for example. Also, on B200 only the TN layout is supported for nvfp4 and mxfp4.
- There are some changes to cublas_gemm.cpp: I grouped all common cuBLAS-related functions into a separate helper class in cublas_utils.cpp.
- We almost certainly want to add batching in the future, but for simplicity batch_count = 1 for now.
- For nvfp4 the output matches exactly for every tested shape; any remaining differences I attribute to accumulation on tensor cores or other numerical details we do not control.
What this PR lacks (intentionally, because I first want to make sure the rest of the API looks reasonable):

- addmm -- basically c is always nullptr.
- nn.QQLinear.
- nn.Linear.to_qqlinear, or a similar method to cast to nn.QQLinear (naming is questionable).

Examples are in python/examples/qqmm.py. Happy to iterate and change anything here!
To give a rough estimate of the quantization + scales-packing overhead: for (M=4*4096, N=9728, K=2560), ~72% of the time is the matmul. Scales repacking takes about half of the total quantization time.
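A rough way to reproduce this kind of breakdown is plain wall-clock timing of each stage. The sketch below uses numpy stand-ins at a toy shape; the staged lambdas are placeholders, and the real measurement would time fp_quantize, the scale repacking, and qqmm itself at the quoted (M=4*4096, N=9728, K=2560) shape.

```python
# Sketch of apportioning time between quantization and matmul.
# The stages here are numpy stand-ins at a toy shape; substitute
# the real MLX ops and the quoted shape to reproduce the ~72%
# matmul figure above on your hardware.
import time
import numpy as np

def avg_time(fn, reps=10):
    fn()                                   # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

M, N, K = 256, 512, 128
x = np.random.randn(M, K).astype(np.float32)
w = np.random.randn(N, K).astype(np.float32)

t_quant = avg_time(lambda: np.abs(x).max(axis=-1))  # stand-in: fp_quantize
t_mm = avg_time(lambda: x @ w.T)                    # stand-in: qqmm
frac = t_mm / (t_mm + t_quant)
print(f"matmul fraction of total: {frac:.0%}")
```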