
qqmm #2789

Merged
awni merged 50 commits into ml-explore:main from nastya236:qq-matmul
Dec 16, 2025

Conversation


@nastya236 nastya236 commented Nov 18, 2025

This PR adds a new operation mx.qqmm. The current structure is probably neither optimal nor final.

General comment

  1. For inference we want to support: qqmm(quantized weights, bf16 activations).
  2. For training (vjp) we unfortunately still need bf16 weights for two reasons:
    • We currently do not have 2D scaling for nvfp4, so we need to transpose and quantize again along a different dimension.
    • For mxfp8, the recommended recipe is to quantize with 1D blocks and keep two views of the weights (normal and transposed).

Therefore, mx.qqmm supports: (non-quantized x, non-quantized w) or (non-quantized x, quantized w, scales_w).
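For intuition, here is a minimal NumPy sketch of the semantics: both operands are quantized along the reduction (last) dimension with 1D block scales, then multiplied in TN layout. The block size, the int8 stand-in format, and the helper names (`block_quantize`, `qqmm_ref`) are illustrative assumptions, not the actual MLX kernels:

```python
import numpy as np

BLOCK = 32  # assumed 1D scaling block size along the reduction axis

def block_quantize(a, block=BLOCK):
    """Quantize along the last axis, one scale per block (int8 stand-in)."""
    *lead, k = a.shape
    blocks = a.reshape(*lead, k // block, block)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q.reshape(a.shape), scales.squeeze(-1)

def dequantize(q, scales, block=BLOCK):
    """Expand the per-block scales back over the last axis."""
    *lead, k = q.shape
    blocks = q.reshape(*lead, k // block, block).astype(np.float32)
    return (blocks * scales[..., None]).reshape(q.shape)

def qqmm_ref(x, w):
    """Reference semantics in TN layout: x is (M, K), w is (N, K),
    both quantized along K (the reduction dim); result is (M, N)."""
    qx, sx = block_quantize(x)
    qw, sw = block_quantize(w)
    return dequantize(qx, sx) @ dequantize(qw, sw).T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((8, 64)).astype(np.float32)
out = qqmm_ref(x, w)
assert out.shape == (4, 8)
assert np.allclose(out, x @ w.T, atol=0.5)  # close to the exact matmul
```

For nvfp4/mxfp8 the element format, scale format, and block size all differ, but the quantize-along-the-reduction-dimension structure is the same.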

Details

  1. scales are repacked on every call for both weights and activations. In the future, we probably want to:

    • Avoid repacking weight scales for inference.
    • Fuse quantization and repacking, and directly pack into swizzled layout in fp_quantize.
  2. Batched qqmm is currently not supported; inputs must be 2D. For now it is implemented this way because:

    • CUBLASLT_BATCH_MODE_STRIDED is not supported for scales.
    • CUBLASLT_BATCH_MODE_POINTER_ARRAY is not supported for arrays with block scaling.
  3. qqmm is always executed in TN layout (transpose = True).
    There are several reasons for this, but the main one is that we always quantize along the reduction dimension, which currently ends up being the last dimension. I am happy to change this if you think it is useful to support all layouts, for mxfp8 for example. Also, on B200 only the TN layout is supported for nvfp4 and mxfp4.

  4. There are some changes to cublas_gemm.cpp: I grouped all common cuBLAS-related functions into a separate helper class in cublas_utils.cpp.
    We almost certainly want to add batching in the future, but for simplicity batch_count = 1 for now.
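To make the TN-layout point concrete: in TN, A is (M, K) and B is (N, K), so the reduction axis K is the last axis of both operands, and 1D quantization along the last dimension covers weights and activations alike without an extra transpose. A plain NumPy shape sketch (illustrative only):

```python
import numpy as np

M, N, K = 4, 8, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((M, K)).astype(np.float32)  # activations: (M, K)
w = rng.standard_normal((N, K)).astype(np.float32)  # weights:     (N, K)

# TN layout: the GEMM transposes the second operand, so the reduction
# axis K is the LAST axis of both inputs -- exactly the axis the 1D
# block scales are computed along.
out = x @ w.T
assert out.shape == (M, N)
assert np.allclose(out, np.einsum("mk,nk->mn", x, w))
```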

Regarding numerical accuracy:

  • For nvfp4 the output matches exactly for every tested shape.
  • The difference is not structured: there is no clear pattern, and the indices of the affected elements change with the seed.
  • The mismatch is always exactly 1 ULP.

Therefore, I attribute this to differences in accumulation on tensor cores or other numerical details we do not control.
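As a side note on methodology, a 1-ULP mismatch between two float32 arrays can be verified by comparing their bit patterns; the helper below (`ulp_distance` is a hypothetical name, not part of this PR) is one way to do it in NumPy:

```python
import numpy as np

def ulp_distance(a, b):
    """Element-wise ULP distance between two float32 arrays."""
    def ordered(x):
        # Reinterpret the float32 bit pattern as an integer, then remap so
        # the integers are monotonically ordered: a negative float maps to
        # minus its magnitude bits, so adjacent floats always differ by 1.
        i = np.asarray(x, dtype=np.float32).view(np.int32).astype(np.int64)
        return np.where(i < 0, -(i & 0x7FFFFFFF), i)
    return np.abs(ordered(a) - ordered(b))

a = np.array([1.0, 0.5, -3.0], dtype=np.float32)
b = np.nextafter(a, np.float32(np.inf))  # next representable float upward
assert np.all(ulp_distance(a, b) == 1)
assert np.all(ulp_distance(a, a) == 0)
```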

What this PR deliberately lacks (because I first want to make sure the rest of the API looks reasonable):

  1. addmm -- currently c is always nullptr
  2. nn.QQLinear
  3. nn.Linear.to_qqlinear -- or a similar method to convert to nn.QQLinear (the naming is open to discussion)

Examples are in python/examples/qqmm.py.
Happy to iterate and change anything here!


To give a rough estimate of the quantization + scales-packing overhead:
For (M=4*4096, N=9728, K=2560), the matmul accounts for ~72% of the total time, and scales repacking takes roughly half of the total quantization time.
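Spelled out, those fractions imply the following breakdown (illustrative arithmetic only, using the figures from the profile above):

```python
# Figures from the profile: matmul is ~72% of end-to-end time, and
# scales repacking is ~half of the remaining quantization overhead.
matmul = 0.72
overhead = 1.0 - matmul      # quantization + scales packing: ~28%
repacking = overhead / 2     # scales repacking alone: ~14% of total

assert round(overhead, 2) == 0.28
assert round(repacking, 2) == 0.14
```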

[Screenshot: profiling breakdown, 2025-12-15]

@nastya236 nastya236 marked this pull request as draft November 18, 2025 18:43
@nastya236 nastya236 changed the title from qqmm to [WIP] qqmm Nov 18, 2025
@awni
Copy link
Member

awni commented Dec 16, 2025

This is looking awesome! I left a few more comments. Could you address, rebase and then we can run the tests and try and get it merged?


@awni awni left a comment


LGTM!

@awni awni merged commit 4cf5b29 into ml-explore:main Dec 16, 2025
12 checks passed