
qqmm #2789

Merged
awni merged 50 commits into ml-explore:main from nastya236:qq-matmul
Dec 16, 2025

Conversation


@nastya236 nastya236 commented Nov 18, 2025

This PR adds a new operation mx.qqmm. The current structure is probably neither optimal nor final.

General comment

  1. For inference we want to support: qqmm(quantized weights, bf16 activations).
  2. For training (vjp) we unfortunately still need bf16 weights for two reasons:
    • We currently do not have 2D scaling for nvfp4, so we need to transpose and quantize again along a different dimension.
    • For mxfp8, the recommended recipe is to quantize with 1D blocks and keep two views of the weights (normal and transposed).

Therefore, mx.qqmm supports: (non-quantized x, non-quantized w) or (non-quantized x, quantized w, scales_w).
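For intuition, here is a minimal NumPy sketch of the semantics: both operands are quantized along the reduction (last) dimension with 1D block scales, then multiplied in TN layout. The block size, the int8 stand-in format, and the helper names (`block_quantize`, `qqmm_ref`) are illustrative assumptions, not the actual MLX kernels:

```python
import numpy as np

BLOCK = 32  # assumed 1D scaling block size along the reduction axis

def block_quantize(a, block=BLOCK):
    """Quantize along the last axis, one scale per block (int8 stand-in)."""
    *lead, k = a.shape
    blocks = a.reshape(*lead, k // block, block)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q.reshape(a.shape), scales.squeeze(-1)

def dequantize(q, scales, block=BLOCK):
    """Expand the per-block scales back over the last axis."""
    *lead, k = q.shape
    blocks = q.reshape(*lead, k // block, block).astype(np.float32)
    return (blocks * scales[..., None]).reshape(q.shape)

def qqmm_ref(x, w):
    """Reference semantics in TN layout: x is (M, K), w is (N, K),
    both quantized along K (the reduction dim); result is (M, N)."""
    qx, sx = block_quantize(x)
    qw, sw = block_quantize(w)
    return dequantize(qx, sx) @ dequantize(qw, sw).T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((8, 64)).astype(np.float32)
out = qqmm_ref(x, w)
assert out.shape == (4, 8)
assert np.allclose(out, x @ w.T, atol=0.5)  # close to the exact matmul
```

For nvfp4/mxfp8 the element format, scale format, and block size all differ, but the quantize-along-the-reduction-dimension structure is the same.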

Details

  1. scales are repacked on every call for both weights and activations. In the future, we probably want to:

    • Avoid repacking weight scales for inference.
    • Fuse quantization and repacking, and directly pack into swizzled layout in fp_quantize.
  2. Batched qqmm is currently not supported; inputs must be 2D. For now it is implemented this way because:

    • CUBLASLT_BATCH_MODE_STRIDED is not supported for scales.
    • CUBLASLT_BATCH_MODE_POINTER_ARRAY is not supported for arrays with block scaling.
  3. qqmm is always executed in TN layout (transpose = True).
    There are several reasons for this, but the main one is that we always quantize along the reduction dimension, which currently ends up being the last dimension. I am happy to change this if you think it is useful to support all layouts, for mxfp8 for example. Also, on B200 only the TN layout is supported for nvfp4 and mxfp4.

  4. There are some changes to cublas_gemm.cpp: I grouped all common cuBLAS-related functions into a separate helper class in cublas_utils.cpp.
    We almost certainly want to add batching in the future, but for simplicity batch_count = 1 for now.
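To make the TN-layout point concrete: in TN, A is (M, K) and B is (N, K), so the reduction axis K is the last axis of both operands, and 1D quantization along the last dimension covers weights and activations alike without an extra transpose. A plain NumPy shape sketch (illustrative only):

```python
import numpy as np

M, N, K = 4, 8, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((M, K)).astype(np.float32)  # activations: (M, K)
w = rng.standard_normal((N, K)).astype(np.float32)  # weights:     (N, K)

# TN layout: the GEMM transposes the second operand, so the reduction
# axis K is the LAST axis of both inputs -- exactly the axis the 1D
# block scales are computed along.
out = x @ w.T
assert out.shape == (M, N)
assert np.allclose(out, np.einsum("mk,nk->mn", x, w))
```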

Regarding numerical accuracy:

  • For nvfp4 the output matches exactly for every tested shape.
  • The difference is not structured: there is no clear pattern, and the indices of the affected elements change with the seed.
  • The mismatch is always exactly 1 ULP.

Therefore, I attribute this to differences in accumulation on tensor cores or other numerical details we do not control.
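As a side note on methodology, a 1-ULP mismatch between two float32 arrays can be verified by comparing their bit patterns; the helper below (`ulp_distance` is a hypothetical name, not part of this PR) is one way to do it in NumPy:

```python
import numpy as np

def ulp_distance(a, b):
    """Element-wise ULP distance between two float32 arrays."""
    def ordered(x):
        # Reinterpret the float32 bit pattern as an integer, then remap so
        # the integers are monotonically ordered: a negative float maps to
        # minus its magnitude bits, so adjacent floats always differ by 1.
        i = np.asarray(x, dtype=np.float32).view(np.int32).astype(np.int64)
        return np.where(i < 0, -(i & 0x7FFFFFFF), i)
    return np.abs(ordered(a) - ordered(b))

a = np.array([1.0, 0.5, -3.0], dtype=np.float32)
b = np.nextafter(a, np.float32(np.inf))  # next representable float upward
assert np.all(ulp_distance(a, b) == 1)
assert np.all(ulp_distance(a, a) == 0)
```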

What this PR deliberately lacks (because I first want to make sure the rest of the API looks reasonable):

  1. addmm -- currently c is always nullptr
  2. nn.QQLinear
  3. nn.Linear.to_qqlinear -- or a similar method to convert to nn.QQLinear (the naming is open to discussion)

Examples are in python/examples/qqmm.py.
Happy to iterate and change anything here!


To give a rough estimate of the quantization + scales-packing overhead:
For (M=4*4096, N=9728, K=2560), the matmul accounts for ~72% of the total time, and scales repacking takes roughly half of the total quantization time.
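Spelled out, those fractions imply the following breakdown (illustrative arithmetic only, using the figures from the profile above):

```python
# Figures from the profile: matmul is ~72% of end-to-end time, and
# scales repacking is ~half of the remaining quantization overhead.
matmul = 0.72
overhead = 1.0 - matmul      # quantization + scales packing: ~28%
repacking = overhead / 2     # scales repacking alone: ~14% of total

assert round(overhead, 2) == 0.28
assert round(repacking, 2) == 0.14
```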

[Screenshot: profiling breakdown, 2025-12-15]

@nastya236 nastya236 marked this pull request as draft November 18, 2025 18:43
@nastya236 nastya236 changed the title from qqmm to [WIP] qqmm Nov 18, 2025
@awni
Copy link
Member

awni commented Dec 16, 2025

This is looking awesome! I left a few more comments. Could you address, rebase and then we can run the tests and try and get it merged?


@awni awni left a comment


LGTM!

@awni awni merged commit 4cf5b29 into ml-explore:main Dec 16, 2025
12 checks passed