Conversation

@Blaizzy (Owner) commented Jun 15, 2025

This PR introduces a fused QKV (Query-Key-Value) projection in the attention module for BitNet-1.58-2B on MLX. Fusing the three projections into a single matmul improves both prompt processing and generation speed by roughly 9% (benchmarks below).

Key Changes:

  • Added support for fused QKV projection in the attention layer.
  • Updated the model forward pass to conditionally take the fused path when enabled (see the sketch below).
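
A minimal sketch of the idea, using plain `nn.Linear` stand-ins for brevity (the actual BitNet layers are ternary BitLinear projections); the class and argument names (`FusedAttention`, `use_fused_qkv`, `n_kv_heads`) are illustrative, not the exact identifiers in this PR:

```python
# Minimal sketch of a fused QKV projection in MLX. Names are
# illustrative; nn.Linear stands in for BitNet's quantized projections.
import mlx.core as mx
import mlx.nn as nn


class FusedAttention(nn.Module):
    def __init__(self, dims: int, n_heads: int, n_kv_heads: int,
                 use_fused_qkv: bool = True):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = dims // n_heads
        self.use_fused_qkv = use_fused_qkv

        q_out = n_heads * self.head_dim
        kv_out = n_kv_heads * self.head_dim
        if use_fused_qkv:
            # One matmul covering Q, K, and V instead of three separate ones.
            self.qkv_proj = nn.Linear(dims, q_out + 2 * kv_out, bias=False)
        else:
            self.q_proj = nn.Linear(dims, q_out, bias=False)
            self.k_proj = nn.Linear(dims, kv_out, bias=False)
            self.v_proj = nn.Linear(dims, kv_out, bias=False)

    def __call__(self, x: mx.array):
        if self.use_fused_qkv:
            # Fused path: single projection, then split along the last axis.
            qkv = self.qkv_proj(x)
            q_end = self.n_heads * self.head_dim
            k_end = q_end + self.n_kv_heads * self.head_dim
            q, k, v = mx.split(qkv, [q_end, k_end], axis=-1)
        else:
            q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        return q, k, v
```

Existing per-projection weights can be folded into the fused layer by concatenating them along the output axis, e.g. `mx.concatenate([wq, wk, wv], axis=0)`, since MLX `nn.Linear` stores weights as `(output_dims, input_dims)`. One large matmul tends to schedule better on Apple silicon than three small ones, which is presumably where the speedup comes from.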

Benchmarked performance improvements (M3 Max):

  • Prompt Processing: ↑ from 128.77 to 139.94 tokens/sec
  • Generation Speed: ↑ from 67.05 to 73.12 tokens/sec
  • MLX fused QKV vs. BitNet 4T: 27.6% faster generation, 137% faster prompt processing (reproduction sketch below)
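
For context, a hypothetical harness for reproducing the tokens/sec figures, assuming the model is run through the mlx-lm Python API; the model path is a placeholder, not necessarily the checkpoint benchmarked above:

```python
# Hypothetical benchmark harness using the mlx-lm Python API; the model
# path is a placeholder, not necessarily the checkpoint used above.
from mlx_lm import load, generate

model, tokenizer = load("path/to/bitnet-1.58-2b-mlx")  # assumed local path
generate(
    model,
    tokenizer,
    prompt="Write a short story about a robot learning to paint.",
    max_tokens=256,
    verbose=True,  # verbose output reports prompt and generation tokens/sec
)
```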

@awni force-pushed the pc/add-bitnet branch 2 times, most recently from 00842d2 to 7e1666b on July 2, 2025 at 20:30.