Conversation

@Blaizzy (Owner) commented Jun 15, 2025

This PR introduces a fused QKV (Query-Key-Value) projection in the attention module for BitNet-1.58-2B on MLX. Fusing the three projections into a single matmul improves both prompt processing and generation speed by roughly 9% (benchmarks below).

Key Changes:

  • Added support for fused QKV projection in the attention layer.
  • Updated the model forward pass to conditionally take the fused path when enabled (see the sketch below).
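
A minimal sketch of the idea, using plain `nn.Linear` stand-ins for brevity (the actual BitNet layers are ternary BitLinear projections); the class and argument names (`FusedAttention`, `use_fused_qkv`, `n_kv_heads`) are illustrative, not the exact identifiers in this PR:

```python
# Minimal sketch of a fused QKV projection in MLX. Names are
# illustrative; nn.Linear stands in for BitNet's quantized projections.
import mlx.core as mx
import mlx.nn as nn


class FusedAttention(nn.Module):
    def __init__(self, dims: int, n_heads: int, n_kv_heads: int,
                 use_fused_qkv: bool = True):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = dims // n_heads
        self.use_fused_qkv = use_fused_qkv

        q_out = n_heads * self.head_dim
        kv_out = n_kv_heads * self.head_dim
        if use_fused_qkv:
            # One matmul covering Q, K, and V instead of three separate ones.
            self.qkv_proj = nn.Linear(dims, q_out + 2 * kv_out, bias=False)
        else:
            self.q_proj = nn.Linear(dims, q_out, bias=False)
            self.k_proj = nn.Linear(dims, kv_out, bias=False)
            self.v_proj = nn.Linear(dims, kv_out, bias=False)

    def __call__(self, x: mx.array):
        if self.use_fused_qkv:
            # Fused path: single projection, then split along the last axis.
            qkv = self.qkv_proj(x)
            q_end = self.n_heads * self.head_dim
            k_end = q_end + self.n_kv_heads * self.head_dim
            q, k, v = mx.split(qkv, [q_end, k_end], axis=-1)
        else:
            q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        return q, k, v
```

Existing per-projection weights can be folded into the fused layer by concatenating them along the output axis, e.g. `mx.concatenate([wq, wk, wv], axis=0)`, since MLX `nn.Linear` stores weights as `(output_dims, input_dims)`. One large matmul tends to schedule better on Apple silicon than three small ones, which is presumably where the speedup comes from.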

Benchmarked performance improvements (M3 Max):

  • Prompt Processing: ↑ from 128.77 to 139.94 tokens/sec
  • Generation Speed: ↑ from 67.05 to 73.12 tokens/sec
  • MLX fused QKV vs. BitNet 4T: 27.6% faster generation, 137% faster prompt processing (reproduction sketch below)
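
For context, a hypothetical harness for reproducing the tokens/sec figures, assuming the model is run through the mlx-lm Python API; the model path is a placeholder, not necessarily the checkpoint benchmarked above:

```python
# Hypothetical benchmark harness using the mlx-lm Python API; the model
# path is a placeholder, not necessarily the checkpoint used above.
from mlx_lm import load, generate

model, tokenizer = load("path/to/bitnet-1.58-2b-mlx")  # assumed local path
generate(
    model,
    tokenizer,
    prompt="Write a short story about a robot learning to paint.",
    max_tokens=256,
    verbose=True,  # verbose output reports prompt and generation tokens/sec
)
```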

@awni force-pushed the pc/add-bitnet branch 2 times, most recently from 00842d2 to 7e1666b on July 2, 2025 at 20:30.