Conversation

@jagrit06 (Member) commented Jan 27, 2025

Proposed changes

  • Build in padding to Winograd kernels
  • Add new fused Winograd kernel
  • Enable weight flipping in Winograd kernels
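
The items above lean on the standard Winograd construction (shown here for F(2x2, 3x3); real kernels typically use larger tiles), so a minimal, framework-free reference sketch may help: transform the 3x3 filter with G, the 4x4 input tile with B^T, multiply elementwise, and transform back with A^T. The C++ below is illustration only, not the MLX Metal kernel; note that it computes cross-correlation, so a flipped-weights variant would reverse the filter along both spatial axes before applying G.

```cpp
// Minimal reference for Winograd F(2x2, 3x3): transform the 3x3 filter with G,
// the 4x4 input tile with B^T, multiply elementwise, transform back with A^T,
// and check against direct 3x3 cross-correlation. Illustration only -- this is
// not the MLX kernel code, just the textbook algorithm such kernels build on.
#include <cstddef>
#include <cstdio>
#include <cmath>

// y = a * b for small fixed-size matrices (dimensions deduced from the arrays).
template <std::size_t M, std::size_t K, std::size_t N>
void matmul(const float (&a)[M][K], const float (&b)[K][N], float (&y)[M][N]) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j) {
      y[i][j] = 0.f;
      for (std::size_t k = 0; k < K; ++k) y[i][j] += a[i][k] * b[k][j];
    }
}

int main() {
  // Winograd transform matrices for F(2x2, 3x3) (Lavin & Gray).
  const float Bt[4][4] = {{1, 0, -1, 0}, {0, 1, 1, 0}, {0, -1, 1, 0}, {0, 1, 0, -1}};
  const float B[4][4]  = {{1, 0, 0, 0}, {0, 1, -1, 1}, {-1, 1, 1, 0}, {0, 0, 0, -1}};
  const float G[4][3]  = {{1, 0, 0}, {.5f, .5f, .5f}, {.5f, -.5f, .5f}, {0, 0, 1}};
  const float Gt[3][4] = {{1, .5f, .5f, 0}, {0, .5f, -.5f, 0}, {0, .5f, .5f, 1}};
  const float At[2][4] = {{1, 1, 1, 0}, {0, 1, -1, -1}};
  const float A[4][2]  = {{1, 0}, {1, 1}, {1, -1}, {0, -1}};

  // Arbitrary 4x4 input tile d and 3x3 filter g.
  const float d[4][4] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 1, 2, 3}, {4, 5, 6, 7}};
  const float g[3][3] = {{1, 0, -1}, {2, 1, 0}, {0, -1, 1}};

  // U = G g G^T (filter transform), V = B^T d B (input transform).
  float Gg[4][3], U[4][4], Btd[4][4], V[4][4];
  matmul(G, g, Gg);   matmul(Gg, Gt, U);
  matmul(Bt, d, Btd); matmul(Btd, B, V);

  // m = U (elementwise *) V, then Y = A^T m A is the 2x2 output tile.
  float m[4][4], Atm[2][4], Y[2][2];
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) m[i][j] = U[i][j] * V[i][j];
  matmul(At, m, Atm); matmul(Atm, A, Y);

  // Direct 3x3 cross-correlation over the same tile for comparison.
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j) {
      float ref = 0.f;
      for (int r = 0; r < 3; ++r)
        for (int s = 0; s < 3; ++s) ref += d[i + r][j + s] * g[r][s];
      std::printf("Y[%d][%d] = %7.3f  direct = %7.3f  |diff| = %g\n",
                  i, j, Y[i][j], ref, std::fabs(Y[i][j] - ref));
    }
  return 0;
}
```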

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@jagrit06 (Member, Author):

This update on its own should not help training benchmarks like CIFAR, since the added kernel does not do well with large batch sizes. Further updates focused on batches should help that. In the meantime, this should improve batch size = 1 workloads and also eliminate at least one copy of the inputs that might otherwise be made for padding.
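
(As an aside on the padding point: the usual way to avoid materializing a padded copy of the input is to fold the zero padding into the load itself. The C++ below is a generic illustration of that trick, with made-up names and an assumed NHWC layout; it is not the actual Metal kernel code.)

```cpp
#include <cstddef>

// Illustration of "built-in" padding: rather than copying the input into a
// zero-padded buffer first, the load helper returns 0 for out-of-range
// coordinates. Names and the NHWC layout are illustrative, not MLX's.
struct InputView {
  const float* data; // NHWC-contiguous input
  int N, H, W, C;

  // Load x[n, h, w, c], treating coordinates outside [0, H) x [0, W) as
  // zero padding instead of reading from a pre-padded copy.
  float load_padded(int n, int h, int w, int c) const {
    if (h < 0 || h >= H || w < 0 || w >= W) return 0.f;
    std::size_t idx = ((std::size_t(n) * H + h) * W + w) * C + c;
    return data[idx];
  }
};
```

A Winograd input transform can then read tiles that straddle the image border without first making a separate padded copy of the input.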

Each row below lists the input shape, the weight shape, the dtype, the convolution parameters, and the measured time; the last column compares performance to PyTorch.

M3 Max Before:
(4, 128, 128, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.343, -21.22%
(4, 128, 128, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 1.013, -26.64%
(256, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.352, -27.20%
(256, 8, 8, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.340, -26.26%
(4, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 2.788, -5.70%
(1, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.074, +47.40%
(1, 16, 16, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.106, -2.51%
(1, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.780, -7.04%
(1, 128, 128, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 2.261, +16.55%
(1, 16, 16, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.157, -26.23%
(1, 16, 16, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 0.274, -27.10%

M3 Max After:
(4, 128, 128, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.342, -27.23%
(4, 128, 128, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 1.014, -26.30%
(256, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.346, -26.63%
(256, 8, 8, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.340, -25.95%
(4, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 2.179, +20.46%
(1, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.052, +108.45%
(1, 16, 16, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.059, +64.23%
(1, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.692, +4.99%
(1, 128, 128, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 1.634, +58.94%
(1, 16, 16, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.082, +51.14%
(1, 16, 16, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 0.158, +23.45%

M2 Ultra Before:
(4, 128, 128, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.271, -27.74%
(4, 128, 128, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.750, -35.50%
(256, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.272, -25.79%
(256, 8, 8, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.241, -22.92%
(4, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 1.972, -23.75%
(1, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.072, +45.15%
(1, 16, 16, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.121, -15.60%
(1, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.720, -32.91%
(1, 128, 128, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 2.147, -24.63%
(1, 16, 16, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.197, -32.59%
(1, 16, 16, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 0.355, -39.01%

M2 Ultra After:
(4, 128, 128, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.271, -27.77%
(4, 128, 128, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.750, -35.65%
(256, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.271, -25.83%
(256, 8, 8, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.241, -23.95%
(4, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 1.650, -8.85%
(1, 16, 16, 32), (32, 3, 3, 32), float32, (1, 1), (1, 1), 0.046, +122.89%
(1, 16, 16, 64), (64, 3, 3, 64), float32, (1, 1), (1, 1), 0.055, +84.06%
(1, 128, 128, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.572, -15.53%
(1, 128, 128, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 1.258, +28.53%
(1, 16, 16, 128), (128, 3, 3, 128), float32, (1, 1), (1, 1), 0.076, +67.96%
(1, 16, 16, 256), (256, 3, 3, 256), float32, (1, 1), (1, 1), 0.139, +55.04%

@awni (Member) commented Jan 27, 2025

I ran this ResNet inference benchmark on M2 Ultra.

Pre/post below.

TL;DR: nice speedup at batch size = 1, slight slowdown at 32+. Is that expected?

Before:

| Batch Size | Images-per-second | Milliseconds-per-image |
| --- | --- | --- |
| 1 | 608.598 | 1.643 |
| 2 | 1147.084 | 0.872 |
| 4 | 1426.277 | 0.701 |
| 8 | 1869.651 | 0.535 |
| 16 | 2133.113 | 0.469 |
| 32 | 2308.229 | 0.433 |
| 64 | 2529.965 | 0.395 |

After:

| Batch Size | Images-per-second | Milliseconds-per-image |
| --- | --- | --- |
| 1 | 747.178 | 1.338 |
| 2 | 1146.119 | 0.873 |
| 4 | 1539.774 | 0.649 |
| 8 | 1921.938 | 0.520 |
| 16 | 2035.330 | 0.491 |
| 32 | 2062.188 | 0.485 |
| 64 | 2106.746 | 0.475 |

@awni (Member) commented Jan 27, 2025

> since the added kernel does not do well with large batch sizes. Further updates focused on batches should help that.

I guess that's what you meant. Should we just dispatch to the old kernel with batch size 32+?

@jagrit06 (Member, Author):

> since the added kernel does not do well with large batch sizes. Further updates focused on batches should help that.
>
> I guess that's what you meant. Should we just dispatch to the old kernel with batch size 32+?

Done! I moved batch sizes 32+ back to the old routing for now.
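
(For illustration, the routing decision amounts to something like the sketch below; the threshold of 32 comes from this exchange, while the struct and function names are hypothetical stand-ins rather than the actual MLX code.)

```cpp
#include <cstdio>

// Hypothetical sketch of batch-size-based routing between the new fused
// Winograd kernel and the previous Winograd path. Names are made up.
struct ConvParams {
  int N, H, W, C, O; // batch, spatial sizes, input channels, output channels
};

// Stand-ins for the two code paths; the real kernels would be launched here.
void winograd_fused(const ConvParams& p)    { std::printf("fused path, N=%d\n", p.N); }
void winograd_explicit(const ConvParams& p) { std::printf("explicit path, N=%d\n", p.N); }

void dispatch_winograd(const ConvParams& p) {
  if (p.N < 32) {
    // Small batches: the fused kernel avoids separate transform launches
    // and the padded input copy.
    winograd_fused(p);
  } else {
    // Large batches: keep the previous routing until the fused kernel
    // handles them well.
    winograd_explicit(p);
  }
}
```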

@jagrit06 requested review from angeloskath and awni and removed the review request for angeloskath on January 28, 2025 at 19:11
@angeloskath (Member) left a comment

Awesome job as always :-) The MMATiles result in very readable kernels.

I left a few comments; the main one is an issue in the op with the routing to matmul, and the rest are either nitpicks or discussion.


// Iterate over C
for (int c = 0; c < params.C; c += BC) {
#define tmp_load_wt_idx(o, h, w, c) h* FA* BC* BO + w* BC* BO + c* BO + o
@angeloskath (Member):

Why not implement these outside the kernel?

Also nitpick: h* FA* BC -> h * FA * BC
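
(For illustration only: the same index arithmetic written as a helper outside the kernel body might look like the sketch below; FA, BC, and BO become template parameters here, and the name simply mirrors the macro.)

```cpp
// Sketch of the suggestion: the index macro as a constexpr helper defined
// outside the kernel. FA, BC, and BO are the tile/block sizes that are
// compile-time constants in the surrounding kernel; names are illustrative.
template <int FA, int BC, int BO>
constexpr int tmp_load_wt_idx(int o, int h, int w, int c) {
  return h * FA * BC * BO + w * BC * BO + c * BO + o;
}

// Hypothetical usage inside the load loop:
//   tmp[tmp_load_wt_idx<FA, BC, BO>(o, h, w, c)] = ...;
```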

@jagrit06 (Member, Author):

These ended up here because I was experimenting with the strides we use for loading, and it's easier to follow with the macro right next to where it's read.

@angeloskath (Member) left a comment

Left a comment on the routing to matmul. Otherwise looks great!

/*B_batch_stride = */ 0,
/*matrix_stride_out = */ 0,
/*copies = */ empty_copies);
}
@angeloskath (Member):

I would say the `if` below needs an `else if`.
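
(Generic illustration of the point, not the actual op code: a bare `if` after the first dispatch can fire again for a case the first branch already handled, so the second check should be an `else if` to keep the paths mutually exclusive.)

```cpp
#include <cstdio>

// Stand-ins for the two dispatch targets; names are made up.
void run_special_path() { std::printf("special path\n"); }
void run_general_path() { std::printf("general path\n"); }

void route(bool special_case_ok, bool general_case_ok) {
  if (special_case_ok) {
    run_special_path();
  } else if (general_case_ok) { // a plain `if` here could dispatch the same
                                // input a second time through the general path
    run_general_path();
  }
}
```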

@jagrit06 merged commit 2dc307f into main on Feb 14, 2025 (5 checks passed).
@jagrit06 deleted the winograd_tmp branch on February 14, 2025 at 21:08.