Winograd Update for Small batches #1803
Conversation
This update on its own should not help training benchmarks like CIFAR, since the added kernel does not do well with large batch sizes; further updates focused on batches should help there. In the meantime, this should improve batch size = 1 workloads and also remove at least one copy of the inputs that might be used for padding. The numbers below compare perf to PyTorch in the last column.
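For context, the transform at the heart of this kernel family can be sketched with the standard 1-D Winograd F(2,3) matrices (this is the textbook math, not MLX's actual Metal code): two outputs of a 3-tap correlation are computed from a 4-element input tile with 4 multiplies instead of 6.

```python
import numpy as np

# Standard F(2,3) input, filter, and output transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    # Transform input and filter, multiply elementwise, fold back.
    return AT @ ((G @ g) * (BT @ d))

# Compare against the direct 3-tap correlation on a sample tile.
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
```

The 2-D kernels nest this transform over both spatial dimensions; the batching behavior discussed below comes from how the per-tile work is mapped onto threadgroups.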
I ran this ResNet inference benchmark on M2 Ultra. Pre/post below. TL;DR: nice speedup on batch size = 1, slight slowdown on 32+. Is that expected?

I guess that's what you meant. Should we just dispatch to the old kernel with batch size 32+?

Done! I moved them back to the old routing for now.
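The routing decision agreed on above can be sketched as a simple threshold dispatch; the function name and the exact cutoff are illustrative assumptions, not MLX's real dispatch code.

```python
# Hypothetical batch-size routing: new fused kernel for small batches,
# original routing for 32+, matching the benchmark results discussed above.
SMALL_BATCH_LIMIT = 32  # illustrative cutoff, not the actual tuned value

def pick_winograd_kernel(batch_size):
    if batch_size < SMALL_BATCH_LIMIT:
        return "fused_winograd"  # wins at batch size = 1
    return "old_winograd"        # amortizes better over large batches
```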
angeloskath
left a comment
Awesome job as always :-) The MMATiles result in very readable kernels.
I left a few comments with the main one being an issue in the op with routing to matmul and the rest are either nitpicks or discussion.
// Iterate over C
for (int c = 0; c < params.C; c += BC) {
#define tmp_load_wt_idx(o, h, w, c) h* FA* BC* BO + w* BC* BO + c* BO + o
Why not implement these outside the kernel?
Also nitpick: h* FA* BC -> h * FA * BC
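To make the layout the macro encodes explicit, here is a Python model of `tmp_load_wt_idx` (the tile sizes are made-up examples, not the kernel's actual constants): it linearizes `(h, w, c, o)` into a flat offset for an `[H][FA][BC][BO]` staging tile, with `o` as the contiguous dimension.

```python
FA, BC, BO = 6, 8, 4  # illustrative tile dimensions

def tmp_load_wt_idx(o, h, w, c):
    # Same arithmetic as the kernel macro, with spacing normalized.
    return h * FA * BC * BO + w * BC * BO + c * BO + o

# Walking the dims in (h, w, c, o) order visits offsets 0, 1, 2, ...,
# confirming o is the fastest-moving dimension in this layout.
offsets = [tmp_load_wt_idx(o, h, w, c)
           for h in range(2)
           for w in range(FA)
           for c in range(BC)
           for o in range(BO)]
```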
These ended up here because I was experimenting with the strides we load things with, and it's easier to follow when the indexing lives right next to the code that reads it.
1cdd806 to
2f88ec8
angeloskath
left a comment
Left a comment on the routing to matmul. Otherwise looks great!
/*B_batch_stride = */ 0,
/*matrix_stride_out = */ 0,
/*copies = */ empty_copies);
}
I would say the if below needs an else if
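A toy illustration of why the `else if` matters here (condition names are hypothetical, not the op's real predicates): with a plain `if`, a conv that qualifies for the matmul fast path would also fall into the next branch and get dispatched twice.

```python
def route(is_matmul_case, winograd_ok):
    dispatched = []
    if is_matmul_case:
        dispatched.append("matmul")
    elif winograd_ok:  # a plain `if` here would double-dispatch
        dispatched.append("winograd")
    return dispatched
```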
This reverts commit 2dc307f.
* Build in padding to Winograd kernels
* Add new fused Winograd kernel
* Enable weight flipping in Winograd kernels
Proposed changes
Checklist
Put an x in the boxes that apply.

* pre-commit run --all-files to format my code / installed pre-commit prior to committing changes