
Conversation

@HuyNguyen-hust (Contributor) commented Mar 12, 2024

This PR proposes a small change to the current RoPE embedding kernel:

  • The current implementation launches one block per head on axis 1, so every block has to reload the same sin/cos values, which is inefficient.
  • Reorganize the grid so that on axis 1, instead of launching one block per head, I launch one block per group of heads (4-8 heads). That makes it possible to load sin/cos only once and reuse them to compute all the heads inside that block (see the sketch just below this list).
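
For concreteness, here is a minimal Triton sketch of the grouped-grid idea. It is illustrative only, not the merged unsloth kernel verbatim: the names (`rope_grouped`, `GROUP_SIZE`) and the in-place, Q-only layout are assumptions of the sketch.

```python
import triton
import triton.language as tl

@triton.jit
def rope_grouped(Q, Q_stride, cos, cos_stride, sin, sin_stride,
                 seqlen, head_dim: tl.constexpr, n_heads: tl.constexpr,
                 GROUP_SIZE: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    # Axis 0: one program per (batch, position) row.
    # Axis 1: one program per GROUP of heads, not per head.
    row   = tl.program_id(0)
    group = tl.program_id(1)
    half  = head_dim // 2
    offs  = tl.arange(0, BLOCK_SIZE)   # BLOCK_SIZE: power of two >= half
    mask  = offs < half

    # Load sin/cos for this position once per group ...
    pos  = row % seqlen
    cos1 = tl.load(cos + pos * cos_stride + offs, mask=mask, other=0.0)
    sin1 = tl.load(sin + pos * sin_stride + offs, mask=mask, other=0.0)

    # ... and reuse them for every head in the group.
    for k in range(GROUP_SIZE):
        head = group * GROUP_SIZE + k
        m    = mask & (head < n_heads)   # guard a possibly partial last group
        base = row * Q_stride + head * head_dim
        Q1 = tl.load(Q + base + offs,        mask=m, other=0.0)
        Q2 = tl.load(Q + base + half + offs, mask=m, other=0.0)
        # Rotate the two halves in place:
        # (q1, q2) -> (q1*cos - q2*sin, q2*cos + q1*sin)
        tl.store(Q + base + offs,        Q1 * cos1 - Q2 * sin1, mask=m)
        tl.store(Q + base + half + offs, Q2 * cos1 + Q1 * sin1, mask=m)
```

The launch grid then shrinks on axis 1 from `(n_rows, n_heads)` to `(n_rows, triton.cdiv(n_heads, GROUP_SIZE))`.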

Benchmark with batch_size=4, head_dim=128, n_heads=32 ("// 2" means BLOCK_SIZE = head_dim // 2; otherwise BLOCK_SIZE = head_dim):

[Figure: benchmark of the original kernel vs. the grouped kernel under both BLOCK_SIZE settings]

The figure indicates that my kernel is more sensitive to BLOCK_SIZE.
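
(A benchmark like this can be timed with Triton's built-in helper. The harness below is a hypothetical sketch reusing `rope_grouped` from the sketch above, not the actual script behind the figure; the seqlen and GROUP_SIZE values are assumptions.)

```python
import torch
import triton
import triton.testing

# Settings mirroring the benchmark: batch_size=4, head_dim=128, n_heads=32.
batch, seqlen, n_heads, head_dim = 4, 2048, 32, 128
Q   = torch.randn(batch * seqlen, n_heads, head_dim, device="cuda")
cos = torch.randn(seqlen, head_dim // 2, device="cuda")
sin = torch.randn(seqlen, head_dim // 2, device="cuda")

GROUP_SIZE = 4
grid = (batch * seqlen, triton.cdiv(n_heads, GROUP_SIZE))

# Compare the two BLOCK_SIZE settings from the figure; do_bench reports ms.
for BLOCK_SIZE in (head_dim // 2, head_dim):
    ms = triton.testing.do_bench(lambda: rope_grouped[grid](
        Q, Q.stride(0), cos, cos.stride(0), sin, sin.stride(0),
        seqlen, head_dim=head_dim, n_heads=n_heads,
        GROUP_SIZE=GROUP_SIZE, BLOCK_SIZE=BLOCK_SIZE))
    print(f"BLOCK_SIZE={BLOCK_SIZE}: {ms:.3f} ms")
```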

@danielhanchen (Contributor) commented

Thanks a lot, @HuyNguyen-hust! As per our discussion on Discord - I just want to say thank you again - super appreciate this! Will do some tests on my end and I'll expedite this PR!

@danielhanchen (Contributor) commented

@HuyNguyen-hust I tested the kernel! I can confirm RoPE itself should be faster. Sadly, the effect on a full training run is less pronounced: per PyTorch's profiler, RoPE itself now takes around 1% of the total runtime, with matrix multiplications taking the bulk of the time. DPO, for example - with your RoPE fix: 1553 seconds; original: 1542 seconds. So within the margin of error. This was on a Colab T4, so I'm pretty sure A100s see more noticeable effects.

However, your kernel works absolute wonders once long sequence lengths come into play! The RoPE kernel then creeps up to around 2-3% of the total runtime, so the savings are well worth it!

Thanks so much for this wonderful contribution - I've added it in! :)

I'll probably play around with the group size - it seems like this might be an auto-tunable number (see the sketch below)!
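
(For reference, Triton's autotuner can search over such a meta-parameter directly. A hypothetical sketch building on `rope_grouped` from the PR description above; the config list, key choice, and tensor shapes are illustrative, not what unsloth ships.)

```python
import torch
import triton

# Let the autotuner pick GROUP_SIZE; the candidates and num_warps are guesses.
rope_grouped_tuned = triton.autotune(
    configs=[triton.Config({"GROUP_SIZE": g}, num_warps=4) for g in (2, 4, 8)],
    key=["seqlen"],   # re-benchmark the configs when seqlen changes
)(rope_grouped)       # `rope_grouped` from the sketch in the PR description

batch, seqlen, n_heads, head_dim = 4, 2048, 32, 128
Q   = torch.randn(batch * seqlen, n_heads, head_dim, device="cuda")
cos = torch.randn(seqlen, head_dim // 2, device="cuda")
sin = torch.randn(seqlen, head_dim // 2, device="cuda")

# GROUP_SIZE now comes from the winning config, so it is dropped from the
# call, and the launch grid is built from the chosen config via META:
grid = lambda META: (batch * seqlen, triton.cdiv(n_heads, META["GROUP_SIZE"]))
rope_grouped_tuned[grid](Q, Q.stride(0), cos, cos.stride(0), sin, sin.stride(0),
                         seqlen, head_dim=head_dim, n_heads=n_heads,
                         BLOCK_SIZE=head_dim // 2)
```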

@danielhanchen merged commit 809bdbe into unslothai:main on Mar 15, 2024
@chiennv2000 commented

awesome @HuyNguyen-hust, congrats on your great work!

@hieule88 commented Apr 9, 2024

awesome @HuyNguyen-hust, congrats on your great work!

@mohsen202 commented

thanks

@namnh194 commented

cool :O

@ngocbh commented Sep 21, 2024

Congrats @HuyNguyen-hust! Great contribution!
