[AOTAutograd] tweak min-cut partitioner to avoid saving softmax output #126348
Conversation
Please add a property-based test in test/inductor/test_perf.py. I'd prefer that to the test currently in the PR. Also, obviously, do a perf run.
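For reference, a hypothetical sketch of the kind of test being asked for. The function, shapes, and inputs below are illustrative, and `count_numel_train` is assumed to be the memory-traffic helper that other tests in test/inductor/test_perf.py use; this is not the test that landed.

```python
# Hypothetical sketch of a memory-traffic test for linear + cross entropy.
# Assumes a CUDA device and the count_numel_train helper from test_perf.py.
import torch
import torch.nn.functional as F

def f(x, weight, targets):
    logits = torch.mm(x, weight.t())     # softmax_input
    return F.cross_entropy(logits, targets)

B, C, V = 64, 768, 50257
inp = (
    torch.randn(B, C, device="cuda", requires_grad=True),
    torch.randn(V, C, device="cuda", requires_grad=True),
    torch.randint(0, V, (B,), device="cuda"),
)
# The assertion would pin the total elements read/written, so a regression
# that starts saving softmax_output again shows up as a changed count:
# self.assertExpectedInline(count_numel_train(f, *inp), """<expected count>""")
```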
With the new heuristics, the backward pass runs slower and we end up with roughly neutral perf overall for llm.c. The reason is that the kernel computing the gradient of the softmax input in the backward pass picks a sub-optimal triton config. #126477 fixes that, and now we get the ~4ms saving for llm.c estimated in the summary.
Perf run shows a 5 second compilation time regression for TIMM (link). I'll need to debug where that comes from.
re-request when ready
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
C = 768
V = 50257

linear = nn.Linear(C, V, bias=False, dtype=torch.bfloat16).cuda()
Hi, may I suggest we replace "cuda" with GPU_TYPE in this case? XPU also meets the has_triton requirement. Alternatively, you may skip the test if not torch.cuda.is_available(). Thanks.
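A minimal sketch of the reviewer's suggestion, assuming GPU_TYPE is importable from torch.testing._internal.inductor_utils as in other inductor tests (this is an illustration, not the change that was committed):

```python
# Device-agnostic version of the snippet above: GPU_TYPE resolves to "cuda"
# or "xpu" depending on the available accelerator, so XPU runs the test too.
import torch
import torch.nn as nn
from torch.testing._internal.inductor_utils import GPU_TYPE

C, V = 768, 50257
linear = nn.Linear(C, V, bias=False, dtype=torch.bfloat16).to(GPU_TYPE)
```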
Stack from ghstack (oldest at bottom):
Right now the linear + cross entropy loss operation (usually the last part of a transformer model) does the following:
1. run matmul to get softmax_input
2. load softmax_input to compute the max per row
3. load softmax_input to compute the sum per row
4. load softmax_input, normalize it, and save the result to softmax_output
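For illustration only (not code from the PR), the four steps correspond roughly to the following decomposition; the shapes and helper name are hypothetical:

```python
# Decomposed linear + cross entropy, mirroring steps 1-4 above.
import torch
import torch.nn.functional as F

def decomposed_linear_cross_entropy(x, weight, targets):
    softmax_input = x @ weight.t()                                       # 1. matmul
    row_max = softmax_input.amax(dim=-1, keepdim=True)                   # 2. max per row
    row_sum = torch.exp(softmax_input - row_max).sum(-1, keepdim=True)   # 3. sum per row
    softmax_output = torch.exp(softmax_input - row_max) / row_sum        # 4. normalize + materialize
    # NLLLoss only gathers one probability per row out of softmax_output.
    picked = softmax_output.gather(-1, targets.unsqueeze(-1))
    return -torch.log(picked).mean()

# Sanity check against the fused op (fp32, small shapes):
x = torch.randn(8, 16)
w = torch.randn(32, 16)
t = torch.randint(0, 32, (8,))
ref = F.cross_entropy(x @ w.t(), t)
assert torch.allclose(decomposed_linear_cross_entropy(x, w, t), ref, atol=1e-5)
```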
Step 4 is inefficient since
a. in the fwd pass, only a small slice of the softmax_output tensor is needed to compute NLLLoss. Materializing the whole tensor is overkill.
b. in the backward pass, we need the whole softmax_output, but it can be recomputed from softmax_input.
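A hedged sketch of why the recomputation is cheap (this is not the partitioner's actual recompute graph): given softmax_input plus the per-row max and sum already produced in steps 2 and 3, softmax_output is a single elementwise pass.

```python
import torch

def recompute_softmax(softmax_input, row_max, row_sum):
    # row_max and row_sum have shape (N, 1); softmax_input has shape (N, V).
    return torch.exp(softmax_input - row_max) / row_sum

# Sanity check against torch.softmax:
x = torch.randn(4, 8)
m = x.amax(dim=-1, keepdim=True)
s = torch.exp(x - m).sum(dim=-1, keepdim=True)
assert torch.allclose(recompute_softmax(x, m, s), torch.softmax(x, dim=-1))
```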
If we skip saving softmax_output, we get a perf win since this is the largest tensor in the network. For llm.c, the size is batch_size * sequence_length * vocab_size * item_size ~= 32 * 1024 * 50257 * 2 ~= 3GB. Simply reading or writing such a large tensor takes ~2ms on an A100. If we recompute softmax_output, we save one load of softmax_input and one store of softmax_output, which results in ~4ms of savings.
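A back-of-the-envelope check of those numbers (the ~1.5 TB/s effective A100 bandwidth below is an assumption; the exact figure depends on the kernel):

```python
B, T, V, itemsize = 32, 1024, 50257, 2        # bf16
bytes_total = B * T * V * itemsize            # ~3.3e9 bytes, i.e. ~3 GB
ms_per_pass = bytes_total / 1.5e12 * 1e3      # ~2.2 ms to read or write it once
print(bytes_total / 1e9, ms_per_pass)         # skipping 1 load + 1 store -> ~4 ms
```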
To avoid saving softmax_output, we need to make sure the min-cut partitioner decides to recompute it from softmax_input and the max/sum tensors (which are small) computed in steps 2 and 3. This is not happening currently since the min-cut partitioner over-estimates the cost of recomputation.
The fix, suggested by @Chillee, is to let `dist_from_bw` play a less important role.
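The actual change lives in torch/_functorch/partitioners.py; the sketch below is only an illustration of the kind of weighting involved, with a hypothetical formula and names, not the real implementation:

```python
def node_weight(mem_size, dist_from_bw, dist_cap=100):
    # Hypothetical heuristic: recomputation cost grows exponentially with a
    # node's distance from the backward graph, capped at dist_cap.
    return mem_size * (1.1 ** max(min(dist_from_bw, dist_cap), 1))

# With a large cap, a ~3GB softmax_output far from the backward graph looks
# prohibitively expensive to recompute, so the cut saves it. Shrinking the
# influence of dist_from_bw (e.g. dist_cap=3) lets mem_size dominate, and the
# partitioner prefers recomputing the huge tensor from the small saved
# max/sum tensors.
print(node_weight(3e9, 50, dist_cap=100))   # ~3.5e11: recompute looks costly
print(node_weight(3e9, 50, dist_cap=3))     # ~4.0e9:  recompute looks cheap
```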