Fix issues with async TP #117

Open
LucasWilkinson wants to merge 2 commits into main from lwllkinson/fix-async-tp
@LucasWilkinson
Collaborator

FA3 unconditionally using PDL (Programmatic Dependent Launch) can cause a deadlock when combined with async TP, which also uses PDL (in PyTorch's symmetric memory).

Signed-off-by: Lucas Wilkinson <[email protected]>
…mputed

When scheduler metadata is computed separately (skip_scheduler_metadata_computation=true),
there may be other PDL users (e.g., symmetric memory all-reduce for async TP) between
the scheduler call and the attention call. These can interfere with FA3's PDL signaling
chain, causing hangs.

This extends the previous fix (disabling prepare_varlen PDL) to also disable the
main kernel -> combine kernel PDL when using pre-computed scheduler metadata.

Signed-off-by: Lucas Wilkinson <[email protected]>
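
The gating described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual FA3 code: `LaunchConfig`, `PdlPlan`, `plan_pdl`, and the fields `arch_supports_pdl` and `needs_combine` are hypothetical names invented here; only `skip_scheduler_metadata_computation` comes from the PR text. The sketch assumes PDL is kept only when the whole prepare -> main -> combine chain is launched back to back by the same call.

```cpp
// Hypothetical inputs to the launch decision (names are illustrative only).
struct LaunchConfig {
    bool arch_supports_pdl;                    // assumed: GPU supports PDL at all
    bool skip_scheduler_metadata_computation;  // metadata was pre-computed by a separate call
    bool needs_combine;                        // assumed: a combine kernel follows the main kernel
};

// Which kernel-to-kernel edges may use PDL (hypothetical structure).
struct PdlPlan {
    bool prepare_to_main;   // prepare_varlen -> main kernel
    bool main_to_combine;   // main kernel -> combine kernel
};

inline PdlPlan plan_pdl(const LaunchConfig& cfg) {
    // With pre-computed metadata, other PDL users (e.g. a symmetric-memory
    // all-reduce) may be enqueued between the scheduler call and the
    // attention call, so fall back to ordinary stream ordering on both edges.
    const bool chain_is_uninterrupted = !cfg.skip_scheduler_metadata_computation;
    const bool use_pdl = cfg.arch_supports_pdl && chain_is_uninterrupted;
    return PdlPlan{use_pdl, use_pdl && cfg.needs_combine};
}
```

Under these assumptions, a call with `skip_scheduler_metadata_computation=true` gets no PDL on either edge, while the default path keeps PDL on both edges when a combine kernel is launched.
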
@LucasWilkinson LucasWilkinson force-pushed the lwllkinson/fix-async-tp branch from 5f86b74 to c427cae Compare February 9, 2026 15:41
@ProExpertProg ProExpertProg left a comment

Impressive find!
