Conversation

@pritamdamania87 (Contributor) commented Jun 24, 2020

Stack from ghstack:

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

  1. Rank 0 goes into DDP construction, holds the GIL, and waits for the broadcast inside DDP construction.
  2. Rank 3 is a little slower and performs an RRef fetch call before its DDP construction.
  3. The RRef fetch call is served on Rank 0 and tries to acquire the GIL there.
  4. We now have a deadlock: Rank 0 is waiting for Rank 3 to enter the collective, and Rank 3 is waiting for Rank 0 to release the GIL (see the sketch below).

Differential Revision: D22205180
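
The pattern behind steps 1 and 4 is easiest to see at the binding layer. The sketch below is an illustration only, not the PyTorch code touched by this PR: it assumes pybind11, and `ddp_sketch`, `construct_ddp`, and `broadcast_parameters_blocking` are hypothetical stand-ins for the DDP construction path and its parameter broadcast.

```cpp
// Illustration only (hypothetical module/function names, not the real bindings):
// a pybind11 binding that blocks on a collective while still holding the GIL.
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Placeholder for the parameter broadcast issued during DDP construction; it
// returns only once every rank in the process group has joined the collective.
void broadcast_parameters_blocking() { /* ... */ }

PYBIND11_MODULE(ddp_sketch, m) {
  m.def("construct_ddp", [] {
    // pybind11 keeps the GIL held for the whole call by default. While this
    // rank (Rank 0 in the scenario above) waits here for Rank 3, any Python
    // work needed to serve Rank 3's RRef fetch on this process cannot acquire
    // the GIL, which produces the circular wait described in step 4.
    broadcast_parameters_blocking();
  });
}
```

With the GIL held across the wait, Rank 0 can only make progress once Rank 3 reaches the broadcast, and Rank 3 can only get there once its RRef fetch is served on Rank 0, which is exactly the circular wait listed above.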

pritamdamania87 pushed a commit that referenced this pull request Jun 24, 2020
ghstack-source-id: 106491684
Pull Request resolved: #40495
@dr-ci bot commented Jun 24, 2020

💊 CI failures summary and remediations

As of commit 53d8462 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@zhaojuanmao (Contributor) left a comment


Good catch! It should be safe to release the GIL here, since there are no Python-related calls in the Reducer constructor.
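
For context, here is a minimal sketch of what releasing the GIL at the binding level can look like, assuming pybind11; the module and helper names below are hypothetical placeholders, not the actual Reducer bindings changed by this PR.

```cpp
// Sketch only: drop the GIL for the duration of a binding whose C++ body makes
// no Python calls, so RPC handlers on this process can run while it waits.
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Placeholder for the blocking work done during DDP/Reducer construction.
void broadcast_parameters_blocking() { /* ... */ }

PYBIND11_MODULE(ddp_sketch_fixed, m) {
  m.def(
      "construct_ddp",
      [] { broadcast_parameters_blocking(); },
      // Release the GIL before entering the C++ body and reacquire it on return.
      py::call_guard<py::gil_scoped_release>());
}
```

The guard is only safe because, as noted above, the guarded body makes no Python calls; anything that had to touch Python objects would first need to reacquire the GIL (for example via `py::gil_scoped_acquire`).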

pritamdamania87 pushed a commit that referenced this pull request Jun 24, 2020
Pull Request resolved: #40495

ghstack-source-id: 106534442

Differential Revision: [D22205180](https://our.internmc.facebook.com/intern/diff/D22205180/)
@facebook-github-bot (Contributor)

This pull request has been merged in ea06db9.

@facebook-github-bot deleted the gh/pritamdamania87/145/head branch June 28, 2020 14:17
malfet pushed a commit to malfet/pytorch that referenced this pull request Jul 1, 2020
Summary:
Pull Request resolved: pytorch#40495

ghstack-source-id: 106534442

Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
seemethere pushed a commit that referenced this pull request Jul 1, 2020
Summary:
Pull Request resolved: #40495


Co-authored-by: Pritam Damania <[email protected]>