(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910) #64826

cbalioglu · 2021-09-10T16:57:41Z

(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)

Summary:
Pull Request resolved: #63910

Addresses the issue that init_method=tcp:// is not compatible with torch.distributed.run and torch.distributed.launch. The problem shows up when a training script that initializes the process group with init_method=tcp://localhost:$port is launched as follows:

$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py

An Address in use error is raised because the training script tries to create a TCPStore on port 6000, which is already taken by the TCPStore that the elastic agent runs on that port.

For details see: #63874.
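The port conflict itself is not specific to TCPStore and can be reproduced with plain sockets. The snippet below is only an illustrative sketch: second_bind_error is a hypothetical helper, not part of PyTorch, and it stands in for "agent store already listening, trainer tries to listen on the same port".

```python
import errno
import socket


def second_bind_error(host: str = "127.0.0.1") -> int:
    """Bind two sockets to the same port and return the errno of the second bind.

    Hypothetical helper for illustration only; it mimics two servers
    competing for one port and returns 0 if the second bind succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port as the first socket
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    print(errno.errorcode.get(second_bind_error(), "no error"))
```

The second bind fails because the first socket still owns the port, which is the same situation the trainer runs into when it tries to start its own store next to the agent's.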

This change does the following:

1. Adds an is_torchelastic_launched() check function that users can call in their training scripts to see whether the script was launched via torchelastic.
2. Updates the torch.distributed docs page to include the new is_torchelastic_launched() function.
3. Makes init_method=tcp:// torchelastic compatible by modifying _tcp_rendezvous_handler in torch.distributed.rendezvous (this is NOT the elastic rendezvous; it is the old rendezvous module, which is slated for deprecation in future releases) to check is_torchelastic_launched() AND torchelastic_use_agent_store() and, if both hold, to create only TCPStore clients (no daemons, not even for rank 0).
4. Adds unit tests to cover the different code paths.
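The decision described in the list above can be sketched without importing torch. This is a simplified illustration, not the actual _tcp_rendezvous_handler: the function names are hypothetical, and the environment variable names are assumptions that this PR description does not spell out.

```python
import os
from typing import Mapping

# Assumed environment variable names; the PR description does not list them.
RUN_ID_VAR = "TORCHELASTIC_RUN_ID"
USE_AGENT_STORE_VAR = "TORCHELASTIC_USE_AGENT_STORE"


def launched_by_torchelastic(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical stand-in for the is_torchelastic_launched() check."""
    return env.get(RUN_ID_VAR) is not None


def use_agent_store(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical stand-in for the agent-store check."""
    return env.get(USE_AGENT_STORE_VAR) == "True"


def should_start_store_daemon(rank: int, env: Mapping[str, str] = os.environ) -> bool:
    """Return True if this process should host the TCPStore server.

    Under torchelastic with the agent store enabled, every rank is a
    client, including rank 0. Otherwise rank 0 hosts the store.
    """
    if launched_by_torchelastic(env) and use_agent_store(env):
        return False
    return rank == 0


if __name__ == "__main__":
    elastic_env = {RUN_ID_VAR: "example-run", USE_AGENT_STORE_VAR: "True"}
    print(should_start_store_daemon(0, {}))           # plain launch: rank 0 hosts
    print(should_start_store_daemon(0, elastic_env))  # elastic launch: client only
```

Passing the environment as a mapping keeps the check easy to exercise in unit tests without mutating os.environ.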

NOTE: The issue suggests failing fast with an assertion on init_method!=env:// when is_torchelastic_launched() is True. There are three registered init_methods in PyTorch: env://, tcp://, and file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5

facebook-github-bot · 2021-09-10T16:57:47Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/64826
↩️ [fb-only] Re-run with SSH instructions
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit 4de6bf6 (more details on the Dr. CI page):

4/4 failures possibly* introduced in this PR
- 1/4 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / build (1/2)

Step: "Build PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-09-10T17:56:31.4388480Z Build left local git repository checkout dirty

2021-09-10T17:56:31.1157290Z + [[ pytorch-linux-xenial-py3.6-gcc5.4 != *rocm* ]]
2021-09-10T17:56:31.1158150Z + [[ pytorch-linux-xenial-py3.6-gcc5.4 != *xla* ]]
2021-09-10T17:56:31.1158800Z ++ git status --porcelain
2021-09-10T17:56:31.4381935Z + git_status='?? third_party/breakpad/
2021-09-10T17:56:31.4382478Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4383134Z ?? third_party/pocketfft/'
2021-09-10T17:56:31.4383935Z + [[ -n ?? third_party/breakpad/
2021-09-10T17:56:31.4384590Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4384968Z ?? third_party/pocketfft/ ]]
2021-09-10T17:56:31.4387526Z + echo 'Build left local git repository checkout dirty'
2021-09-10T17:56:31.4388480Z Build left local git repository checkout dirty
2021-09-10T17:56:31.4389093Z + echo 'git status --porcelain:'
2021-09-10T17:56:31.4389568Z git status --porcelain:
2021-09-10T17:56:31.4390051Z + echo '?? third_party/breakpad/
2021-09-10T17:56:31.4390449Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4390919Z ?? third_party/pocketfft/'
2021-09-10T17:56:31.4391301Z ?? third_party/breakpad/
2021-09-10T17:56:31.4391670Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4392060Z ?? third_party/pocketfft/
2021-09-10T17:56:31.4392376Z + exit 1
2021-09-10T17:56:31.4392656Z + cleanup

pytorch_macos_10_13_py3_test (2/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 10 18:40:41 ERROR [0.004s]: test_poisson_sample (__main__.TestDistributions)

Sep 10 18:40:41   File "distributions/test_distributions.py", line 805, in _check_sampler_discrete
Sep 10 18:40:41     chisq, p = scipy.stats.chisquare(counts[msk], pmf[msk] * num_samples)
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6853, in chisquare
Sep 10 18:40:41     lambda_="pearson")
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
Sep 10 18:40:41     raise ValueError(msg)
Sep 10 18:40:41 ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
Sep 10 18:40:41 0.008265582255680495
Sep 10 18:40:41 
Sep 10 18:40:41 ======================================================================
Sep 10 18:40:41 ERROR [0.004s]: test_poisson_sample (__main__.TestDistributions)
Sep 10 18:40:41 ----------------------------------------------------------------------
Sep 10 18:40:41 Traceback (most recent call last):
Sep 10 18:40:41   File "distributions/test_distributions.py", line 1333, in test_poisson_sample
Sep 10 18:40:41     failure_rate=1e-3)
Sep 10 18:40:41   File "distributions/test_distributions.py", line 805, in _check_sampler_discrete
Sep 10 18:40:41     chisq, p = scipy.stats.chisquare(counts[msk], pmf[msk] * num_samples)
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6853, in chisquare
Sep 10 18:40:41     lambda_="pearson")
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
Sep 10 18:40:41     raise ValueError(msg)

1 failure not recognized by patterns:

Job	Step	Action
Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / render_test_results	Download PyTorch Test Reports	🔁 rerun

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-bionic-rocm4.2-py3.6

This comment was automatically generated by Dr. CI.

facebook-github-bot added the oncall: distributed and cla signed labels Sep 10, 2021

cbalioglu changed the title ~~(torch.distributed) Add torch.distributed.is_torchelastic_launched() …~~ (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910) Sep 10, 2021

cbalioglu force-pushed the backport-63910 branch from 050213a to 8d2df31 Compare September 10, 2021 17:10

cbalioglu requested review from aivanou, kiukchung and malfet September 10, 2021 17:21

cbalioglu added this to the 1.9.1 milestone Sep 10, 2021

cbalioglu linked an issue Sep 10, 2021 that may be closed by this pull request

[torch.distributed.run] Assert (and fail fast w/ message) init_method=="env://" on the trainer #63874

Closed

cbalioglu self-assigned this Sep 10, 2021

cbalioglu force-pushed the backport-63910 branch from 8d2df31 to 58a2e6e Compare September 10, 2021 17:29

cbalioglu force-pushed the backport-63910 branch from 58a2e6e to 4de6bf6 Compare September 10, 2021 17:47

cbalioglu marked this pull request as ready for review September 10, 2021 22:37

cbalioglu requested review from H-Huang, mingzhe09088, mrshenli, pritamdamania87, rohan-varma, wayi1 and zhaojuanmao as code owners September 10, 2021 22:37

malfet mentioned this pull request Sep 12, 2021

[v.1.9.1] Release Tracker #62586

Closed

malfet approved these changes Sep 12, 2021

malfet merged commit e2cb357 into release/1.9 Sep 12, 2021

malfet deleted the backport-63910 branch September 12, 2021 17:38

cbalioglu removed their assignment Sep 17, 2021

cbalioglu mentioned this pull request Sep 20, 2021

Distributed Elastic Training Failed on Two Amazon g4dn.xlarge Instances #64990

Closed

4 participants
