@cbalioglu cbalioglu commented Sep 10, 2021

(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)

Summary:
Pull Request resolved: #63910

Addresses the issue that init_method=tcp:// is not compatible with torch.distributed.run and torch.distributed.launch. Consider a training script that initializes the process group with init_method=tcp://localhost:$port and is launched as follows:

$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py

An "Address in use" error is raised because the training script tries to create a TCPStore on port 6000, which is already taken by the TCPStore that the elastic agent is running on that port.
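
The failure mode can be reproduced without PyTorch. The following is a minimal sketch using only the standard library: one listening socket stands in for the agent's TCPStore, and a second bind on the same port stands in for the training script's attempt to start its own store. The helper name `second_bind_errno` is hypothetical and not part of this PR.

```python
import errno
import socket


def second_bind_errno() -> int:
    """Bind two sockets to the same port and return the errno of the second bind.

    The first socket plays the role of the agent's store (already listening);
    the second plays the role of the training script trying to start a server
    on the same port. Returns 0 if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same port, no SO_REUSEADDR
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(errno.errorcode.get(code, code))
```

The second bind fails with `EADDRINUSE`, which is the same class of error the training script hits when it tries to host a server on the agent's port.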

For details see: #63874.

This change does the following:

  1. Adds an is_torchelastic_launched() check function that users can call in their training scripts to see whether the script was launched via torchelastic.
  2. Updates the torch.distributed docs page to include the new is_torchelastic_launched() function.
  3. Makes init_method=tcp:// torchelastic compatible by modifying _tcp_rendezvous_handler in torch.distributed.rendezvous (this is NOT the elastic rendezvous; it is the old rendezvous module, which is slated for deprecation in future releases) to check is_torchelastic_launched() AND torchelastic_use_agent_store() and, if both hold, to create only TCPStore clients (no daemons, not even for rank 0).
  4. Adds unit tests to cover the different code paths.
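
The decision described in item 3 can be sketched as a small pure function. This is an illustration only, not the code in this PR: the environment variable names `TORCHELASTIC_RUN_ID` and `TORCHELASTIC_USE_AGENT_STORE` and the helper names below are assumptions made for the sketch, and the real handler constructs TCPStore objects rather than returning a boolean.

```python
from typing import Mapping


def is_elastic_launched(env: Mapping[str, str]) -> bool:
    # Assumption: torchelastic marks worker processes with a run-id variable.
    return env.get("TORCHELASTIC_RUN_ID") is not None


def uses_agent_store(env: Mapping[str, str]) -> bool:
    # Assumption: the agent signals that its own store should be reused.
    return env.get("TORCHELASTIC_USE_AGENT_STORE") == "True"


def should_start_daemon(rank: int, env: Mapping[str, str]) -> bool:
    """Return True if this process should host the store server (daemon).

    Under torchelastic with the agent store in use, every rank (including
    rank 0) is a client only. Otherwise rank 0 hosts the server, as in the
    classic tcp:// rendezvous.
    """
    if is_elastic_launched(env) and uses_agent_store(env):
        return False
    return rank == 0


if __name__ == "__main__":
    elastic_env = {
        "TORCHELASTIC_RUN_ID": "example",
        "TORCHELASTIC_USE_AGENT_STORE": "True",
    }
    print(should_start_daemon(0, elastic_env))  # rank 0 stays a client
    print(should_start_daemon(0, {}))           # rank 0 hosts the server
```

Taking the environment as a parameter rather than reading `os.environ` directly keeps each code path easy to unit test.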

NOTE: the issue suggests that we should fail fast with an assertion on init_method!=env:// when is_torchelastic_launched() is True. There are three registered init_methods in PyTorch: env://, tcp://, and file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible with torchelastic as well, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
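
A training script that wants to follow the env:// recommendation while still running standalone could choose its init_method along these lines. This is a hedged sketch: `pick_init_method` is a hypothetical helper, and the default host and port are placeholder values rather than anything specified in this PR.

```python
def pick_init_method(
    launched_by_elastic: bool,
    host: str = "localhost",
    port: int = 29500,
) -> str:
    """Choose an init_method string for init_process_group.

    Prefers env:// when the script was started by torchelastic, and falls
    back to an explicit tcp:// address for standalone runs.
    """
    if launched_by_elastic:
        return "env://"
    return f"tcp://{host}:{port}"


if __name__ == "__main__":
    print(pick_init_method(True))
    print(pick_init_method(False, "node0", 6000))
```

In a real script, the flag would come from `torch.distributed.is_torchelastic_launched()` and the result would be passed as `init_method` to `torch.distributed.init_process_group`. With a tcp:// address, `rank` and `world_size` also have to be passed explicitly.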

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5

@facebook-github-bot facebook-github-bot added the "oncall: distributed" and "cla signed" labels Sep 10, 2021
facebook-github-bot commented Sep 10, 2021

💊 CI failures summary and remediations

As of commit 4de6bf6 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / build (1/2)

Step: "Build PyTorch" (full log | diagnosis details | 🔁 rerun)

2021-09-10T17:56:31.4388480Z Build left local git repository checkout dirty
2021-09-10T17:56:31.1157290Z + [[ pytorch-linux-xenial-py3.6-gcc5.4 != *rocm* ]]
2021-09-10T17:56:31.1158150Z + [[ pytorch-linux-xenial-py3.6-gcc5.4 != *xla* ]]
2021-09-10T17:56:31.1158800Z ++ git status --porcelain
2021-09-10T17:56:31.4381935Z + git_status='?? third_party/breakpad/
2021-09-10T17:56:31.4382478Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4383134Z ?? third_party/pocketfft/'
2021-09-10T17:56:31.4383935Z + [[ -n ?? third_party/breakpad/
2021-09-10T17:56:31.4384590Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4384968Z ?? third_party/pocketfft/ ]]
2021-09-10T17:56:31.4387526Z + echo 'Build left local git repository checkout dirty'
2021-09-10T17:56:31.4388480Z Build left local git repository checkout dirty
2021-09-10T17:56:31.4389093Z + echo 'git status --porcelain:'
2021-09-10T17:56:31.4389568Z git status --porcelain:
2021-09-10T17:56:31.4390051Z + echo '?? third_party/breakpad/
2021-09-10T17:56:31.4390449Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4390919Z ?? third_party/pocketfft/'
2021-09-10T17:56:31.4391301Z ?? third_party/breakpad/
2021-09-10T17:56:31.4391670Z ?? third_party/cudnn_frontend/
2021-09-10T17:56:31.4392060Z ?? third_party/pocketfft/
2021-09-10T17:56:31.4392376Z + exit 1
2021-09-10T17:56:31.4392656Z + cleanup

See CircleCI build pytorch_macos_10_13_py3_test (2/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 10 18:40:41 ERROR [0.004s]: test_poisson_sample (__main__.TestDistributions)
Sep 10 18:40:41   File "distributions/test_distributions.py", line 805, in _check_sampler_discrete
Sep 10 18:40:41     chisq, p = scipy.stats.chisquare(counts[msk], pmf[msk] * num_samples)
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6853, in chisquare
Sep 10 18:40:41     lambda_="pearson")
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
Sep 10 18:40:41     raise ValueError(msg)
Sep 10 18:40:41 ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
Sep 10 18:40:41 0.008265582255680495
Sep 10 18:40:41 
Sep 10 18:40:41 ======================================================================
Sep 10 18:40:41 ERROR [0.004s]: test_poisson_sample (__main__.TestDistributions)
Sep 10 18:40:41 ----------------------------------------------------------------------
Sep 10 18:40:41 Traceback (most recent call last):
Sep 10 18:40:41   File "distributions/test_distributions.py", line 1333, in test_poisson_sample
Sep 10 18:40:41     failure_rate=1e-3)
Sep 10 18:40:41   File "distributions/test_distributions.py", line 805, in _check_sampler_discrete
Sep 10 18:40:41     chisq, p = scipy.stats.chisquare(counts[msk], pmf[msk] * num_samples)
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6853, in chisquare
Sep 10 18:40:41     lambda_="pearson")
Sep 10 18:40:41   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
Sep 10 18:40:41     raise ValueError(msg)

1 failure not recognized by patterns:

Job Step Action
GitHub Actions Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / render_test_results Download PyTorch Test Reports 🔁 rerun

ci.pytorch.org: 1 failed



@cbalioglu cbalioglu changed the title (torch.distributed) Add torch.distributed.is_torchelastic_launched() … (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910) Sep 10, 2021
@cbalioglu cbalioglu added this to the 1.9.1 milestone Sep 10, 2021
@cbalioglu cbalioglu self-assigned this Sep 10, 2021
@malfet malfet merged commit e2cb357 into release/1.9 Sep 12, 2021
@malfet malfet deleted the backport-63910 branch September 12, 2021 17:38
@cbalioglu cbalioglu removed their assignment Sep 17, 2021

Labels

cla signed, oncall: distributed

Development

Successfully merging this pull request may close these issues.

[torch.distributed.run] Assert (and fail fast w/ message) init_method=="env://" on the trainer
