feat: add async model average algorithm#110

Merged
NOBLES5E merged 34 commits into master from feat/async-model-average on Aug 23, 2021
Conversation

@NOBLES5E
Contributor

@NOBLES5E NOBLES5E commented Jul 7, 2021

No description provided.

@pr-triage pr-triage Bot added the PR: draft label Jul 7, 2021
Comment thread bagua/torch_api/algorithms/async_model_average.py
Comment thread bagua/torch_api/algorithms/async_model_average.py
Comment thread bagua/torch_api/algorithms/async_model_average.py Outdated
Comment thread bagua/torch_api/bucket.py Outdated
wangraying and others added 2 commits August 6, 2021 17:03
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@wangraying wangraying requested a review from a team August 6, 2021 10:13
@wangraying wangraying marked this pull request as ready for review August 6, 2021 10:13
Comment thread bagua/torch_api/algorithms/async_model_average.py
Comment thread bagua/torch_api/algorithms/async_model_average.py
wangraying and others added 2 commits August 6, 2021 18:48
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Comment thread bagua/torch_api/algorithms/async_model_average.py Outdated
wangraying and others added 4 commits August 6, 2021 21:27
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Comment thread bagua/torch_api/algorithms/__init__.py Outdated
Comment thread tests/torch_api/test_async_model_average.py
wangraying and others added 3 commits August 12, 2021 23:02
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@wangraying
Member

wangraying commented Aug 19, 2021

update @NOBLES5E

It seems that using the LL128 protocol causes the program to hang. I have reproduced the problem with sample code backed by torch + cupy, which is consistent with the results on Bagua.

As far as I know, the other two protocols, LL and Simple, do not hang.

I have opened an issue on GitHub, which we can track later on.
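
A minimal sketch of the workaround this comment implies, assuming the usual NCCL_PROTO semantics (a comma-separated list of protocols, where a leading ^ excludes the listed ones). The helper name exclude_ll128 is hypothetical and is not part of Bagua or NCCL; the variable would have to be set before any NCCL communicator is created.

```python
import os


def exclude_ll128(env=None):
    """Set NCCL_PROTO to exclude LL128 unless a value is already present.

    Hypothetical helper for illustration only; not part of Bagua or NCCL.
    `env` defaults to the process environment and can be a plain dict in tests.
    """
    env = os.environ if env is None else env
    # "^LL128" asks NCCL to use every protocol except LL128.
    # setdefault leaves an existing user-provided value untouched.
    env.setdefault("NCCL_PROTO", "^LL128")
    return env["NCCL_PROTO"]
```

Because an existing value is left untouched, a user setting that still allows LL128 would need to be changed by hand.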

@pr-triage pr-triage Bot removed the PR: draft label Aug 20, 2021
@todo

todo Bot commented Aug 20, 2021

remove nccl proto check

# TODO: remove nccl proto check
proto_str = os.environ.get("NCCL_PROTO", "")
if (
    proto_str == ""
    or ("^" not in proto_str and "LL128" in proto_str)
    or ("^" in proto_str and "LL128" not in proto_str)


This comment was generated by todo based on a TODO comment in f550bcd in #110. cc @BaguaSys.
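
The truncated condition quoted in the bot comment above can be read as a predicate that answers "might NCCL pick LL128 with this NCCL_PROTO value?". The sketch below wraps the three quoted clauses in a function; the name ll128_may_be_used is an assumption made for illustration, and what the original code does when the condition holds is not visible in the snippet.

```python
def ll128_may_be_used(proto_str: str) -> bool:
    """Return True when the given NCCL_PROTO value does not rule out LL128.

    Mirrors the three clauses quoted in the snippet; the function name is
    hypothetical and does not appear in the pull request.
    """
    return (
        proto_str == ""  # variable unset: no protocol has been excluded
        or ("^" not in proto_str and "LL128" in proto_str)  # LL128 listed explicitly
        or ("^" in proto_str and "LL128" not in proto_str)  # exclusion list omits LL128
    )


# Values that exclude LL128 make the predicate False.
for value in ("", "LL128", "LL,Simple", "^LL128", "^LL"):
    print(repr(value), ll128_may_be_used(value))
```

Of the sample values, only "LL,Simple" and "^LL128" rule LL128 out.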

Comment thread bagua/torch_api/algorithms/__init__.py Outdated
@wangraying
Member

@wangraying fix CI so that we can merge this
We also need a tutorial page on this algorithm

This CI seems broken.

@wangraying
Member

wangraying commented Aug 20, 2021

Also, do you have any further comments on the usage, given that we added a barrier function to end the async threads?

@NOBLES5E

@todo

todo Bot commented Aug 23, 2021

; remove this after NVIDIA/nccl#549 gets solved

)  # TODO; remove this after https://github.com/NVIDIA/nccl/issues/549 gets solved
class AsyncModelAverageAlgorithm(Algorithm):
    def __init__(
        self, peer_selection_mode: str = "all", sync_interval_ms: int = 500,


This comment was generated by todo based on a TODO comment in 1c33cf9 in #110. cc @BaguaSys.

Comment thread bagua/torch_api/algorithms/async_model_average.py Outdated
Comment thread bagua/torch_api/algorithms/async_model_average.py Outdated
Comment thread bagua/torch_api/algorithms/async_model_average.py
Contributor Author

@NOBLES5E NOBLES5E left a comment

I refined the text and left some comments. @wangraying, please do a final check.

Comment thread bagua/torch_api/bucket.py Outdated
NOBLES5E and others added 3 commits August 22, 2021 20:37
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@NOBLES5E NOBLES5E merged commit fcef9ef into master Aug 23, 2021
@NOBLES5E NOBLES5E deleted the feat/async-model-average branch August 23, 2021 04:53
@todo todo Bot mentioned this pull request Aug 23, 2021
@pr-triage pr-triage Bot added the PR: merged label Aug 23, 2021
2 participants