
Conversation

@zty-king
Contributor

@zty-king zty-king commented Jun 4, 2025

PR Category

Auto Parallel

PR Types

Bug fixes

Description

In practice, ProcessMesh's get_group method repeatedly creates communication groups, which blows up GPU memory or causes unexpected errors during communication. Converting a ProcessMesh into a communication group for actual use therefore needs care: if a group with the same mesh as the one requested via mesh.get_group already exists, get_group should return that existing group instead of creating a new one.

@paddle-bot

paddle-bot bot commented Jun 4, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Jun 4, 2025
@zty-king
Contributor Author

zty-king commented Jun 6, 2025

Local unit test results are shown below:
[screenshot: local unit test output]
Coverage of each code path was verified separately (confirming the changed code is exercised):
[screenshots: per-case coverage verification]

@paddle-ci-bot

paddle-ci-bot bot commented Jun 15, 2025

Sorry to inform you that f056bd8's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@codecov-commenter

codecov-commenter commented Jul 24, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@0f3860d). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...n/paddle/distributed/auto_parallel/process_mesh.py 57.14% 6 Missing ⚠️

❌ Your patch status has failed because the patch coverage (57.14%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #73099   +/-   ##
==========================================
  Coverage           ?   57.14%           
==========================================
  Files              ?        1           
  Lines              ?       14           
  Branches           ?        0           
==========================================
  Hits               ?        8           
  Misses             ?        6           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@xuxinyi389 xuxinyi389 changed the title Fix get group [AutoParallel]Fix get group Jul 29, 2025
Review thread on process_mesh.py:

    )
    return parallel_group_map[dim_name]()
    existing_group = None
Contributor
existing_group is a redundant variable; in the if set(group.ranks) == set(self._process_ids) branch, just return group directly.

Contributor Author
Done
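The reviewer's suggestion amounts to an early return from the matching branch. A minimal sketch of the shape (names hypothetical, not the PR's exact code):

```python
# Hypothetical sketch of the reviewer's suggestion: return the matching
# group from inside the loop instead of staging it in an `existing_group`
# temporary variable.
from collections import namedtuple

Group = namedtuple("Group", ["ranks"])


def find_matching_group(groups, process_ids):
    for group in groups:
        if set(group.ranks) == set(process_ids):
            return group  # early return; no redundant variable needed
    return None
```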



if __name__ == "__main__":
test_dp_parallel()
Contributor
The fleet_test_xx files are all similar; could they be merged into a single file?

Contributor Author
Done
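One common way to fold several near-identical test scripts into one file is to parametrize a shared test body over the cases. A minimal sketch, with the mode names and test body purely illustrative:

```python
# Hypothetical sketch of merging several near-identical fleet_test_*
# scripts into one parametrized runner; the mode names and the body of
# run_parallel_case are illustrative, not the PR's actual tests.

PARALLEL_MODES = ["dp", "mp", "pp"]


def run_parallel_case(mode):
    # Stand-in for the shared test body each fleet_test_* file duplicated.
    return f"ran {mode}"


def test_all_modes():
    # One file drives every mode instead of one file per mode.
    return [run_parallel_case(mode) for mode in PARALLEL_MODES]
```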

@zty-king
Contributor Author

/re-run all-failed

@zty-king
Contributor Author

Additional local test coverage results:

  1. An already existing group (not from hybrid_communicate_group):
[screenshots: coverage for the existing-group path]
  2. A group that does not yet exist:
[screenshots: coverage for the new-group path]

@xuxinyi389 xuxinyi389 changed the title [AutoParallel]Fix get group [AutoParallel]Fix get_group method of processmesh Jul 31, 2025
Contributor

@xuxinyi389 xuxinyi389 left a comment
LGTM

@xuxinyi389 xuxinyi389 merged commit 98d5956 into PaddlePaddle:develop Jul 31, 2025
112 of 125 checks passed