Revert deepspeed z3 regressions #37315

Closed

winglian
Contributor

@winglian winglian commented Apr 6, 2025

What does this PR do?

#36963, #37281, and #37306 cause regressions when training with DeepSpeed ZeRO-3. See our axolotl integration tests against the latest 4.51.0 release, which includes these commits/PRs; they all fail with ZeRO-3: https://github.com/axolotl-ai-cloud/axolotl/actions/runs/14286223137/job/40041643515

@LysandreJik seems to agree that these deepspeed changes should be reverted

/cc @stas00

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@LysandreJik @SunMarc @zach-huggingface
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions github-actions bot marked this pull request as draft April 6, 2025 00:28
github-actions bot commented Apr 6, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@winglian winglian marked this pull request as ready for review April 6, 2025 01:19
@github-actions github-actions bot requested review from ArthurZucker and ydshieh April 6, 2025 01:19
@winglian
Contributor Author

winglian commented Apr 6, 2025

Our multi-GPU integration CI is passing for ZeRO-3 with this branch.

@stas00
Contributor

stas00 commented Apr 6, 2025

@winglian, I think you want to tag @Cyrilvallez and @ArthurZucker.

Does my proposal here help?
#37281 (comment)
It's incomplete, but it overcomes the earlier version's copying from the meta device, as you can see in my comments. It is an attempt to channel all loading into the same code path. But Cyril said it's incomplete and more work is needed.

@stas00
Contributor

stas00 commented Apr 6, 2025

Also, if you have the resources, could you please add the missing test to tests/deepspeed/* to match your reported failure? Clearly your use case isn't being tested, and the test suite is only as good as its coverage.

Please be mindful of this gotcha: #37281 (comment), i.e. make sure the test model is stored in the .safetensors format.
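For illustration only (this is not from the thread): a new test could guard against that gotcha by checking that the checkpoint directory really holds safetensors weights before using it. This is a minimal sketch, assuming the test model is a local directory. The helper name `has_safetensors_weights` is hypothetical and is not part of transformers or DeepSpeed.

```python
from pathlib import Path


def has_safetensors_weights(model_dir):
    """Return True if model_dir contains at least one *.safetensors file.

    Hypothetical helper for illustration; it only inspects file names in
    the top level of the directory and does not validate file contents.
    """
    return any(Path(model_dir).glob("*.safetensors"))
```

A test could assert `has_safetensors_weights(path)` on its fixture directory first, so that a checkpoint saved in another format fails loudly instead of silently exercising a different loading path.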

@winglian
Contributor Author

winglian commented Apr 6, 2025

#37324 fixes this. We'll open a new PR with test cases for ZeRO-3.

@winglian winglian closed this Apr 6, 2025
2 participants