
Conversation

@farzadab
Contributor

@farzadab farzadab commented Sep 10, 2024

This PR is mostly complete, but I do need to revisit some minor assumptions to make sure I didn't break the normal (non-FSDP) pipeline.

One of the odd changes here is the removal of .to(device, dtype).

  • The dtype part is moved into the model itself, since accidentally loading a full-precision 70B model and only then casting it to half precision would be far too expensive.
  • The device part is now handled by the trainer itself during trainer.train. This causes some minor issues (e.g. running trainer.evaluate beforehand is not allowed), but they're not hugely important.
    • Note that model.device is now cpu (or mps) before trainer.train and cuda:rank after trainer.train. This might take some getting used to.
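To illustrate the dtype point, here is a minimal PyTorch sketch (not this repo's actual model code, just the general pattern): passing the dtype at construction materializes the parameters directly in half precision, whereas the old `.to(device, dtype)` pattern first allocates full-precision weights and only then casts them, which for a 70B model means a transient full-precision copy in memory.

```python
import torch
import torch.nn as nn

# Old pattern: construct in full precision, then cast.
# For a 70B model this transiently holds the entire fp32 copy before casting.
model_old = nn.Linear(4, 4).to("cpu", torch.bfloat16)

# New pattern (sketch): pass the dtype at construction via PyTorch's
# factory kwargs, so weights are materialized directly in half precision
# and no full-precision copy ever exists.
model_new = nn.Linear(4, 4, dtype=torch.bfloat16)

# Device placement is deferred: the model stays on CPU (or MPS) until the
# trainer moves/shards it onto cuda:<rank> inside trainer.train.
print(model_new.weight.dtype)        # torch.bfloat16
print(model_new.weight.device.type)  # cpu
```

The same idea applies to Hugging Face-style loaders that accept a dtype argument at load time instead of requiring a post-hoc cast.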

@farzadab
Contributor Author

@juberti are there any more comments?

@farzadab farzadab enabled auto-merge (squash) September 16, 2024 23:52
@farzadab farzadab merged commit be8ee6b into main Sep 16, 2024
1 check passed
@farzadab farzadab deleted the farzad-fsdp-p3 branch September 17, 2024 00:11
akshat0311 pushed a commit to jiviai/audio-llm that referenced this pull request Jan 30, 2025
* use_fsdp option

* return move to(device) when not using FSDP
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
