Fix ZenFlow ZeRO-3 selective optimizer crash with parameter offload on nvme #8042
Merged
This PR fixes a crash in `ZenFlowSelectiveAdamW_stage3` when ZeRO-3 offloads parameters to NVMe or CPU.

- Detect offloaded partitions (a 0-dim NVMe placeholder, or a partition on a device other than the gradients') and update them through a per-parameter path: swap each NVMe partition in and out one at a time, run AdamW on the compute device, and write the result back to where the partition lives.
- Move `selected_indices` to the partition's device in `temp_copy_param`, and skip the resident pre-write in the offload bucket flush.
- Leave the existing batched path unchanged for GPU-resident partitions.
- Add unit tests covering the swap-in/update/swap-out path.

## Root Cause

The selective optimizer updates each bf16 partition in place through `param.ds_tensor.data`, assuming it is resident on the compute device. When a partition is offloaded to NVMe, `ds_tensor.data` is a 0-dim placeholder, so `narrow()` raises "narrow() cannot be applied to a 0-dim tensor"; when it is on CPU it lives on a different device than the selected gradients, so indexing raises a device-mismatch error.

Fixes #7686

Signed-off-by: Tingfeng Lan <[email protected]>
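
For readers who want the shape of the per-parameter path without reading the diff, here is a minimal, torch-free sketch. It is not the code in this PR. `Partition`, `is_offloaded`, `update_offloaded`, and the dict-based `nvme_store` are hypothetical stand-ins, plain Python lists replace tensors, and `data=None` models the 0-dim NVMe placeholder. It only illustrates the detect, swap-in, AdamW, write-back order described above.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Partition:
    """Hypothetical stand-in for a ZeRO-3 partition (not a DeepSpeed type)."""
    device: str                   # "cuda", "cpu", or "nvme"
    data: Optional[List[float]]   # None models the 0-dim NVMe placeholder


def is_offloaded(part: Partition, grad_device: str) -> bool:
    # Offloaded means: placeholder only, or resident on a device other
    # than the one holding the selected gradients.
    return part.data is None or part.device != grad_device


def adamw_step(values, grads, m, v, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """Textbook AdamW with decoupled weight decay; updates m and v in place."""
    b1, b2 = betas
    out = []
    for i, (p, g) in enumerate(zip(values, grads)):
        p = p * (1.0 - lr * weight_decay)
        m[i] = b1 * m[i] + (1.0 - b1) * g
        v[i] = b2 * v[i] + (1.0 - b2) * g * g
        m_hat = m[i] / (1.0 - b1 ** step)
        v_hat = v[i] / (1.0 - b2 ** step)
        out.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return out


def update_offloaded(part: Partition, nvme_store: Dict[str, List[float]],
                     key: str, selected_indices: List[int],
                     grads: List[float], state: dict, step: int, **hyper):
    """Update only the selected elements of one offloaded partition."""
    on_nvme = part.data is None
    # Swap in: fetch the real values if only a placeholder is resident.
    values = list(nvme_store[key]) if on_nvme else part.data
    selected = [values[i] for i in selected_indices]
    updated = adamw_step(selected, grads, state["m"], state["v"], step, **hyper)
    for i, new_value in zip(selected_indices, updated):
        values[i] = new_value
    if on_nvme:
        # Swap out: write back to the store; the placeholder stays in place.
        nvme_store[key] = values
```

A caller would branch on `is_offloaded(...)` and keep the batched path for resident partitions. Handling one partition at a time, as the description states, keeps at most one swapped-in partition live at once.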