Fix ZenFlow ZeRO-3 selective optimizer crash with parameter offload on NVMe #8042

Merged
tohtana merged 2 commits into master from zenflow-stage3-nvme-support on Jun 4, 2026

Conversation

@Antlera
Collaborator

@Antlera Antlera commented Jun 2, 2026

This PR fixes a crash in ZenFlowSelectiveAdamW_stage3 when ZeRO-3 offloads parameters to NVMe or CPU.

  • Detect offloaded partitions (a 0-dim NVMe placeholder, or a partition on a device other than the gradients') and update them through a per-parameter path: swap each NVMe partition in and out one at a time, run AdamW on the compute device, and write the result back to where the partition lives.
  • Move selected_indices to the partition's device in temp_copy_param, and skip the resident pre-write in the offload bucket flush.
  • Leave the existing batched path unchanged for GPU-resident partitions.
  • Add unit tests covering the swap-in/update/swap-out path.
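The per-parameter path described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `ToyStore`, `swap_in`, `swap_out`, `is_offloaded`, `adamw_step`, and `update_offloaded` are hypothetical names, the "NVMe" store is an in-memory dict, the partition is treated as a flat 1-D tensor indexed by `selected_indices`, and everything runs on CPU.

```python
import torch


class ToyStore:
    """Hypothetical stand-in for an NVMe swapper; keeps partitions in a dict."""

    def __init__(self, tensors):
        self._tensors = dict(tensors)

    def swap_in(self, key):
        return self._tensors[key].clone()

    def swap_out(self, key, tensor):
        self._tensors[key] = tensor.clone()


def is_offloaded(partition, grad):
    # A 0-dim placeholder, or a partition on a different device than the gradients.
    return partition.dim() == 0 or partition.device != grad.device


def adamw_step(param, grad, exp_avg, exp_avg_sq, step, lr, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    # Plain AdamW with decoupled weight decay, applied in place to `param`.
    beta1, beta2 = betas
    param.mul_(1 - lr * weight_decay)
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    param.addcdiv_(exp_avg / (1 - beta1 ** step), denom, value=-lr)


def update_offloaded(store, key, grad_selected, selected_indices, state,
                     compute_device, lr):
    partition = store.swap_in(key)                  # bring one partition in
    idx = selected_indices.to(partition.device)     # index on the partition's device
    work = partition[idx].to(device=compute_device, dtype=torch.float32)
    grad = grad_selected.to(device=compute_device, dtype=torch.float32)
    state["step"] += 1
    adamw_step(work, grad, state["exp_avg"], state["exp_avg_sq"], state["step"], lr)
    # Write the result back to where the partition lives, then swap it out.
    partition[idx] = work.to(device=partition.device, dtype=partition.dtype)
    store.swap_out(key, partition)


store = ToyStore({"w": torch.ones(8, dtype=torch.bfloat16)})
selected_indices = torch.tensor([1, 3])
state = {"step": 0, "exp_avg": torch.zeros(2), "exp_avg_sq": torch.zeros(2)}
update_offloaded(store, "w", torch.ones(2), selected_indices, state,
                 compute_device=torch.device("cpu"), lr=0.1)
updated = store.swap_in("w")
```

Only the selected entries of `updated` change; the rest of the partition is written back untouched. The large `lr` is chosen here only so that the change survives rounding to bf16.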

Root Cause

The selective optimizer updates each bf16 partition in place through param.ds_tensor.data, assuming it is resident on the compute device. When a partition is offloaded to NVMe, ds_tensor.data is a 0-dim placeholder, so narrow() raises "narrow() cannot be applied to a 0-dim tensor"; when it is on CPU it lives on a different device than the selected gradients, so indexing raises a device-mismatch error.
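The NVMe half of this failure can be reproduced in isolation with plain PyTorch. The snippet below is only a sketch: `placeholder` is an assumed stand-in for an offloaded `ds_tensor.data`, not a tensor produced by DeepSpeed. The CPU case needs a second device to show the mismatch, so it is not covered here.

```python
import torch

# Assumed stand-in for an NVMe-offloaded partition: a 0-dim placeholder.
placeholder = torch.empty((), dtype=torch.bfloat16)

narrow_error = None
try:
    # The in-place update path slices the partition along dim 0.
    placeholder.narrow(0, 0, 1)
except RuntimeError as err:
    narrow_error = str(err)

print(narrow_error)
```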

Fixes #7686

Signed-off-by: Tingfeng Lan <[email protected]>
@Antlera
Collaborator Author

Antlera commented Jun 2, 2026

Hi @tjruwase @tohtana. This PR tries to fix #7686 by enabling NVMe offload for ZenFlow. I have validated the loss curve, and it looks similar on my side. Could you please help review when you get a chance? Thanks!

Collaborator

@tohtana tohtana left a comment

Thank you, @Antlera!
This looks good to me.

@tohtana tohtana enabled auto-merge (squash) June 4, 2026 16:53
@tohtana tohtana merged commit 28a196f into master Jun 4, 2026
12 checks passed
@tohtana tohtana deleted the zenflow-stage3-nvme-support branch June 4, 2026 17:20

Development

Successfully merging this pull request may close these issues.

[BUG] Zenflow_stage3 - RuntimeError: narrow() cannot be applied to a 0-dim tensor.

2 participants