Antlera · 2026-06-02T01:30:09Z

This PR fixes a crash in `ZenFlowSelectiveAdamW_stage3` when ZeRO-3 offloads parameters to NVMe or CPU.

- Detect offloaded partitions (a 0-dim NVMe placeholder, or a partition on a device other than the gradients') and update them through a per-parameter path: swap each NVMe partition in and out one at a time, run AdamW on the compute device, and write the result back to where the partition lives.
- Move `selected_indices` to the partition's device in `temp_copy_param`, and skip the resident pre-write in the offload bucket flush.
- Leave the existing batched path unchanged for GPU-resident partitions.
- Add unit tests covering the swap-in/update/swap-out path.

Root Cause

The selective optimizer updates each bf16 partition in place through `param.ds_tensor.data`, assuming it is resident on the compute device. When a partition is offloaded to NVMe, `ds_tensor.data` is a 0-dim placeholder, so `narrow()` raises "narrow() cannot be applied to a 0-dim tensor"; when it is on CPU, it lives on a different device than the selected gradients, so indexing raises a device-mismatch error.
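The routing described above can be sketched as a small pure-Python toy model. This is an illustration only, not the code in this PR: `Partition`, `ToySwapper`, `is_offloaded`, `adamw_selected`, and `step_partition` are hypothetical names, plain lists stand in for tensors, `data=None` stands in for the 0-dim NVMe placeholder, and the move to the compute device is not modeled.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Partition:
    """Toy stand-in for a ZeRO-3 partition (hypothetical, not a DeepSpeed type)."""
    name: str
    device: str                  # "gpu", "cpu", or "nvme"
    data: Optional[List[float]]  # None models the 0-dim NVMe placeholder
    exp_avg: Dict[int, float] = field(default_factory=dict)
    exp_avg_sq: Dict[int, float] = field(default_factory=dict)


class ToySwapper:
    """Hypothetical NVMe store keyed by partition name."""

    def __init__(self) -> None:
        self.store: Dict[str, List[float]] = {}

    def swap_in(self, part: Partition) -> None:
        part.data = list(self.store[part.name])

    def swap_out(self, part: Partition) -> None:
        self.store[part.name] = list(part.data)
        part.data = None  # back to the placeholder


def is_offloaded(part: Partition, grad_device: str) -> bool:
    """True for an NVMe placeholder or a partition not on the gradients' device."""
    return part.data is None or part.device != grad_device


def adamw_selected(part, grads, indices, step, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=0.01):
    """AdamW (decoupled weight decay) applied only at the selected indices.

    `grads` maps each selected index to its gradient value.
    """
    b1, b2 = betas
    for i in indices:
        g = grads[i]
        m = b1 * part.exp_avg.get(i, 0.0) + (1 - b1) * g
        v = b2 * part.exp_avg_sq.get(i, 0.0) + (1 - b2) * g * g
        part.exp_avg[i], part.exp_avg_sq[i] = m, v
        m_hat = m / (1 - b1 ** step)  # bias-corrected first moment
        v_hat = v / (1 - b2 ** step)  # bias-corrected second moment
        decayed = part.data[i] * (1 - lr * weight_decay)
        part.data[i] = decayed - lr * m_hat / (math.sqrt(v_hat) + eps)


def step_partition(part, grads, indices, swapper, step, grad_device="gpu", **hp):
    """Route one partition through the resident or the offloaded path."""
    if not is_offloaded(part, grad_device):
        adamw_selected(part, grads, indices, step, **hp)  # resident path
        return "resident"
    on_nvme = part.data is None
    if on_nvme:
        swapper.swap_in(part)   # one partition at a time
    adamw_selected(part, grads, indices, step, **hp)
    if on_nvme:
        swapper.swap_out(part)  # write back to where the partition lives
    return "offloaded"


# Usage: an NVMe-offloaded partition is swapped in, updated, and swapped out.
swapper = ToySwapper()
swapper.store["w"] = [1.0, 2.0, 3.0]
nvme_part = Partition("w", "nvme", None)
path = step_partition(nvme_part, {1: 0.5}, [1], swapper, step=1,
                      lr=0.1, weight_decay=0.0)
```

In this toy, only the selected index of the NVMe partition changes in the backing store, and the partition returns to its placeholder state after the step. A GPU-resident partition never touches the swapper, which mirrors the intent of leaving the batched path unchanged.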

Fixes #7686

Signed-off-by: Tingfeng Lan <[email protected]>

Antlera · 2026-06-02T01:36:03Z

Hi @tjruwase @tohtana. This PR tries to fix #7686 by enabling NVMe offload for ZenFlow. I have validated the loss curve, and it looks similar on my side. Could you please review when you get a chance? Thanks!

tohtana

Thank you, @Antlera!
This looks good to me.

Antlera requested review from loadams, tjruwase and tohtana as code owners June 2, 2026 01:30

Merge branch 'master' into zenflow-stage3-nvme-support

8548790

tohtana approved these changes Jun 4, 2026

tohtana enabled auto-merge (squash) June 4, 2026 16:53

tohtana merged commit 28a196f into master Jun 4, 2026
12 checks passed

tohtana deleted the zenflow-stage3-nvme-support branch June 4, 2026 17:20

Fix ZenFlow ZeRO-3 selective optimizer crash with parameter offload on nvme #8042

tohtana merged 2 commits into master from zenflow-stage3-nvme-support
