alpha0422 · 2024-04-11T00:58:56Z

This PR enhances distributed fused Adam by:

Supporting NHWC layout (required by some convolution-based models, e.g. diffusion models);
Fixing the gradient clipping bug;
Supporting CUDA graphs.

@timmoon10 @crcrpar Please help review, thanks.

crcrpar · 2024-04-11T01:43:18Z

+    int chunk_idx = tl.block_to_chunk[blockIdx.x];
+    int n = tl.sizes[tensor_loc];
+
+    const float grad_scale = *grad_scale_ptr;


nit: it might be better to check whether grad_scale_ptr is nullptr

Added some assertions at the beginning.

crcrpar · 2024-04-11T01:44:29Z

+        local_p_out[ii] = static_cast<PARAM_OUT_T>(local_p[ii]);
+      }
+
+      // Store


would there be any appetite to use gradients after step? if so, it'd be necessary to store unscaled gradients as well.

We haven't encountered such cases yet. After the optimizer step, a new iteration starts and the gradients are zeroed out first. This is the same behavior as the other optimizer kernels in the repo, so I think we can leave it as is until a case comes up that needs the unscaled gradients to be stored.

crcrpar · 2024-04-11T01:46:57Z

+        self.state["step"] += 1 if not self.capturable else \
+            (self._dummy_overflow_buf != 1).to(torch.int)


Q: where would we decrement this value when self.capturable is True and invalid grads are found?

As you know, self.state["step"] tracks how many steps the optimizer has advanced and is used for bias correction in the CUDA kernel. When invalid grads are found, self._dummy_overflow_buf is 1, so the update is self.state["step"] += 0; otherwise it is self.state["step"] += 1. Written this way, there is nothing to decrement.
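A minimal sketch of the counter update described above, in plain Python without GPU tensors. The names advance_step and overflow_flag are hypothetical stand-ins for the update of self.state["step"] and for self._dummy_overflow_buf; the PR does the same arithmetic on tensors when capturable is set.

```python
def advance_step(step: int, overflow_flag: int) -> int:
    """Return the step counter after one optimizer call.

    overflow_flag plays the role of _dummy_overflow_buf:
    1 when invalid (inf/NaN) grads were found, 0 otherwise.
    """
    # (overflow_flag != 1) is True (adds 1) for a valid step and
    # False (adds 0) for a skipped one, so no decrement is ever needed.
    return step + int(overflow_flag != 1)


# Four optimizer calls; the third one sees invalid grads and is skipped.
step = 0
for flag in [0, 0, 1, 0]:
    step = advance_step(step, flag)
```

Because a skipped step simply adds zero, the bias-correction counter only counts the steps that were actually applied.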

uh, I misread it, thank you for correcting me.

another question: would this be really host-device sync free?

Yes, in this case distributed fused adam is sync-free.

timmoon10

As we've discussed, the grad clipping behavior is incorrect because plain PyTorch optimizers don't handle grad scaling gracefully:

# Plain PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optim)  # Clipped grads are scaled

# Distributed optimizer
optim.clip_grad_norm(max_norm)
scaler.step(optim)  # Clipped grads are scaled

I'd prefer if distopt were as close as possible to a drop-in optimizer replacement, so I don't think the current behavior should be changed.

Supporting correct grad clipping is important though. I propose the following API:

# Plain PyTorch
scaler.unscale_(optim)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
scaler.step(optim)  # Grads are not scaled

# Distributed optimizer
optim.unscale_grads(grad_scaler=scaler)
optim.clip_grad_norm(max_norm)
scaler.step(optim)  # Grads are not scaled
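To illustrate why the order of unscaling and clipping matters, here is a small sketch in plain Python. It is not the Apex or PyTorch implementation: the helpers l2_norm, clip_by_norm and unscale, and the values of loss_scale and max_norm, are assumptions chosen only for the example.

```python
import math


def l2_norm(grads):
    return math.sqrt(sum(g * g for g in grads))


def clip_by_norm(grads, max_norm, eps=1e-6):
    # Scale the gradients down so that their L2 norm is at most max_norm.
    coef = min(1.0, max_norm / (l2_norm(grads) + eps))
    return [g * coef for g in grads]


def unscale(grads, scale):
    return [g / scale for g in grads]


loss_scale = 1024.0  # hypothetical loss-scale factor
max_norm = 1.0       # hypothetical clipping threshold
true_grads = [3.0, 4.0]  # L2 norm is 5
scaled_grads = [g * loss_scale for g in true_grads]

# Clip while the grads are still scaled, then unscale inside the step.
clipped_first = unscale(clip_by_norm(scaled_grads, max_norm), loss_scale)

# Unscale first, then clip.
unscaled_first = clip_by_norm(unscale(scaled_grads, loss_scale), max_norm)

print(l2_norm(clipped_first))   # roughly max_norm / loss_scale
print(l2_norm(unscaled_first))  # roughly max_norm
```

In this toy setup, clipping the still-scaled gradients shrinks the update by an extra factor of loss_scale, whereas unscaling first leaves the clipped norm at the intended threshold.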

alpha0422 · 2024-04-11T06:24:53Z

@timmoon10, I like the idea of a drop-in optimizer replacement. Right now, distributed fused Adam sets _step_supports_amp_scaling, so neither scaler.unscale_(optim) nor optim.unscale_grads(grad_scaler=scaler) is called by PyTorch or PyTorch Lightning. The flag tells them that gradient unscaling happens inside the optimizer's step function, so gradient clipping has to be deferred to the step function too.

To support your proposal, _step_supports_amp_scaling would need to be removed, but I think that would break other use cases and hurt performance, because gradient unscaling would become an explicit pass rather than being fused with the step kernel.
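The dispatch being described can be sketched with toy classes. ToyScaler and ToyOptimizer are hypothetical and heavily simplified; they are not torch.amp.GradScaler or DistributedFusedAdam, and only mimic the idea that a scaler skips its own unscale pass when the optimizer advertises _step_supports_amp_scaling.

```python
class ToyOptimizer:
    """Toy optimizer holding scaled grads (hypothetical, for illustration)."""

    def __init__(self, grads, fused_unscale):
        self.grads = list(grads)
        self._step_supports_amp_scaling = fused_unscale

    def step(self, grad_scale):
        # Return the gradients that would actually be applied.
        return [g / grad_scale for g in self.grads]


class ToyScaler:
    """Toy grad scaler (hypothetical, for illustration)."""

    def __init__(self, scale):
        self.scale = scale
        self.unscale_calls = 0

    def unscale_(self, optimizer):
        self.unscale_calls += 1
        optimizer.grads = [g / self.scale for g in optimizer.grads]

    def step(self, optimizer):
        if getattr(optimizer, "_step_supports_amp_scaling", False):
            # The optimizer promises to unscale inside its own step,
            # so the scaler never calls unscale_ here.
            return optimizer.step(grad_scale=self.scale)
        # Otherwise the scaler unscales explicitly before stepping.
        self.unscale_(optimizer)
        return optimizer.step(grad_scale=1.0)


fused_scaler = ToyScaler(1024.0)
fused_result = fused_scaler.step(ToyOptimizer([2048.0], fused_unscale=True))

plain_scaler = ToyScaler(1024.0)
plain_result = plain_scaler.step(ToyOptimizer([2048.0], fused_unscale=False))
```

Both paths apply the same unscaled gradient in this sketch; the difference is that the fused path never goes through unscale_, which is why any clipping that needs unscaled gradients has to happen inside the step.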

timmoon10 · 2024-04-11T22:11:13Z

I've implemented my proposed API at timmoon10@0fa8e3a, although I haven't been able to test it yet.

NeMo GPT avoided these issues because it implemented a custom GradScaler that called DistributedFusedAdam.unscale_grads within GradScaler.unscale_: https://github.com/NVIDIA/NeMo/blob/c5738263d8b4bedb0957374116d3e90746a51c37/nemo/collections/nlp/parts/nlp_overrides.py#L1235. See #1512 and NVIDIA-NeMo/NeMo#4900.

_step_supports_amp_scaling is needed because otherwise GradScaler.unscale_ would attempt to access the parameters' .grads, which have probably already been reduce-scattered and set to None. The only way I can see to avoid this is to disable overlapping grad reduce-scatters with backward compute.

alpha0422 · 2024-04-12T01:20:04Z

But there's also the issue that when _step_supports_amp_scaling is set, GradScaler.unscale_ is never called by PyTorch or PyTorch Lightning. I saw that you tried to unscale here: nlp_overrides.py#L1202, but that function is never called; I confirmed this with both Stable Diffusion and LLM runs.

Overlapping reduce-scatter with bprop is quite important for performance, so I think it is necessary.

timmoon10 · 2024-04-12T03:19:39Z

I see, we need _step_supports_amp_scaling=False specifically when using nemo.collections.nlp.parts.nlp_overrides.GradScaler. However, _step_supports_amp_scaling=True is needed for correct behavior with torch.amp.GradScaler. I think the cleanest solution is to set _step_supports_amp_scaling=False in NeMo's distopt wrapper. That keeps the NeMo-specific logic separate from the general PyTorch logic in Apex. Reverting the changes to the grad clipping logic (e.g. with timmoon10@0fa8e3a) is needed to preserve correct behavior with torch.amp.GradScaler.

crcrpar reviewed Apr 11, 2024

timmoon10 suggested changes Apr 11, 2024

timmoon10 mentioned this pull request Apr 12, 2024

Fix Distributed Fused Adam Issues NVIDIA-NeMo/NeMo#8880

Merged

8 tasks

alpha0422 marked this pull request as draft April 12, 2024 08:03

alpha0422 marked this pull request as ready for review April 25, 2024 11:26

alpha0422 mentioned this pull request Apr 25, 2024

Enhance Distributed Adam NVIDIA-NeMo/NeMo#9037

Merged

8 tasks

alpha0422 added 9 commits April 25, 2024 21:36

Support NHWC for distributed fused adam.

5196c14

Fix the gradient clipping bug with distributed adam.

3365ad6

Support CUDA graph for distributed fused adam.

10af2d0

Make sure key pointers are valid.

6923d24

Better repr for distributed adam.

558a3ff

Warn if capturable is set but deprecated fused adam is not found.

fcea0e1

Preserve memory format in parameter buffer of distributed adam.

b195f64

Preserve memory format during parameter copy.

09c934c

Fix the contiguous_param_buffer bug about bprop overlap and redundant copy after all-gather.

c466e70

alpha0422 force-pushed the wkong/dist-adam branch from 0304852 to c466e70 Compare April 25, 2024 13:39

Fix the bug that process group is not set.

0213ada

Aidyn-A changed the base branch from master to 24.04.01-devel April 28, 2024 04:50

Aidyn-A merged commit 4138d31 into NVIDIA:24.04.01-devel Apr 28, 2024

github-actions Bot mentioned this pull request Apr 28, 2024

Enhance Distributed Adam NVIDIA-NeMo/NeMo#9051

Merged

8 tasks

alpha0422 mentioned this pull request Aug 22, 2024

Enhance Distributed Fused Adam #1832

Merged
