
Conversation

rattus128 (Contributor) commented Oct 1, 2025

When the VAE catches a VRAM OOM during the regular (untiled) pass, it launches the tiler fallback logic straight from the exception context.

Python, however, keeps a reference to the entire call stack that raised the exception, including all local variables, for the sake of exception reporting and debugging. When those locals are tensors, this can pin GBs of VRAM and prevent the allocator from freeing them.
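As a minimal CPU-side illustration of the mechanism (a plain list stands in for a GB-scale CUDA tensor; all names here are made up for the demo):

import sys

def failing_encode():
    big = [0] * 10_000_000  # stands in for a multi-GB CUDA tensor
    raise RuntimeError("CUDA out of memory")

try:
    failing_encode()
except RuntimeError:
    # While inside the except block, the in-flight exception's traceback
    # keeps every frame alive, including failing_encode()'s locals:
    tb = sys.exc_info()[2]
    print("big" in tb.tb_next.tb_frame.f_locals)  # True -- still referenced
# Once the except block is exited, the exception and its traceback are
# dropped, and "big" becomes collectable.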

So drop the except context completely before going back into the VAE via the tiler: exit the except block carrying nothing but a flag, and launch the tiled retry from outside it. A simplified sketch of the pattern follows.
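Sketch of the shape of the fix (the untiled helper name is a placeholder, not the exact comfy/sd.py code; model_management.OOM_EXCEPTION is ComfyUI's OOM exception type):

def encode(self, pixel_samples):
    try:
        samples = self.encode_untiled_(pixel_samples)  # placeholder for the regular path
        oom = False
    except model_management.OOM_EXCEPTION:
        # Leave the except block carrying nothing but this flag, so the
        # traceback (and every tensor it pins) is dropped immediately.
        oom = True
    if oom:
        # By this point the failed attempt's VRAM is freeable again,
        # and the tiler gets the full headroom.
        samples = self.encode_tiled_(pixel_samples)
    return samples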

This greatly increases the reliability of the tiler fallback, especially on low-VRAM cards: with the bug, if the leak happened to pin more VRAM than the headroom needed for a single tile, the tiler fallback itself would OOM and fail the whole flow.

Test conditions:

- 768x768x13f WAN 2.1 VAE encode using the regular VAE Encode node (latent saved to file to terminate the flow)
- NVIDIA GeForce GTX 1660 SUPER (6GB)
- python main.py --novram --disable-cuda-malloc (--disable-cuda-malloc is needed for the VRAM tracing)

Here is the VRAM usage over time before the fix:

[Image: full-leak — VRAM usage trace before the fix]

The first big peak on the left is the attempt to do it untiled, which OOMs. The repeating clusters of 4 little peaks thereafter are the individual tiles; each peak is a latent frame (4 latent frames for a 13-frame encode). The giant horizontal bars under the little peaks are the leak.

With this change:

[Image: fixed — VRAM usage trace after the fix]

No more giant bars and the tiler has the full GPU VRAM to work with.

NOTE: Printing torch's VRAM usage counters confirms the bug is independent of the --disable-cuda-malloc flag.

Test instrumentation diff:

--- a/comfy/sd.py
+++ b/comfy/sd.py
@@ -702,6 +702,7 @@ class VAE:
         return output.movedim(1, -1)
 
     def encode(self, pixel_samples):
+        torch.cuda.memory._record_memory_history()
         self.throw_exception_if_invalid()
         pixel_samples = self.vae_encode_crop_pixels(pixel_samples)
         pixel_samples = pixel_samples.movedim(-1, 1)
@@ -743,6 +744,7 @@ class VAE:
             else:
                 samples = self.encode_tiled_(pixel_samples)
 
+        torch.cuda.memory._dump_snapshot("memory_trace.pickle")
         return samples
 
     def encode_tiled(self, pixel_samples, tile_x=None, tile_y=None, overlap=None, tile_t=None, overlap_t=None):
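For anyone reproducing the trace: the snapshot written by torch.cuda.memory._dump_snapshot() can be inspected by dragging memory_trace.pickle into PyTorch's viewer at https://pytorch.org/memory_viz, or (in recent torch versions) rendered to a standalone HTML timeline from the command line:

python -m torch.cuda._memory_viz trace_plot memory_trace.pickle -o memory_trace.html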

Kosinkadink (Collaborator) commented:

Nice! Will try to get this reviewed and merged Wednesday afternoon PST.

@rattus128 requested a review from chaObserv — Oct 1, 2025 10:51
@Kosinkadink added the "Good PR" label (This PR looks good to go, it needs comfy's final review.) — Oct 1, 2025
@comfyanonymous merged commit 911331c into comfyanonymous:master — Oct 1, 2025
13 of 14 checks passed