Try not to fail when there should be memory available #2869
awni merged 3 commits into ml-explore:main
Conversation
I think there is a performance issue here, so moving this into draft.
Evidently calling
Ok, I fixed this and I don't see a regression in perf. I think the basic premise of what is happening is: even though the MLX cache + active memory is well under the limit, there is fragmentation, and since we are using async free, the device is not able to return memory to the OS before every call to malloc. So CUDA can fail an allocation even when the total amount of free memory exceeds the requested size.
I uploaded a script that repros the issue on B200 (and should on H100 with a smaller batch size). Just leaving it here for reference.
mlx/backend/cuda/allocator.cpp
return loc;
}
#else
int cuda_mem_loc(int i) {
size_t used = 0;
CHECK_CUDA_ERROR(cudaMemPoolGetAttribute(
    p, cudaMemPoolAttrReservedMemCurrent, &used));
if (used > (total_memory_ - free_limit_)) {
Why have a free_limit_? The code would read easier for me if it were just:
if (used > memory_limit_) {
  buffer_cache_.release_cached_buffers(total_memory_ - memory_limit_);
}
Good question. memory_limit_ can change (the user can set the memory limit to be higher or lower). I wanted a value that was fixed based on the total device memory.
What do you think about using hard_memory_limit_/soft_memory_limit_? (Just nitpicking, I'm good with free_limit_ too.)
I don't really love hard_memory_limit because it's not a hard limit.
It's more like a soft memory limit on the underlying CUDA pool. I'll think a bit more about how to phrase it.
The other thing that is a bit of a mess in our allocator is how we deal with multi-device on a discrete setup where each device has its own memory.
I think at some point it might make sense to have a separate buffer cache for each device and one for the managed allocator.
There are a couple of cases where MLX can fail with an OOM when there is actually memory available:
cudaMallocAsync returns nullptr even when there should be enough RAM outside the cache + used memory. I believe this is due to fragmentation. Instead of failing in this case, we free from the cache and then try again.