Try not to fail when there should be memory available #2869
awni merged 3 commits into ml-explore:main
Conversation
I think there is a performance issue here, so moving this into draft.
Evidently calling
Ok, I fixed this and I don't see a regression in perf. I think the basic premise of what is happening is: even though the MLX cache + active memory is well under the limit, there is fragmentation, and since we are using async free, the device is not able to return memory to the OS before every call to malloc. So CUDA can fail an allocation even when the total amount of free memory exceeds the requested size.
I uploaded a script that repros the issue on B200 (and should on H100 with a smaller batch size). Just leaving it here for reference.
mlx/backend/cuda/allocator.cpp
return loc;
}
#else
int cuda_mem_loc(int i) {
size_t used = 0;
CHECK_CUDA_ERROR(cudaMemPoolGetAttribute(
    p, cudaMemPoolAttrReservedMemCurrent, &used));
if (used > (total_memory_ - free_limit_)) {
Why have a free_limit_? The code would read easier for me if it were just:
if (used > memory_limit_) {
  buffer_cache_.release_cached_buffers(total_memory_ - memory_limit_);
}
Good question. memory_limit_ can change (the user can set the memory limit to be higher or lower). I wanted a value that was fixed based on the total device memory.
What do you think about using hard_memory_limit_/soft_memory_limit_? (Just nitpicking, I'm good with free_limit_ too.)
I don't really love hard_memory_limit because it's not a hard limit.
It's more like a soft memory limit on the underlying CUDA pool. I'll think a bit more about how to phrase it.
The other thing that is a bit of a mess in our allocator is how we deal with multi-device on a discrete setup where each device has its own memory.
I think at some point it might make sense to have a separate buffer cache for each device and one for the managed allocator.
There are a couple of cases where MLX can fail with an OOM when there is actually memory available:
cudaMallocAsync returns nullptr even when there should be enough RAM outside the cache + used memory. I believe this is due to fragmentation. Instead of failing in this case, we free from the cache and then try again.