MemoryCache creates lots of tasks that cause CPU/Memory spikes on compaction. #97736
Tagging subscribers to this area: @dotnet/area-extensions-caching

Issue Details

Description:

The problem we’re addressing here is the significant increase in memory and processor usage when a cache nears its size limit and begins the compaction process. This issue becomes particularly noticeable under certain conditions.

Specifically, this situation arises when the cache size limit is sufficiently large and items are being added to the cache frequently. As the cache fills up and approaches its size limit, it triggers the compaction process, leading to a surge in memory and processor usage.

The reason behind this surge is tied to the TriggerOvercapacityCompaction() function. Each time an item is set in the cache, this function creates a task that runs Compact(). So if items are added to the cache frequently once the cache has reached its size limit, a lot of tasks are created. However, having multiple threads running the Compact() function doesn’t actually speed up the compaction process, because all of these threads are essentially performing the same work of iterating over all cache entries and sorting them by last accessed time. Therefore, there is no benefit to having more than one Compact() call running simultaneously. A large number of Compact() tasks running at the same time leads to the increased memory and processor usage. This seems to happen when the number of workers in the thread pool is sufficiently large, for example when the COMPlus_ThreadPool_ForceMinWorkerThreads environment variable is set to a big value.

This was discovered and tested on .NET 6, but from the code it seems the same problem exists on other versions too.

How the problem was discovered

After we started using MemoryCache in our application, we started seeing short spikes in server load, CPU and memory usage.

Load + requests:

CPU load at the same time the big spike happened:

Memory usage (the first chart shows usage before the spike, so that we can see it increased from 30 GB to 95 GB and then returned to normal after the spike ended):

To find out what exactly caused those spikes, we collected a diagnostic trace during such a spike. By analyzing the diagnostic trace, we can see that the overwhelming amount of time is taken by MemoryCache.Compact(). But we never call it from our code, so these are only calls made by MemoryCache itself.

Problem in code

The problem lies in MemoryCache.cs, where on each set, if UpdateCacheSizeExceedsCapacity() evaluates to true, TriggerOvercapacityCompaction() gets called, and it runs OvercapacityCompaction() on a thread from the thread pool. If cache sets happen frequently enough, too many tasks that run compaction are created.

private void TriggerOvercapacityCompaction()
{
    if (_logger.IsEnabled(LogLevel.Debug))
        _logger.LogDebug("Overcapacity compaction triggered");

    // Here we don't have any mechanism that would protect us from running
    // this on too many threads at the same time.
    ThreadPool.QueueUserWorkItem(s => ((MemoryCache)s!).OvercapacityCompaction(), this);
}

Making matters worse, running compaction on multiple threads does not make it faster, since many threads will already have filled the priority bucket lists in Compact() with the whole cache before any entries are actually removed. So those threads will still sort all of those items by last accessed time in ExpirePriorityBucket() even if the entries have already been removed from the underlying ConcurrentDictionary, making them run even longer (especially if the cache size is huge).
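For intuition, the sketch below is my own simplification, not the actual MemoryCache code (the entry type and names are made up). It only shows the shape of the work each concurrent compaction pass repeats: snapshot all entries, sort them by last-accessed time, then remove the oldest until enough size is freed. Two threads doing this concurrently each pay the full snapshot-and-sort cost.

// Simplified illustration only - NOT the actual MemoryCache implementation.
// It shows why N concurrent compaction passes do roughly N times the same work:
// each pass snapshots and sorts the whole entry set before it removes anything.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

record CacheEntrySketch(string Key, DateTimeOffset LastAccessed, long Size);

class CompactionSketch
{
    private readonly ConcurrentDictionary<string, CacheEntrySketch> _entries = new();

    public void Compact(long sizeToRemove)
    {
        // Every concurrent caller builds its own full snapshot of the cache...
        List<CacheEntrySketch> candidates = _entries.Values.ToList();

        // ...and sorts it by last-accessed time (the expensive part), even if
        // another thread is already removing the very same entries.
        candidates.Sort((a, b) => a.LastAccessed.CompareTo(b.LastAccessed));

        long removed = 0;
        foreach (CacheEntrySketch entry in candidates)
        {
            if (removed >= sizeToRemove)
                break;
            if (_entries.TryRemove(entry.Key, out _))
                removed += entry.Size;
        }
    }
}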
Reproducing it locally

Here I have created a simple script that reproduces the problem locally on my machine on .NET 6:

using Microsoft.Extensions.Caching.Memory;

// This is needed to simulate a server environment, where lots of threads are available in the thread pool.
// The COMPlus_ThreadPool_ForceMinWorkerThreads environment variable can be used instead.
ThreadPool.SetMinThreads(1000, 1000);
// Here the SizeLimit should be high enough to be able to reproduce the issue
IMemoryCache memoryCache = new MemoryCache(new MemoryCacheOptions { SizeLimit = 700000 });
long key = 0;
// Fill the cache to its maximum capacity to simulate the moment when the size limit is reached
for (int i = 0; i < 700000; i++)
{
memoryCache.Set(key.ToString(), 0, new MemoryCacheEntryOptions { Size = 1, AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(1000) });
key++;
}
Console.WriteLine("Filled up cache initially. Starting setting cache values in 1 second.");
Thread.Sleep(1000);
var tasks = new List<Task>();
// Here we are creating lots of tasks that add items to the cache at the same time
// to simulate a busy cache. The numbers here are just magic numbers, but the number of threads
// and the number of items added by each thread should be high enough to reproduce the issue
for (long i = 0; i < 70; i++) {
var startIndex = 7000 * i;
var endIndex = startIndex + 7000 - 1;
startIndex += key;
endIndex += key;
// Create threads that will be adding items to the cache
var task = Task.Run(() => { setCacheValues(startIndex, endIndex); });
tasks.Add(task);
}
// Wait until all items have been added (since there is not enough space for all of them, Compact will be triggered frequently to free up some space)
Task.WhenAll(tasks).Wait();
Console.WriteLine("Finished adding elements.");
void setCacheValues(long startIndex, long endIndex) {
for (long i = startIndex; i < endIndex; i++) {
memoryCache.Set(i.ToString(), 0, new MemoryCacheEntryOptions { Size = 1, AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(1000) });
}
}

When running it, after the cache is initially filled to its full capacity, you will notice that memory usage grows very fast and can reach very high values. (This is abnormal, since once the cache is filled to its capacity it should not keep growing; it should remove old items to make room for new ones.) You will also notice that it takes abnormally long to run until it outputs "Finished adding elements." (20 minutes on my machine).

We took a memory dump about a minute after starting this script. Analyzing it with Visual Studio, we can see that we have 81k compact tasks waiting in the thread pool queue, which proves that TriggerOvercapacityCompaction() spams them.

We also used WinDbg and the SOS !dumpheap command to analyze the dump produced by this script. We can see that 12 GB of memory is taken by objects of type Microsoft.Extensions.Caching.Memory.MemoryCache+CompactPriorityEntry[]. This type is used only in Compact() and is not referenced anywhere else, so we can see that it is what causes the high memory usage.
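As an optional way to watch the backlog build up while the repro script runs (this snippet is my addition, not part of the original investigation), a small monitor task can poll ThreadPool.PendingWorkItemCount (available on .NET Core 3.0 and later) and the managed heap size; the queued compaction callbacks are included in that pending count.

// Optional monitor for the repro script above (my addition). Start it right
// before the item-adding tasks to watch queued work items and heap size grow.
var monitor = Task.Run(async () =>
{
    while (true)
    {
        long pending = ThreadPool.PendingWorkItemCount; // queued thread pool work items
        long heapMb = GC.GetTotalMemory(forceFullCollection: false) / (1024 * 1024);
        Console.WriteLine($"Pending work items: {pending}, heap: {heapMb} MB");
        await Task.Delay(TimeSpan.FromSeconds(1));
    }
});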
Proposed solution

As a solution, we can limit the number of threads that can run compaction at the same time to one. This will not make compacting slower, since having multiple threads run Compact() doesn't make it faster; they would all do the same work.

This can be done by replacing the current TriggerOvercapacityCompaction() with:

private int lockFlag = 0;

private void TriggerOvercapacityCompaction()
{
    if (_logger.IsEnabled(LogLevel.Debug))
        _logger.LogDebug("Overcapacity compaction triggered");

    // If no thread is currently running compaction - take the lock and start compacting.
    // If there is already a thread running compaction - do nothing.
    if (Interlocked.CompareExchange(ref lockFlag, 1, 0) == 0)
    {
        // Spawn a background work item for the compaction
        ThreadPool.QueueUserWorkItem(s =>
        {
            try
            {
                ((MemoryCache)s!).OvercapacityCompaction();
            }
            finally
            {
                lockFlag = 0; // Release the lock
            }
        }, this);
    }
}

After making this change, the local problem was resolved: the script that previously ran for 20 minutes now takes 5 seconds, and it consumes less than 500 MB of memory, where before it consumed up to 20 GB.
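As a side note on the pattern itself, here is a minimal self-contained sketch (the names and delays are made up for illustration; the release here uses Volatile.Write purely to make the publish explicit, whereas the proposal above uses a plain write) showing how the Interlocked.CompareExchange gate lets many concurrent triggers fire while at most one background pass runs at a time; the finally block reopens the gate so a later trigger can start a new pass.

using System;
using System.Threading;
using System.Threading.Tasks;

class SingleFlightGateDemo
{
    private static int _gate;            // 0 = idle, 1 = a pass is running
    private static int _passesStarted;

    static void TriggerPass()
    {
        // Only the caller that flips the gate from 0 to 1 queues work.
        if (Interlocked.CompareExchange(ref _gate, 1, 0) == 0)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                try
                {
                    Interlocked.Increment(ref _passesStarted);
                    Thread.Sleep(100);   // stand-in for the expensive compaction work
                }
                finally
                {
                    Volatile.Write(ref _gate, 0); // release the gate for the next trigger
                }
            });
        }
    }

    static void Main()
    {
        Parallel.For(0, 10_000, _ => TriggerPass());
        Thread.Sleep(500);
        // Expect a very small number here instead of thousands of queued passes.
        Console.WriteLine($"Passes started: {_passesStarted}");
    }
}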
Thanks for filing this. Your solution makes sense to me. Are you interested in submitting a PR with your fix?
Does this also happen with the new
Yes, I would be interested in creating a PR for this, but I just noticed that somebody already created PR #103992. I think it would be fair if I had a chance to continue working on a fix for this issue, since I have spent a lot of time investigating the problem, created this issue and proposed the solution. So what should I do in this case?
@myk0la999 you could review the change. Additionally, I hope @ADNewsom09 can amend the first commit to list you as the author and himself as the committer; hopefully that gives both of you credit.
I spent a bit of time trying to amend the older commit following https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/changing-a-commit-message and https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors, and I haven't been able to put everything together.
The change is exactly what I proposed, so there is nothing new for me to review. But there was a question about the memory model on that PR and I answered it. (I am technically not the PR author, but since I investigated this before proposing the solution, I already had an answer to that question.)