MemoryCache creates lots of tasks that cause CPU/Memory spikes on compaction. #97736


Closed

myk0la999 opened this issue Jan 31, 2024 · 8 comments


myk0la999 commented Jan 31, 2024

Description:

The problem we’re addressing here is the significant increase in memory and processor usage when a cache nears its size limit and begins the compaction process. This issue becomes particularly noticeable under certain conditions.

Specifically, this situation arises when the cache size limit is sufficiently large and items are being added to the cache frequently. As the cache fills up and approaches its size limit, it triggers the compaction process, leading to a surge in memory and processor usage.

The reason behind this surge is tied to the TriggerOvercapacityCompaction() function. Each time an item is set in the cache while it is over its size limit, this function queues a work item that runs Compact(). So if items are added frequently once the cache has reached its size limit, a large number of these work items are created. However, having multiple threads run Compact() does not actually speed up the compaction process: all of these threads perform the same work of iterating over all cache entries and sorting them by last accessed time, so there is no benefit to more than one Compact() call running at a time. A large number of Compact() tasks running simultaneously is what leads to the increased memory and processor usage. This tends to happen when the number of workers in the thread pool is sufficiently large, for example when the COMPlus_ThreadPool_ForceMinWorkerThreads environment variable is set to a big value.

This was discovered and tested on .NET 6, but from the code it seems the same problem exists on other versions too.

How the problem was discovered

After we started using MemoryCache in our application, we started seeing short spikes in server load, CPU, and memory usage.

Load + requests:

[screenshot]

CPU load at the time of a big spike:

[screenshot]

Memory usage (the first image shows usage before the spike, so we can see that it increased from 30 GB to 95 GB and then returned to normal after the spike ended):

[screenshot]
[screenshot]

To find what exactly caused those spikes we collected a diagnostic trace during such a spike.
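For anyone reproducing this: a trace like that can be collected with the dotnet-trace global tool (<PID> below is a placeholder for the server's process ID):

dotnet tool install --global dotnet-trace
dotnet-trace collect --process-id <PID>

Stopping collection with Ctrl+C produces a .nettrace file that can be opened in Visual Studio or PerfView.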

By analyzing the diagnostic trace, we can see that an overwhelming amount of time is taken by MemoryCache.Compact(). But we never call it from our code, so these are only calls made by MemoryCache itself.

[screenshot]
[screenshot]

Problem in code

The problem lies in MemoryCache.cs: on each set, if UpdateCacheSizeExceedsCapacity() evaluates to true, TriggerOvercapacityCompaction() gets called, and it runs OvercapacityCompaction() on a thread-pool thread. If cache sets happen frequently enough, too many work items that run compaction are queued.

private void TriggerOvercapacityCompaction()
{
    if (_logger.IsEnabled(LogLevel.Debug))
        _logger.LogDebug("Overcapacity compaction triggered");

    // Here we don't have any mechanism that would protect us from running
    // this on too many threads at the same time.
    ThreadPool.QueueUserWorkItem(s => ((MemoryCache)s!).OvercapacityCompaction(), this);
}

Making matters worse, running compaction on multiple threads does not make it faster: many threads will each fill the priority bucket lists in Compact() with the whole cache before any entries are actually removed. Those threads will then still sort all of those items by last accessed time in ExpirePriorityBucket(), even if the items have already been removed from the underlying ConcurrentDictionary, making the threads run even longer (especially if the cache is huge).
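To illustrate why the concurrent calls are redundant, here is a simplified sketch of what each Compact() call effectively does (an approximation, not the actual implementation; CacheEntry, LastAccessed, and _entries stand in for the real internals):

// Simplified sketch of one Compact() pass over the cache.
// CacheEntry, LastAccessed and _entries are approximations of the real internals.
private void CompactSketch(long sizeToRemove)
{
    var lowPri = new List<CacheEntry>();
    var normalPri = new List<CacheEntry>();
    var highPri = new List<CacheEntry>();

    // Every concurrent caller snapshots the ENTIRE dictionary into the buckets...
    foreach (CacheEntry entry in _entries.Values)
    {
        switch (entry.Priority)
        {
            case CacheItemPriority.Low: lowPri.Add(entry); break;
            case CacheItemPriority.Normal: normalPri.Add(entry); break;
            case CacheItemPriority.High: highPri.Add(entry); break;
        }
    }

    // ...and then sorts each bucket by last accessed time (O(n log n) per caller),
    // even for entries another thread has already removed from the dictionary.
    lowPri.Sort((a, b) => a.LastAccessed.CompareTo(b.LastAccessed));

    // Finally it removes the oldest entries, bucket by bucket, until sizeToRemove
    // has been reclaimed. With N concurrent callers this whole walk-and-sort
    // happens N times for (almost) no additional removals.
}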

Reproducing it locally

Here is a simple script that reproduces the problem locally on my machine on .NET 6:

using Microsoft.Extensions.Caching.Memory;

// This is needed to simulate a server environment, where lots of threads are available in the thread pool.
// The COMPlus_ThreadPool_ForceMinWorkerThreads environment variable can be used instead.
ThreadPool.SetMinThreads(1000, 1000);

// Here the SizeLimit should be high enough to be able to reproduce the issue
IMemoryCache memoryCache = new MemoryCache(new MemoryCacheOptions { SizeLimit = 700000 });

long key = 0;

// Fully fill the cache to its maximum capacity to simulate the moment when the size limit is reached
for (int i = 0; i < 700000; i++)
{
    memoryCache.Set(key.ToString(), 0, new MemoryCacheEntryOptions { Size = 1, AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(1000) });
    key++;
}

Console.WriteLine("Filled up cache initially. Starting setting cache values in 1 second.");
Thread.Sleep(1000);

var tasks = new List<Task>();

// Here we create lots of tasks that add items to the cache at the same time
// to simulate a busy cache. The numbers are just magic numbers, but the number of tasks
// and the number of items added by each task should be high enough to reproduce the issue
for (long i = 0; i < 70; i++) {
    var startIndex = 7000 * i;
    var endIndex = startIndex + 7000 - 1;

    startIndex += key;
    endIndex += key;

    // Create tasks that will be adding items to the cache
    var task = Task.Run(() => { setCacheValues(startIndex, endIndex); });
    tasks.Add(task);
}

// Wait until all items are added (since we don't have enough space for all of them, compact will be frequently triggered to free up space)
Task.WhenAll(tasks).Wait();

Console.WriteLine("Finished adding elements.");

void setCacheValues(long startIndex, long endIndex) {
    // <= because endIndex is inclusive, so each task adds the full 7000 items
    for (long i = startIndex; i <= endIndex; i++) {
        memoryCache.Set(i.ToString(), 0, new MemoryCacheEntryOptions { Size = 1, AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(1000) });
    }
}

When running it, after the cache is initially filled to its full capacity, you will notice that memory usage grows very fast and can reach very high values. (This is abnormal: once the cache is filled to capacity, its footprint should not keep growing, because old items should be removed to make room for new ones.) You will also notice that it takes abnormally long until it outputs "Finished adding elements." (20 minutes on my machine).
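While the script runs, the abnormal growth can be watched live with the dotnet-counters tool (<PID> is a placeholder for the script's process ID):

dotnet-counters monitor --process-id <PID> --counters System.Runtime

The GC heap size and thread pool queue length counters should climb steadily instead of staying flat.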

We took a memory dump about a minute after starting this script. Analyzing it with Visual Studio, we can see 81k compact work items waiting in the thread pool queue, which proves that TriggerOvercapacityCompaction() spams them.

[screenshot]

We also used WinDbg and the SOS !dumpheap command to analyze a dump produced by this script. We can see 12 GB of memory taken by objects of type Microsoft.Extensions.Caching.Memory.MemoryCache+CompactPriorityEntry[]. This type is used only in Compact() and is not referenced anywhere else, so it is the source of the high memory usage.
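For reference, after opening the dump in WinDbg and loading SOS, the relevant commands are along these lines:

!dumpheap -stat
!dumpheap -type Microsoft.Extensions.Caching.Memory.MemoryCache+CompactPriorityEntry

The -stat form shows the per-type totals where the CompactPriorityEntry[] arrays dominate; the -type form lists the individual arrays.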

[screenshot]

Proposed solution

As a solution, we can limit the number of threads that can run compaction at the same time to one. This will not make compacting slower, since having multiple threads run Compact() doesn't make it faster anyway: they would all do the same work.

This can be done by replacing the current TriggerOvercapacityCompaction() with:

private int lockFlag = 0;

private void TriggerOvercapacityCompaction()
{
    if (_logger.IsEnabled(LogLevel.Debug))
        _logger.LogDebug("Overcapacity compaction triggered");

    // If no thread is currently running compaction - take the flag and start one.
    // If a thread is already running compaction - do nothing.
    if (Interlocked.CompareExchange(ref lockFlag, 1, 0) == 0)
    {
        // Spawn a background work item for the compaction
        ThreadPool.QueueUserWorkItem(s =>
        {
            try
            {
                ((MemoryCache)s!).OvercapacityCompaction();
            }
            finally
            {
                // Release the flag. Interlocked.Exchange (rather than a plain
                // write) ensures the release is safely visible to the next
                // CompareExchange on any thread.
                Interlocked.Exchange(ref lockFlag, 0);
            }
        }, this);
    }
}

After making this change, the local problem is resolved: the script that previously took 20 minutes now runs in 5 seconds, and it now consumes less than 500 MB of memory, where before it consumed up to 20 GB.
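The timing can be checked by wrapping the add phase of the repro script in a standard Stopwatch; this is just measurement scaffolding, not part of the fix:

using System.Diagnostics;

var stopwatch = Stopwatch.StartNew();

// the add phase from the repro script above
Task.WhenAll(tasks).Wait();

stopwatch.Stop();
Console.WriteLine($"Finished adding elements in {stopwatch.Elapsed}.");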

@myk0la999 myk0la999 added the tenet-performance Performance related issue label Jan 31, 2024
@ghost ghost added untriaged New issue has not been triaged by the area owner area-Extensions-Caching labels Jan 31, 2024

ghost commented Jan 31, 2024

Tagging subscribers to this area: @dotnet/area-extensions-caching
See info in area-owners.md if you want to be subscribed.



jozkee commented Jun 25, 2024

Thanks for filing this. Your solution makes sense to me, are you interested in submitting a PR with your fix?

@jozkee jozkee added this to the 9.0.0 milestone Jun 25, 2024
@julealgon

Does this also happen with the new HybridCache? Does it rely on the existing MemoryCache or does it do this part from scratch?


jozkee commented Jun 25, 2024

Does this also happen with the new HybridCache?

@mgravell


myk0la999 commented Jun 25, 2024

Thanks for filing this. Your solution makes sense to me, are you interested in submitting a PR with your fix?

Yes, I would be interested in creating a PR for this, but I just noticed that somebody already created PR #103992.

I think it would be fair if I had a chance to continue working on a fix for this issue, since I have spent a lot of time investigating the problem, created this issue, and proposed the solution.

So what should I do in this case?


jozkee commented Jun 26, 2024

@myk0la999 you could review the change. Additionally, I hope @ADNewsom09 can amend the first commit to list you as the author and himself as the committer; hopefully that gives both of you credit.

cc @richlander @danmoseley

@myk0la999

@myk0la999 you could review the change.

The change is exactly what I proposed, so there is nothing new for me to review. But there was a question about the memory model on that PR and I answered it. (I am technically not the PR author, but since I investigated this before proposing the solution, I already had an answer to that question.)

@jozkee jozkee closed this as completed in 7899950 Jun 28, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2024