Improve ConcurrentBag performance #14126
```csharp
{
    if (currentList._head != null)
        // at least this list is not empty, we return false
        FreezeBag(ref lockTaken);
```
Can this fail to take the lock?
Only in extreme cases, e.g. OOM while JIT'ing one of the helper methods. It's primarily a holdover from running on desktop where thread aborts are possible and the more common reason for failure.
Should we do something when this fails? It doesn't look like we bail out if lockTaken is false.
It'll only be false if an exception is being thrown; e.g. an assert that it's true after this call would always pass.
Is it worth doing a double pass, e.g. a first pass with TryEnter(..., 0)?
It would need to be measured, though my guess is that it originally did that, it was measured and shown to be bad or not worthwhile, and the code was updated to remove it but the comment got left behind as stale (I think that's more likely than someone just adding the comment and not implementing what the comment said). Regardless, it'd be a fine thing to experiment with, but separately from this change. Such a change would very likely improve some situations and degrade others, so we'd need to make sure we were measuring the right scenarios.
TryEnter used to carry a lot of cost; that has since changed.
Makes sense. Very happy with this change; ConcurrentBag is in theory a much better pooling structure, though in practice ConcurrentQueue works better. It will be nice to bring back its advantages by cutting its costs 😄
LGTM - very nice. @stephentoub, could you add your perf tests as a gist so they can be reused for future perf investigations?
Adding items from a single thread:

```csharp
using System;
using System.Diagnostics;
using System.Collections.Concurrent;

class Test
{
    public static void Main()
    {
        var sw = new Stopwatch();
        while (true)
        {
            var bag = new ConcurrentBag<int>();
            int gen0 = GC.CollectionCount(0);
            sw.Restart();
            for (int i = 0; i < 10000000; i++) bag.Add(i);
            sw.Stop();
            Console.WriteLine(sw.Elapsed + " " + (GC.CollectionCount(0) - gen0));
        }
    }
}
```

Adding and taking items:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Linq;

class Test
{
    public static void Main()
    {
        var sw = new Stopwatch();
        var bag = new ConcurrentBag<int>();
        while (true)
        {
            int gen0 = GC.CollectionCount(0);
            sw.Restart();
            Task.WaitAll(Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(() =>
            {
                for (int i = 0; i < 5000000; i++)
                {
                    bag.Add(i);
                    bag.Add(i);
                    int item;
                    bag.TryTake(out item);
                    bag.TryTake(out item);
                }
            })).ToArray());
            sw.Stop();
            Console.WriteLine(sw.Elapsed + " " + (GC.CollectionCount(0) - gen0));
        }
    }
}
```

Adding and stealing items:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;

class Test
{
    public static void Main()
    {
        var sw = new Stopwatch();
        var bag = new ConcurrentBag<int>();
        while (true)
        {
            int gen0 = GC.CollectionCount(0);
            sw.Restart();
            Task.WaitAll(
                Task.Run(() =>
                {
                    for (int i = 0; i < 10000000; i++) bag.Add(i);
                }),
                Task.Run(() =>
                {
                    int count = 0;
                    while (count < 10000000)
                    {
                        int item;
                        if (bag.TryTake(out item)) count++;
                    }
                }));
            sw.Stop();
            Console.WriteLine(sw.Elapsed + " " + (GC.CollectionCount(0) - gen0));
        }
    }
}
```
True... probably worth another (separate) look given improvements that have been made there.
Rest looks good to me
```csharp
_head._prev = node;
_head = node;
// Full fence to ensure subsequent reads don't get reordered before this
Interlocked.Exchange(ref _currentOp, (int)ConcurrentBagListOperation.Add);
```
It looks like these ordering restrictions are critical to prevent some races with TrySteal. It would be helpful to describe just one case that demonstrates that this memory barrier between the write and next read is necessary to avoid a race (and likewise for other similar barriers).
I believe the concern is that, with just the acquire/release semantics from the volatile reads/writes, it's theoretically possible for the subsequent reads of _tailIndex and _headIndex to move up before the write to _currentOp. If that were to happen, a steal could end up mucking with _headIndex without waiting for this in-flight local push to quiesce, even if _headIndex and _tailIndex were close enough together that the push and steal should have synchronized with each other. I'll add a comment.
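A minimal sketch of the pattern being described, assuming a simplified model of the list (the field names mirror the PR's, but this is illustrative, not the actual implementation): the write to `_currentOp` must become globally visible before the subsequent reads of `_headIndex`/`_tailIndex`, which a plain or volatile write alone does not guarantee.

```csharp
using System;
using System.Threading;

class WorkStealingListSketch
{
    private int _currentOp;                    // 0 == None, 1 == Add (mirrors ConcurrentBagListOperation)
    private volatile int _headIndex, _tailIndex;

    public int BeginAdd()
    {
        // A volatile write has only release semantics: the reads below could
        // otherwise be reordered before it. Interlocked.Exchange is a full
        // fence, so _headIndex/_tailIndex are guaranteed to be read only after
        // the in-flight Add has been published to any concurrent stealer.
        Interlocked.Exchange(ref _currentOp, 1 /* Add */);

        int head = _headIndex, tail = _tailIndex; // safe to read only after the fence
        // ... actual push logic would go here ...

        _currentOp = 0;                        // reset to None; release semantics suffice
        return tail - head;                    // current count, just to use the reads
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(new WorkStealingListSketch().BeginAdd());
    }
}
```

With only acquire/release semantics, a steal could observe the old `_currentOp` (None) and the old indices simultaneously, and proceed without waiting for the push to quiesce.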
```csharp
// We're going to increment the tail; if we'll overflow, then we need to reset our counts
if (tail == int.MaxValue)
{
    lock (this)
```
Does `_currentOp` need to be reset to None here too, to avoid a deadlock with TrySteal, and set again after releasing the lock?
It definitely does. I'll fix. Good catch.
```csharp
// When there are at least 2 elements' worth of space (and there aren't pending operations that
// would force us to synchronize), we can take the fast path without locking.
if (!needsSync && tail < (_headIndex + _mask))
```
Why is it necessary for there to be 2 elements' worth of space? I also don't understand why inside the lock, the array is considered full when there are still 2 elements' worth of space.
I had the same question, actually. This is copied from ThreadPool:
https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Threading/ThreadPool.cs#L187-L188
but I think it could probably be changed there as well. No one else besides this thread is going to change _tailIndex, and even if a concurrent operation steals, that can only increase _headIndex (making more room), so having a single slot remaining (i.e. doing "tail <= (_headIndex + _mask)") should be sufficient.
Ah, no, I see why it's necessary. Let's say that _array.Length == 32, _headIndex == 0, and _mask == 31. Let's also say that the array is full, such that _tailIndex == 32. In a normal case, "_tailIndex <= _headIndex + mask" would return false (32 > 0 + 31) as we'd want. But let's say that just as we're doing this comparison, a stealing thread comes along and increments _headIndex. Then _tailIndex <= _headIndex + mask would return true (32 <= 1 + 31), and we'd end up writing to the element that's currently being stolen (the stealing thread increments the _headIndex before it reads the value). That's why we need two slots: one to write into, and one to account for the one that may currently be in the process of being stolen. I'll add a comment.
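The arithmetic in that walkthrough can be replayed directly (purely illustrative, using the same values as the comment above: a 32-element array, `_headIndex == 0`, `_mask == 31`, `_tailIndex == 32`):

```csharp
using System;

class TwoSlotDemo
{
    static void Main()
    {
        int mask = 31;
        int tailIndex = 32;  // array of length 32 is completely full

        int headIndex = 0;
        // One-slot check correctly rejects the full array (32 > 0 + 31).
        Console.WriteLine(tailIndex <= headIndex + mask);

        headIndex = 1;       // a concurrent steal increments the head mid-comparison
        // One-slot check would now admit a write (32 <= 1 + 31) into the very
        // slot the stealer is still reading.
        Console.WriteLine(tailIndex <= headIndex + mask);

        // The two-slot check still rejects (32 < 32 is false), forcing the
        // slow path: one slot to write into, one for the in-flight steal.
        Console.WriteLine(tailIndex < headIndex + mask);
    }
}
```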
I see, makes sense
```csharp
if (_currentOp != (int)ConcurrentBagListOperation.None)
{
    var spinner = new SpinWait();
    while (_currentOp != (int)ConcurrentBagListOperation.None)
```
If the owning thread is in a reasonable infinite loop of push/pop operations, is it possible for a different stealing thread to go into an unreasonable infinite loop here as a consequence? That is, does something need to be added here to prevent a potential infinite loop?
It's theoretically possible. This code hasn't changed, though, so that same issue exists in ConcurrentBag today. There are plenty of reasons why it's unlikely, e.g. the thread would have to constantly be pushing/popping, without touching the ends of the array that would force synchronization.

If we're really concerned about it, I can imagine a few possible mitigations, e.g. add a counter to ThreadLocalList that's incremented on any local push/pop, and force a synchronization every N operations, where N is something reasonably large so as not to penalize the common case. We could also tie this in with @benaadams's suggestion of looking at whether we should do two passes during a steal: the first one doing a TryEnter(..., 0) and the second one doing an Enter(...)... we could also say that the first pass fails not only if TryEnter(..., 0) returns false but also if _currentOp != None.

Again, though, this possible issue has existed since the type was introduced in .NET 4, and I've not heard of it being a problem, so I'd like to punt on it for this change.
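One hypothetical shape for the counter-based mitigation mentioned above (the field `_opsSinceSync`, the method name, and the threshold value are all invented for illustration, not part of the PR):

```csharp
using System;

class ThreadLocalListSketch
{
    public const int ForcedSyncInterval = 1 << 16; // "reasonably large" N; value is a guess

    private int _opsSinceSync; // touched only by the owning thread, so no synchronization needed

    // Returns true when a local push/pop should take the slow (locking) path
    // even though the fast-path conditions hold, guaranteeing that a spinning
    // stealer periodically gets a window where _currentOp is None.
    public bool ShouldForceSync()
    {
        if (++_opsSinceSync < ForcedSyncInterval) return false;
        _opsSinceSync = 0;
        return true;
    }
}

class Demo
{
    static void Main()
    {
        var list = new ThreadLocalListSketch();
        int forced = 0;
        for (int i = 0; i < 3 * ThreadLocalListSketch.ForcedSyncInterval; i++)
            if (list.ShouldForceSync()) forced++;
        Console.WriteLine(forced); // one forced synchronization per interval
    }
}
```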
That said, looking at this more, I'm not sure why the steal needs to do this loop at all. It makes sense for operations that need the bag to quiesce, e.g. ToArray. We may just be able to remove the loop entirely.
Ah, it appears to be necessary, but only for peeks. Peeks are special, in that they need to be able to read the value without taking it or preventing someone else from taking it, which means they can't mess with the head/tail pointers in a way that would prevent a steal from getting the item. As such, a peek needs to set this flag, so that if the peek contends with a steal, the steal waits for the peek to finish; otherwise, the peek can end up reading a default(T) or a torn T. I can change the condition here to only spin for peeks. It's not necessary to do so for pops or pushes, as the logic for handling contention with those is already built into the rest of the logic.
Makes sense as well, thanks
I'm actually going to just simplify TryLocalPeek to make it use a lock. The logic to get a lock-free peek correct is complicated, due to not being able to do the same kind of index reservation that's done in TryLocalPop (if we reserve an index for a peek, that could cause a steal to fail even if the item is available). And by making it use a lock, we can remove this spinning from TrySteal entirely, avoiding any concerns about stalling threads indefinitely. Peeks are rare on a ConcurrentBag, so the overhead of using a lock always instead of just on the "slow" path should not be impactful; if it surprises us and ends up being so, we can revisit.
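A rough sketch of what a lock-based local peek looks like, assuming a simplified list model (the real implementation works over the circular array and its head/tail indices, not a List<T>; this only illustrates the design point):

```csharp
using System;
using System.Collections.Generic;

class LocalListSketch<T>
{
    private readonly object _lock = new object();
    private readonly List<T> _items = new List<T>();

    public void LocalPush(T item) { lock (_lock) _items.Add(item); }

    // Peek always takes the lock: no index reservation is performed, so an
    // in-flight peek can never cause a concurrent steal to fail, and the
    // peeked value can never be a torn T or default(T).
    public bool TryLocalPeek(out T result)
    {
        lock (_lock)
        {
            if (_items.Count > 0)
            {
                result = _items[_items.Count - 1]; // newest local item, as at the LIFO end of a deque
                return true;
            }
        }
        result = default(T);
        return false;
    }
}

class Demo
{
    static void Main()
    {
        var list = new LocalListSketch<int>();
        list.LocalPush(42);
        int value;
        Console.WriteLine(list.TryLocalPeek(out value) + " " + value);
    }
}
```

Because peeks on a ConcurrentBag are rare, always paying for the lock here trades a little peek throughput for removing the spin loop from TrySteal entirely.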
```csharp
{
    // We contended with the local thread, so restore the head
    // and loop around to try again.
    _headIndex = head;
```
Since the head is incremented unconditionally, and similarly, the tail decremented unconditionally in TryLocalPop, it seems possible for a race between the two to cause the head index to temporarily be greater than the tail index. I didn't find a case where that becomes an issue, but I can't be 100% sure. Is there some obvious reason that I missed why this is never an issue? If so, I would find it helpful to have a comment about that.
> it seems possible for a race between the two to cause the head index to temporarily be greater than the tail index
This is part of why local pops synchronize when there are <= 2 elements in the list. I don't see any way this would be a problem. Note again that this is the same logic in ThreadPool.
Maybe a minor issue:
- There's one item in the bag
- Pop decrements the tail and reads the head
- Steal increments the head and reads the new tail
- At this point, say head == 1 and tail == 0
- Now steal fails and it would want to reset the head to 0, but the thread gets switched out before that
- Pop completes, then on the same thread a Push occurs successfully, setting tail to 1
- Push completes, and another Pop is done. This Pop now sees the list as empty because it sees head == tail == 1, but the list actually has one item. Not sure if this is allowed by the contract.
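The interleaving in the steps above can be replayed with plain integers (a deliberately simplified model of just the two indices, not the real type, to show how the item ends up hidden):

```csharp
using System;

class RaceReplay
{
    static void Main()
    {
        int head = 0, tail = 1; // one item in the bag

        tail--;                 // Pop decrements the tail (and reads head)
        head++;                 // Steal increments the head (and reads the new tail)
        Console.WriteLine($"head={head} tail={tail}"); // head temporarily exceeds tail

        // The steal fails and *would* restore head to 0, but the thread is
        // switched out before it can. Pop completes; a Push on the same
        // thread then sets the tail back to 1.
        tail++;

        // A second Pop now compares head (1) against tail (1) and sees the
        // list as empty, even though one item is actually present.
        Console.WriteLine(head == tail);
    }
}
```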
Wouldn't that failed Pop then Steal from itself?
TrySteal would also see the list as empty and fail fast
At the beginning of TryLocalPop, `_headIndex` is 0 and `_tailIndex` is 1. The condition for the fast path is equivalent to `_headIndex <= _tailIndex - 1`, so it looks to me like it would go into the fast path when there is one item.
Also since TryPeek calls TrySteal, which increments the head even for a peek, it could hide one item temporarily during a peek operation
> The condition for the fast path is equivalent to `_headIndex <= _tailIndex - 1`

Ah, you and I are looking at different versions of this: my local copy already has this changed to `_headIndex < tail`, which I did to account for stealing peeks. That would also address the issue you're talking about. I'll push my updates shortly.
> Also since TryPeek calls TrySteal
That's the stealing peeks issue I just mentioned :)
Ah ok, that works
```csharp
    {
        return true;
    }
    currentList = currentList._nextList;
}
```

```csharp
// verify versioning, if other items are added to this list since we last visit it, we should retry
```
If I understand correctly, this versioning seems to cache failed attempts to steal (presumably because those lists were empty) and keeps trying as long as a list from any thread got a version update (transitioned from empty to nonempty). Is there a benefit to doing this? Why not just fail after the first attempt at stealing and ditch versioning altogether?
I've been trying to come up with a scenario where the versioning helps prevent a "wrong" answer, i.e. where a TryTake is issued after an item has been added but where it's unable to steal that item and where versioning would avoid that... and I've not come up with one. This does seem superfluous. I'll remove it if I can't come up with a good reason for it.
DefineConstants was being overwritten, so DEBUG wasn't getting set in debug builds, which meant no asserts were enabled, DEBUG code wasn't compiled in, etc.
The primary purpose of this change is to remove allocations prevalent in the use of ConcurrentBag:
- Every Add allocates a Node object to be stored in a linked list.
- Every Take that needs to steal allocates a List (and the associated underlying array) to store version information for all lists.

This commit removes all of those allocations. The first allocation (per-Add Node) is addressed by changing from a linked-list-based scheme to an array-based scheme; the logic for the latter is taken from ThreadPool, which uses an array-based work-stealing implementation for the thread-local queues in the pool. The second allocation is avoided by simply removing the versioning, as it is unnecessary in the stealing process. These changes also have secondary effects that significantly improve throughput, e.g. better memory locality with an array than with a linked list, and less code needed for pushes/pops.
Clean up a bunch of the existing tests, and add a bunch of new ones. Line coverage now typically hovers between 98% and 100%, and outputs are much more strongly verified.
Thanks for the reviews, @kouvel. I've updated the PR to address the feedback, add some more tests, and address a few issues I discovered while adding the tests.

Thanks, @kouvel, @alexperovich, and @benaadams.
Improve ConcurrentBag performance

Commit migrated from dotnet/corefx@9f315d6
The primary purpose of this change is to remove allocations prevalent in the use of ConcurrentBag:
- Every Add allocates a Node object to be stored in a linked list.
- Every Take that needs to steal allocates a List (and the associated underlying array) to store version information for all lists.

This PR removes all of those allocations: the per-Add Node goes away by switching from a linked-list-based scheme to an array-based work-stealing scheme (the logic is taken from ThreadPool's thread-local queues), and the per-steal List goes away by removing the versioning, which is unnecessary in the stealing process.
In addition to the GC impact, these changes have a beneficial impact on throughput, due to better memory locality with an array than with a linked list, less code needed for pushes/pops, etc.
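The allocation difference can be illustrated in miniature (hypothetical types; the real implementation is the circular-array work-stealing deque borrowed from ThreadPool, not a simple growable array):

```csharp
using System;

class LinkedBagList<T>   // old scheme: every Add allocates a Node
{
    private sealed class Node { public T Value; public Node Next; }
    private Node _head;
    public void Add(T item) => _head = new Node { Value = item, Next = _head }; // 1 allocation per Add
}

class ArrayBagList<T>    // new scheme: amortized zero allocations per Add
{
    private T[] _array = new T[32];
    private int _tail;
    public void Add(T item)
    {
        if (_tail == _array.Length) Array.Resize(ref _array, _array.Length * 2); // rare growth only
        _array[_tail++] = item;
    }
}

class Demo
{
    static void Main()
    {
        var list = new ArrayBagList<int>();
        // 1000 adds cause only a handful of array growths, versus 1000 Node
        // allocations under the linked-list scheme.
        for (int i = 0; i < 1000; i++) list.Add(i);
        Console.WriteLine("done");
    }
}
```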
In a test that adds 10M items from a single thread:
Before (elapsed time and gen0 GCs):
After:
In a test that has 4 threads each adding 2 items then taking 2 items (no steals), repeatedly, 5M times:
Before:
After:
In a test that has one thread adding 10M items and another thread taking until all 10M are stolen (effectively the worst case for ConcurrentBag):
Before:
After:
Prior to this change, devs were often hesitant to use ConcurrentBag as an object pool for small objects, as each added object involved its own small Node allocation. After this change, there shouldn't be any such objections.
cc: @kouvel, @alexperovich, @benaadams
Fixes #14090
Fixes https://github.com/dotnet/corefx/issues/7917