Improve ConcurrentBag performance #14126
```csharp
{
    if (currentList._head != null)
        // at least this list is not empty, we return false
        FreezeBag(ref lockTaken);
```
Can this fail to take the lock?
Only in extreme cases, e.g. OOM while JIT'ing one of the helper methods. It's primarily a holdover from running on desktop where thread aborts are possible and the more common reason for failure.
Should we do something when this fails? It doesn't look like we bail out if lockTaken is false.
It'll only be false if an exception is being thrown; e.g. an assert that it's true after this call would always pass.
Is it worth doing a double pass, e.g. a first pass with TryEnter(..., 0)?
It would need to be measured, though my guess is that it originally did that, it was measured and shown to be bad or not worthwhile, and the code was updated to remove it but the comment got left behind as stale (I think that's more likely than someone just adding the comment and not implementing what the comment said). Regardless, it'd be a fine thing to experiment with, but separately from this change. Such a change would very likely improve some situations and degrade others, so we'd need to make sure we were measuring the right scenarios.
TryEnter used to carry a lot of cost; that has since changed.
Makes sense. Very happy with this change; ConcurrentBag is in theory a much better pooling structure, though in practice ConcurrentQueue works better. It will be nice to bring back its advantages by cutting its costs 😄
LGTM - very nice. @stephentoub, could you add your perf tests as a gist so they can be reused for future perf investigations?
Adding items from a single thread:

```csharp
using System;
using System.Diagnostics;
using System.Collections.Concurrent;

class Test
{
    public static void Main()
    {
        var sw = new Stopwatch();
        while (true)
        {
            var bag = new ConcurrentBag<int>();
            int gen0 = GC.CollectionCount(0);
            sw.Restart();
            for (int i = 0; i < 10000000; i++) bag.Add(i);
            sw.Stop();
            Console.WriteLine(sw.Elapsed + " " + (GC.CollectionCount(0) - gen0));
        }
    }
}
```

Adding and taking items:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Linq;

class Test
{
    public static void Main()
    {
        var sw = new Stopwatch();
        var bag = new ConcurrentBag<int>();
        while (true)
        {
            int gen0 = GC.CollectionCount(0);
            sw.Restart();
            Task.WaitAll(Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(() =>
            {
                for (int i = 0; i < 5000000; i++)
                {
                    bag.Add(i);
                    bag.Add(i);
                    int item;
                    bag.TryTake(out item);
                    bag.TryTake(out item);
                }
            })).ToArray());
            sw.Stop();
            Console.WriteLine(sw.Elapsed + " " + (GC.CollectionCount(0) - gen0));
        }
    }
}
```

Adding and stealing items:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;

class Test
{
    public static void Main()
    {
        var sw = new Stopwatch();
        var bag = new ConcurrentBag<int>();
        while (true)
        {
            int gen0 = GC.CollectionCount(0);
            sw.Restart();
            Task.WaitAll(
                Task.Run(() =>
                {
                    for (int i = 0; i < 10000000; i++) bag.Add(i);
                }),
                Task.Run(() =>
                {
                    int count = 0;
                    while (count < 10000000)
                    {
                        int item;
                        if (bag.TryTake(out item)) count++;
                    }
                }));
            sw.Stop();
            Console.WriteLine(sw.Elapsed + " " + (GC.CollectionCount(0) - gen0));
        }
    }
}
```
True... probably worth another (separate) look given improvements that have been made there.
Rest looks good to me
```csharp
_head._prev = node;
_head = node;
// Full fence to ensure subsequent reads don't get reordered before this
Interlocked.Exchange(ref _currentOp, (int)ConcurrentBagListOperation.Add);
```
It looks like these ordering restrictions are critical to prevent some races with TrySteal. It would be helpful to describe just one case that demonstrates that this memory barrier between the write and next read is necessary to avoid a race (and likewise for other similar barriers).
I believe the concern is that, with just the acquire/release semantics from the volatile reads/writes, it's theoretically possible for the subsequent reads of _tailIndex and _headIndex to move up before the write to _currentOp. If that were to happen, a steal could end up mucking with _headIndex without waiting for this in-flight local push to quiesce, even if _headIndex and _tailIndex were close enough together that the push and steal should have synchronized with each other. I'll add a comment.
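A minimal sketch of the pattern being described, assuming a simplified model of the list (the field names mirror the PR's, but this is illustrative, not the actual implementation): the write to `_currentOp` must become globally visible before the subsequent reads of `_headIndex`/`_tailIndex`, which a plain or volatile write alone does not guarantee.

```csharp
using System;
using System.Threading;

class WorkStealingListSketch
{
    private int _currentOp;                    // 0 == None, 1 == Add (mirrors ConcurrentBagListOperation)
    private volatile int _headIndex, _tailIndex;

    public int BeginAdd()
    {
        // A volatile write has only release semantics: the reads below could
        // otherwise be reordered before it. Interlocked.Exchange is a full
        // fence, so _headIndex/_tailIndex are guaranteed to be read only after
        // the in-flight Add has been published to any concurrent stealer.
        Interlocked.Exchange(ref _currentOp, 1 /* Add */);

        int head = _headIndex, tail = _tailIndex; // safe to read only after the fence
        // ... actual push logic would go here ...

        _currentOp = 0;                        // reset to None; release semantics suffice
        return tail - head;                    // current count, just to use the reads
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(new WorkStealingListSketch().BeginAdd());
    }
}
```

With only acquire/release semantics, a steal could observe the old `_currentOp` (None) and the old indices simultaneously, and proceed without waiting for the push to quiesce.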
```csharp
// We're going to increment the tail; if we'll overflow, then we need to reset our counts
if (tail == int.MaxValue)
{
    lock (this)
```
Does `_currentOp` need to be reset to None here too, to avoid a deadlock with TrySteal, and set again after releasing the lock?
It definitely does. I'll fix. Good catch.
```csharp
// When there are at least 2 elements' worth of space (and there aren't pending operations that
// would force us to synchronize), we can take the fast path without locking.
if (!needsSync && tail < (_headIndex + _mask))
```
Why is it necessary for there to be 2 elements' worth of space? I also don't understand why inside the lock, the array is considered full when there are still 2 elements' worth of space.
I had the same question, actually. This is copied from ThreadPool:
https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Threading/ThreadPool.cs#L187-L188
but I think it could probably be changed there as well. No one else besides this thread is going to change _tailIndex, and even if a concurrent operation steals, that can only increase _headIndex (making more room), so having a single slot remaining (i.e. doing "tail <= (_headIndex + _mask)") should be sufficient.
Ah, no, I see why it's necessary. Let's say that _array.Length == 32, _headIndex == 0, and _mask == 31. Let's also say that the array is full, such that _tailIndex == 32. In a normal case, "_tailIndex <= _headIndex + mask" would return false (32 > 0 + 31) as we'd want. But let's say that just as we're doing this comparison, a stealing thread comes along and increments _headIndex. Then _tailIndex <= _headIndex + mask would return true (32 <= 1 + 31), and we'd end up writing to the element that's currently being stolen (the stealing thread increments the _headIndex before it reads the value). That's why we need two slots: one to write into, and one to account for the one that may currently be in the process of being stolen. I'll add a comment.
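The arithmetic in that walkthrough can be replayed directly (purely illustrative, using the same values as the comment above: a 32-element array, `_headIndex == 0`, `_mask == 31`, `_tailIndex == 32`):

```csharp
using System;

class TwoSlotDemo
{
    static void Main()
    {
        int mask = 31;
        int tailIndex = 32;  // array of length 32 is completely full

        int headIndex = 0;
        // One-slot check correctly rejects the full array (32 > 0 + 31).
        Console.WriteLine(tailIndex <= headIndex + mask);

        headIndex = 1;       // a concurrent steal increments the head mid-comparison
        // One-slot check would now admit a write (32 <= 1 + 31) into the very
        // slot the stealer is still reading.
        Console.WriteLine(tailIndex <= headIndex + mask);

        // The two-slot check still rejects (32 < 32 is false), forcing the
        // slow path: one slot to write into, one for the in-flight steal.
        Console.WriteLine(tailIndex < headIndex + mask);
    }
}
```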
I see, makes sense
```csharp
if (_currentOp != (int)ConcurrentBagListOperation.None)
{
    var spinner = new SpinWait();
    while (_currentOp != (int)ConcurrentBagListOperation.None)
```
If the owning thread is in a reasonable infinite loop of push/pop operations, is it possible for a different stealing thread to go into an unreasonable infinite loop here as a consequence? That is, does something need to be added here to prevent a potential infinite loop?
It's theoretically possible. This code hasn't changed, though, so that same issue exists in ConcurrentBag today. There are plenty of reasons why it's unlikely, e.g. the thread would have to constantly be pushing/popping, without touching the ends of the array that would force synchronization.

If we're really concerned about it, I can imagine a few possible mitigations, e.g. add a counter to ThreadLocalList that's incremented on any local push/pop, and force a synchronization every N operations, where N is something reasonably large so as not to penalize the common case. We could also tie this in with @benaadams's suggestion of looking at whether we should do two passes during a steal: the first one doing a TryEnter(..., 0) and the second one doing an Enter(...)... we could also say that the first pass fails not only if TryEnter(..., 0) returns false but also if _currentOp != None.

Again, though, this possible issue has existed since the type was introduced in .NET 4, and I've not heard of it being a problem, so I'd like to punt on it for this change.
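One hypothetical shape for the counter-based mitigation mentioned above (the field `_opsSinceSync`, the method name, and the threshold value are all invented for illustration, not part of the PR):

```csharp
using System;

class ThreadLocalListSketch
{
    public const int ForcedSyncInterval = 1 << 16; // "reasonably large" N; value is a guess

    private int _opsSinceSync; // touched only by the owning thread, so no synchronization needed

    // Returns true when a local push/pop should take the slow (locking) path
    // even though the fast-path conditions hold, guaranteeing that a spinning
    // stealer periodically gets a window where _currentOp is None.
    public bool ShouldForceSync()
    {
        if (++_opsSinceSync < ForcedSyncInterval) return false;
        _opsSinceSync = 0;
        return true;
    }
}

class Demo
{
    static void Main()
    {
        var list = new ThreadLocalListSketch();
        int forced = 0;
        for (int i = 0; i < 3 * ThreadLocalListSketch.ForcedSyncInterval; i++)
            if (list.ShouldForceSync()) forced++;
        Console.WriteLine(forced); // one forced synchronization per interval
    }
}
```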
That said, looking at this more, I'm not sure why the steal needs to do this loop at all. It makes sense for operations that need the bag to quiesce, e.g. ToArray. We may just be able to remove the loop entirely.
Ah, it appears to be necessary, but only for peeks. Peeks are special, in that they need to be able to read the value without taking it or preventing someone else from taking it, which means they can't mess with the head/tail pointers in a way that would prevent a steal from getting the item. As such, a peek needs to set this flag, so that if the peek contends with a steal, the steal waits for the peek to finish; otherwise, the peek can end up reading a default(T) or a torn T. I can change the condition here to only spin for peeks. It's not necessary to do so for pops or pushes, as the logic for handling contention with those is already built into the rest of the logic.
Makes sense as well, thanks
I'm actually going to just simplify TryLocalPeek to make it use a lock. The logic to get a lock-free peek correct is complicated, due to not being able to do the same kind of index reservation that's done in TryLocalPop (if we reserve an index for a peek, that could cause a steal to fail even if the item is available). And by making it use a lock, we can remove this spinning from TrySteal entirely, avoiding any concerns about stalling threads indefinitely. Peeks are rare on a ConcurrentBag, so the overhead of using a lock always instead of just on the "slow" path should not be impactful; if it surprises us and ends up being so, we can revisit.
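A rough sketch of what a lock-based local peek looks like, assuming a simplified list model (the real implementation works over the circular array and its head/tail indices, not a List<T>; this only illustrates the design point):

```csharp
using System;
using System.Collections.Generic;

class LocalListSketch<T>
{
    private readonly object _lock = new object();
    private readonly List<T> _items = new List<T>();

    public void LocalPush(T item) { lock (_lock) _items.Add(item); }

    // Peek always takes the lock: no index reservation is performed, so an
    // in-flight peek can never cause a concurrent steal to fail, and the
    // peeked value can never be a torn T or default(T).
    public bool TryLocalPeek(out T result)
    {
        lock (_lock)
        {
            if (_items.Count > 0)
            {
                result = _items[_items.Count - 1]; // newest local item, as at the LIFO end of a deque
                return true;
            }
        }
        result = default(T);
        return false;
    }
}

class Demo
{
    static void Main()
    {
        var list = new LocalListSketch<int>();
        list.LocalPush(42);
        int value;
        Console.WriteLine(list.TryLocalPeek(out value) + " " + value);
    }
}
```

Because peeks on a ConcurrentBag are rare, always paying for the lock here trades a little peek throughput for removing the spin loop from TrySteal entirely.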
```csharp
{
    // We contended with the local thread, so restore the head
    // and loop around to try again.
    _headIndex = head;
```
Since the head is incremented unconditionally, and similarly, the tail decremented unconditionally in TryLocalPop, it seems possible for a race between the two to cause the head index to temporarily be greater than the tail index. I didn't find a case where that becomes an issue, but I can't be 100% sure. Is there some obvious reason that I missed why this is never an issue? If so, I would find it helpful to have a comment about that.
> it seems possible for a race between the two to cause the head index to temporarily be greater than the tail index
This is part of why local pops synchronize when there are <= 2 elements in the list. I don't see any way this would be a problem. Note again that this is the same logic in ThreadPool.
Maybe a minor issue:
- There's one item in the bag
- Pop decrements the tail and reads the head
- Steal increments the head and reads the new tail
- At this point, say head == 1 and tail == 0
- Now steal fails and it would want to reset the head to 0, but the thread gets switched out before that
- Pop completes, then on the same thread a Push occurs successfully, setting tail to 1
- Push completes, and another Pop is done. This Pop now sees the list as empty because it sees head == tail == 1, but the list actually has one item. Not sure if this is allowed by the contract.
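The interleaving in the steps above can be replayed with plain integers (a deliberately simplified model of just the two indices, not the real type, to show how the item ends up hidden):

```csharp
using System;

class RaceReplay
{
    static void Main()
    {
        int head = 0, tail = 1; // one item in the bag

        tail--;                 // Pop decrements the tail (and reads head)
        head++;                 // Steal increments the head (and reads the new tail)
        Console.WriteLine($"head={head} tail={tail}"); // head temporarily exceeds tail

        // The steal fails and *would* restore head to 0, but the thread is
        // switched out before it can. Pop completes; a Push on the same
        // thread then sets the tail back to 1.
        tail++;

        // A second Pop now compares head (1) against tail (1) and sees the
        // list as empty, even though one item is actually present.
        Console.WriteLine(head == tail);
    }
}
```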
Wouldn't that failed Pop then Steal from itself?
TrySteal would also see the list as empty and fail fast
At the beginning of TryLocalPop, `_headIndex` is 0 and `_tailIndex` is 1. The condition for the fast path is equivalent to `_headIndex <= _tailIndex - 1`, so it looks to me like it would go into the fast path when there is one item.
Also since TryPeek calls TrySteal, which increments the head even for a peek, it could hide one item temporarily during a peek operation
> The condition for the fast path is equivalent to `_headIndex <= _tailIndex - 1`

Ah, you and I are looking at different versions of this: my local copy already has this changed to `_headIndex < tail`, which I did to account for stealing peeks. That would also address the issue you're talking about. I'll push my updates shortly.
> Also since TryPeek calls TrySteal
That's the stealing peeks issue I just mentioned :)
Ah ok, that works
```csharp
    {
        return true;
    }
    currentList = currentList._nextList;
}
```

```csharp
// verify versioning, if other items are added to this list since we last visit it, we should retry
```
If I understand correctly, this versioning seems to cache failed attempts to steal (presumably because those lists were empty) and keeps trying as long as a list from any thread got a version update (transitioned from empty to nonempty). Is there a benefit to doing this? Why not just fail after the first attempt at stealing and ditch versioning altogether?
I've been trying to come up with a scenario where the versioning helps prevent a "wrong" answer, i.e. where a TryTake is issued after an item has been added but where it's unable to steal that item and where versioning would avoid that... and I've not come up with one. This does seem superfluous. I'll remove it if I can't come up with a good reason for it.
DefineConstants was being overwritten, so DEBUG wasn't getting set in debug builds, which meant no asserts were enabled, DEBUG code wasn't compiled in, etc.
The primary purpose of this change is to remove allocations prevalent in the use of ConcurrentBag:
- Every Add allocates a Node object to be stored in a linked list.
- Every Take that needs to steal allocates a List (and the associated underlying array) to store version information for all lists.

This commit removes all of those allocations. The first allocation (per-Add Node) is addressed by changing from a linked-list-based scheme to an array-based scheme; the logic for the latter is taken from ThreadPool, which uses an array-based work-stealing implementation for the thread-local queues in the pool. The second allocation is avoided by simply removing the versioning, as it is unnecessary in the stealing process. These changes also have secondary effects that significantly improve throughput, e.g. better memory locality with an array than with a linked list, and less code needed for pushes/pops.
Clean up a bunch of the existing tests, and add a bunch of new ones. Line coverage now typically hovers between 98% and 100%, and outputs are much more strongly verified.
Thanks for the reviews, @kouvel. I've updated the PR to address the feedback, add some more tests, and address a few issues I discovered while adding the tests.

Thanks, @kouvel, @alexperovich, and @benaadams.
Improve ConcurrentBag performance

Commit migrated from dotnet/corefx@9f315d6
The primary purpose of this change is to remove allocations prevalent in the use of ConcurrentBag:
- Every Add allocates a Node object to be stored in a linked list.
- Every Take that needs to steal allocates a List (and the associated underlying array) to store version information for all lists.

This PR removes all of those allocations: the per-Add Node goes away by switching from a linked-list-based scheme to an array-based work-stealing scheme (the logic is taken from ThreadPool's thread-local queues), and the per-steal List goes away by removing the versioning, which is unnecessary in the stealing process.
In addition to the GC impact, these changes have a beneficial impact on throughput, due to better memory locality with an array than with a linked list, less code needed for pushes/pops, etc.
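The allocation difference can be illustrated in miniature (hypothetical types; the real implementation is the circular-array work-stealing deque borrowed from ThreadPool, not a simple growable array):

```csharp
using System;

class LinkedBagList<T>   // old scheme: every Add allocates a Node
{
    private sealed class Node { public T Value; public Node Next; }
    private Node _head;
    public void Add(T item) => _head = new Node { Value = item, Next = _head }; // 1 allocation per Add
}

class ArrayBagList<T>    // new scheme: amortized zero allocations per Add
{
    private T[] _array = new T[32];
    private int _tail;
    public void Add(T item)
    {
        if (_tail == _array.Length) Array.Resize(ref _array, _array.Length * 2); // rare growth only
        _array[_tail++] = item;
    }
}

class Demo
{
    static void Main()
    {
        var list = new ArrayBagList<int>();
        // 1000 adds cause only a handful of array growths, versus 1000 Node
        // allocations under the linked-list scheme.
        for (int i = 0; i < 1000; i++) list.Add(i);
        Console.WriteLine("done");
    }
}
```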
In a test that adds 10M items from a single thread:
Before (elapsed time and gen0 GCs):
After:
In a test that has 4 threads each adding 2 items then taking 2 items (no steals), repeatedly, 5M times:
Before:
After:
In a test that has one thread adding 10M items and another thread taking until all 10M are stolen (effectively the worst case for ConcurrentBag):
Before:
After:
Prior to this change, devs were often hesitant to use ConcurrentBag as an object pool for small objects, as each added object involved its own small Node allocation. After this change, there shouldn't be any such objections.
cc: @kouvel, @alexperovich, @benaadams
Fixes #14090
Fixes https://github.com/dotnet/corefx/issues/7917