Some ThreadPool performance and maintainability improvements #9234
Conversation
Nice!
Thanks. I'm going to run a few more rounds of tests tomorrow before merging.
I have a ThreadPool testing matrix if that's of any use: https://github.com/benaadams/ThreadPoolTaskTesting
This sort of change makes it much harder to integrate CoreFX sources into Mono/Xamarin, as it's a breaking change for Mono/Xamarin. We'll basically have to undo all your TAE logic removal and put it back. It'd be much more useful to keep it there (it's not really causing any harm), or at least move it to another partial class if you really want the TAE handling extracted. cc @karelz
Thanks for the heads up. I don't want to make it a lot harder on Mono, but which logic in particular are you referring to? The catch block, just so that TAEs don't escape? Or the empty trys with code in finallys? Does Mono care about that level of reliability in the face of TAEs? And if so, what's the plan for handling it in other places in coreclr/corefx where code has been written/updated that's not safe in the presence of thread aborts? Or the interface change? That one doesn't really have a perf benefit right now (other than smaller dispatch tables), but I expect that it could with additional work. Going from an interface with two methods to an interface with one opens up other possibilities for how work items are represented.
It is. Thanks, @benaadams.
BTW: the majority of the full framework does not have hardening against thread aborts either. The thread abort hardening was only really done and tested for the fraction supported by SQL CLR in .NET Framework 2.0. A lot of newer code does not have it, or it just defensively fails fast when state gets corrupted by an asynchronous exception (e.g. https://github.com/dotnet/coreclr/issues/8873).
I would do ifdefs for these. I think we will want to have some ifdefs for MONO to make the sharing of code easier. This may be one of the cases. |
It's not exposed from CoreLib, but a) as long as the code is here it's good to have it in sync, and b) we can use it in ThreadPool rather than having a separate concurrent queue implementation (which is very similar to the old ConcurrentQueue design).
Use the same ConcurrentQueue<T> now being used in corefx in ThreadPool instead of ThreadPool's custom queue implementation (which was close in design to the old ConcurrentQueue<T> implementation).
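As a rough sketch of what that swap looks like (the type and member names here are hypothetical; only the single-method `IThreadPoolWorkItem` shape mirrors the interface change discussed in this thread):

```csharp
using System.Collections.Concurrent;

// Hypothetical single-method work item interface, as discussed above
// (the real internal interface lives in CoreLib).
internal interface IThreadPoolWorkItem
{
    void ExecuteWorkItem();
}

// Sketch of a global queue backed by the shared ConcurrentQueue<T>
// instead of a bespoke lock-free queue implementation.
internal sealed class GlobalWorkQueue
{
    private readonly ConcurrentQueue<IThreadPoolWorkItem> _workItems =
        new ConcurrentQueue<IThreadPoolWorkItem>();

    public void Enqueue(IThreadPoolWorkItem item) => _workItems.Enqueue(item);

    public bool TryDequeue(out IThreadPoolWorkItem item) => _workItems.TryDequeue(out item);
}
```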
I am no expert on your TP implementation, but all the MarkAborted logic removal looks to me like functionality removal, so there is now an even smaller chance that TAE will work correctly with threads handled via the TP. I am not sure MONO is the best define for this, as I'd prefer to use it for Mono-specific behaviour, but any FEATURE_xxx-like define is OK by me.
On a 32-core box, I get these results (from the relevant test from @benaadams's suite):

Before: (results table not captured)

After: (results table not captured)

Out of the 50 measurements, 47 are improved, most by a lot. The remaining three (TP + Depth512 + 2x, TP + Depth512 + 16x, NoTP + Depth512 + 2x) regressed; I'm not sure why.
- Use the newly updated SpinLock.TryEnter(ref bool) instead of TryEnter(int, ref bool), as the former is ~3x faster when the lock is already held.
- Change the method to avoid the out reference argument and instead just return the value.
- Simplify the logic to consolidate the setting of missedSteal.
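For context, a hedged usage sketch of the overload in question (this is not the PR's actual code): `SpinLock.TryEnter(ref bool)` returns immediately when the lock is contended instead of spinning toward a timeout, which is why it is much cheaper when the lock is already held.

```csharp
using System;
using System.Threading;

// Usage sketch, not the PR's code: fail fast on contention rather than
// spinning, and let the caller record a "missed steal" and move on.
public static class LockSketch
{
    private static SpinLock s_lock = new SpinLock(enableThreadOwnerTracking: false);

    public static bool TryDoWork(Action work)
    {
        bool lockTaken = false;
        try
        {
            s_lock.TryEnter(ref lockTaken); // no timeout argument: returns immediately if held
            if (!lockTaken)
                return false; // caller can note the missed attempt and retry later
            work();
            return true;
        }
        finally
        {
            if (lockTaken)
                s_lock.Exit(useMemoryBarrier: false);
        }
    }
}
```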
Let these types be marked beforefieldinit
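Context for this commit message: a type gets the beforefieldinit flag only when it has no explicit static constructor, which lets the JIT relax exactly when static initialization runs and elide init checks. A minimal illustration:

```csharp
// A type with an explicit static constructor loses beforefieldinit:
// the runtime must run the cctor precisely at first access, so the JIT
// emits initialization checks around static member accesses.
static class WithCctor
{
    static WithCctor() { } // forces precise-timing semantics
    public static readonly int Value = 42;
}

// Without an explicit static constructor the type is beforefieldinit:
// the (compiler-generated) initializer may run any time before the first
// static field access, and the JIT can hoist or elide the init checks.
static class BeforeFieldInit
{
    public static readonly int Value = 42;
}
```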
- Consolidate branches
- Pass resulting work item as a return value rather than an out
- Avoid an unnecessary write to the bool passed by-ref
Assume that is dual proc? Might be hitting massive false sharing at the high throughput (since they are mostly empty actions), with 2x threads queuing batches of 512 items in a tight loop and the other threads trying to dequeue. What if you change the queue to:

```csharp
[StructLayout(LayoutKind.Sequential, Size = 64)]
struct WorkItem
{
    IThreadPoolWorkItem item;
}

internal readonly ConcurrentQueue<WorkItem> workItems = new ConcurrentQueue<WorkItem>();
```

Size of segment is less important since they are now reused?
- Remove some explicit ctors
- Remove some unnecessary casts
- Mark some fields readonly
- Follow style guidelines for visibility/static ordering in signatures
- Move usings to top of file
- Delete some stale comments
- Add names for bool args at call sites
- Use expression-bodied members in a few places
- Pass lower bounds to Array.Copy
- Remove unnecessary "success" local in QueueUserworkItemHelper
I've removed the changes related to thread aborts for now, as I don't want those to hold this up. We can revisit subsequently.
Yes, two sockets/numa nodes, each with 8 physical cores, each hyperthreaded.
It's possible. All producers will need to synchronize on the same field, and all consumers on a separate one, but it's possible if there are few enough items in the queue that this is introducing sharing between producers and consumers.
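To illustrate the separate-fields point (the names and offsets below are my own, loosely modeled on the kind of internal padding a queue segment can use; they are not taken from this PR), head and tail indices can be placed on different cache lines so producers and consumers don't false-share:

```csharp
using System.Runtime.InteropServices;

// Illustrative sketch: pad the consumer-side index (Head) and the
// producer-side index (Tail) onto different 64-byte cache lines so that
// dequeuers and enqueuers do not false-share. Offsets assume 64-byte lines,
// with leading and trailing padding to isolate the struct from neighbors.
[StructLayout(LayoutKind.Explicit, Size = 192)]
public struct PaddedHeadAndTail
{
    [FieldOffset(64)] public int Head;  // all consumers synchronize here
    [FieldOffset(128)] public int Tail; // all producers synchronize here
}
```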
Maybe, but we're talking about at least an 8x increase in memory consumption. While that may help with false sharing, it'll also potentially cause more paging, and without modifying/diverging ConcurrentQueue to itself contain the padding on each element, it'll result in 8x more data movement in and out. Regardless, I'll try it and see what kind of impact it has. Though I only mentioned the 3 regressions for completeness; I think the significant wins in the other 47 cases more than make up for it. It'd be great to see this be a pure win, though.
@benaadams would it make sense to integrate your suite as perf tests in the CoreFX repo?
@benaadams, I tried changing the
```csharp
// threads.

EnsureVMInitialized();

throw new ArgumentNullException(nameof(WaitCallback));
```
This should probably be nameof(callBack)
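A quick illustration of the distinction (a sketch, not the actual ThreadPool source): `nameof` on the delegate type yields the type's name, while `nameof` on the parameter yields the parameter's name, which is what ArgumentNullException should report.

```csharp
using System;
using System.Threading;

class Example
{
    // The bug: nameof(WaitCallback) resolves to the *type* name
    // "WaitCallback", not the name of the parameter that was null.
    public static void QueueUserWorkItem(WaitCallback callBack)
    {
        if (callBack == null)
        {
            // Wrong: nameof(WaitCallback) evaluates to "WaitCallback".
            // Right: nameof(callBack) evaluates to "callBack".
            throw new ArgumentNullException(nameof(callBack));
        }
        callBack(null);
    }
}
```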
Certainly looks like it. This is what was there before, but I can change it. Need to check whether any existing CoreFX tests are verifying the name.
@karelz, would be happy for them to be. I don't understand how perf tests work in CoreFX (how they run, whether they are tracked, published/monitored, etc.), but would be happy for someone to use/adapt them. @stephentoub will have a dig. The "deep" tests are more like a Task spawning subtasks; I added an API suggestion to reduce contention on the global queue for these: https://github.com/dotnet/corefx/issues/12442. Might be worth running a Kestrel plaintext benchmark, as it heavily uses the thread pool? Don't think it will experience the regression, as it goes wide rather than deep. /cc @CesarBS @halter73
@benaadams, we're just reviving the CoreFX perf tests story; @danmosemsft is driving that, and he should know how/when we are ready to onboard more tests. The key value I see in having perf tests right now is a uniform way to create them and pass them along / ask others to run them.
@DrewScoggins is getting the perf tests up again. |
Yes. I have been working on this for a bit, and had to do a bit of work to make things work in the new dev eng world, but it looks like I have got things where they need to be. I am doing some final local testing, and will then make some PRs to get this up and running.
- Fix ArgumentNullException parameter name
```csharp
using System.Diagnostics;
using System.Diagnostics.Contracts;
using System.Diagnostics.CodeAnalysis;
using System.Diagnostics.Tracing;
```
nit: sort
Thanks, @CesarBS. As I just moved them from where they were previously, I'll merge this as-is and sort them in a subsequent change. I have some other things to explore in improving throughput / reducing overheads.
@benaadams, @CesarBS, @vancem, with a 4-core server in Azure, I ran a bunch of pre- and post- tests on plaintext, and this change improved requests/sec on average by a little more than 5%. |
Some ThreadPool performance and maintainability improvements Commit migrated from dotnet/coreclr@353bd0b
Two main changes here:
In a microbenchmark that just queues a bunch of work items from one (non-ThreadPool) thread and waits for them all to complete (each work item just signals a CountdownEvent), throughput improved by ~2x.
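The microbenchmark described is roughly the following (a sketch; the harness shape and item count are my assumptions, not the actual test code):

```csharp
using System.Diagnostics;
using System.Threading;

// Sketch of the microbenchmark described above: a single non-ThreadPool
// thread queues n work items, each of which just signals a CountdownEvent,
// and we time how long it takes for all of them to complete.
public static class QueueThroughputBenchmark
{
    public static double Run(int n)
    {
        using (var done = new CountdownEvent(n))
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++)
            {
                // Each work item does no real work; this isolates the cost
                // of the global queue's enqueue/dequeue path.
                ThreadPool.QueueUserWorkItem(state => ((CountdownEvent)state).Signal(), done);
            }
            done.Wait(); // all items executed
            sw.Stop();
            return n / sw.Elapsed.TotalSeconds; // items per second
        }
    }
}
```

Calling `QueueThroughputBenchmark.Run(1_000_000)` before and after the change gives the throughput figure being compared.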
In a microbenchmark that has Environment.ProcessorCount non-ThreadPool threads all queueing, throughput improved by ~50%. (Note that this change does not affect the ThreadPool's local queues, only the global.) For me, ProcessorCount is 8, as I'm on a quad-core hyperthreaded machine.
There's also a small tweak to take advantage of #9224.
cc: @jkotas, @benaadams, @vancem, @kouvel