
Conversation

stephentoub
Member

@stephentoub stephentoub commented Jan 31, 2017

Two main changes here:

  • There's a fair amount of cruft in the ThreadPool related to thread aborts. As they don't exist on coreclr, we can clean all of that up. This includes removing one of the two methods from the IThreadPoolWorkItem type, as it was entirely about thread aborts.
  • In Rewrite ConcurrentQueue<T> for better performance corefx#14254, ConcurrentQueue in CoreFX was rewritten for better performance. The ThreadPool's global queue used an algorithm very much like that of the old ConcurrentQueue. This change ports corefx's new ConcurrentQueue back to CoreCLR, and then uses it from the ThreadPool.

In a microbenchmark that just queues a bunch of work items from one (non-ThreadPool) thread and waits for them all to complete (each work item just signals a CountdownEvent), throughput improved by ~2x.

In a microbenchmark that has Environment.ProcessorCount non-ThreadPool threads all queueing, throughput improved by ~50%. (Note that this change does not affect the ThreadPool's local queues, only the global.) For me, ProcessorCount is 8, as I'm on a quad-core hyperthreaded machine.
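
For reference, a minimal sketch of the kind of single-producer microbenchmark described above (the actual harness isn't included in this PR; the item count and type names here are illustrative):

using System;
using System.Diagnostics;
using System.Threading;

class QueueThroughputBench
{
    static void Main()
    {
        const int Items = 10000000;                 // illustrative count
        var done = new CountdownEvent(Items);
        var sw = Stopwatch.StartNew();

        // A single non-ThreadPool thread queues all the work items;
        // each work item just signals the CountdownEvent.
        for (int i = 0; i < Items; i++)
        {
            ThreadPool.QueueUserWorkItem(state => ((CountdownEvent)state).Signal(), done);
        }

        done.Wait();
        sw.Stop();
        Console.WriteLine($"{Items / sw.Elapsed.TotalSeconds:N0} items/sec");
    }
}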

There's also a small tweak to take advantage of #9224.

cc: @jkotas, @benaadams, @vancem, @kouvel

Member

@jkotas jkotas left a comment


Nice!

@stephentoub
Member Author

Thanks. I'm going to run a few more rounds of tests tomorrow before merging.

@benaadams
Member

I have a Threadpool testing matrix if that's any use? https://github.com/benaadams/ThreadPoolTaskTesting

@marek-safar

This sort of change makes it much harder to integrate CoreFX sources into Mono/Xamarin, as it's a breaking change for Mono/Xamarin. We'll basically have to undo all your TAE logic removal and put it back. It'd be much more useful to keep it there (it's not really causing any harm), or at least move it to another partial class if you really want to have the TAE handling extracted.

cc @karelz

@stephentoub
Member Author

This sort of change makes it much harder to integrate CoreFX sources into Mono/Xamarin, as it's a breaking change for Mono/Xamarin. We'll basically have to undo all your TAE logic removal and put it back.

Thanks for the heads up. I don't want to make it a lot harder on Mono, but which logic in particular are you referring to?

The catch block, just so that TAEs don't escape?

Or the empty trys with code in finallys? Does Mono care about that level of reliability in the face of TAEs? And if so, what's the plan for handling it in other places in coreclr/corefx where code has been written/updated that's not safe in the presence of thread aborts?

Or the interface change? That one doesn't really have a perf benefit right now (other than smaller dispatch tables), but I expect that it could with additional work. Going from an interface with two methods to an interface with one opens up other possibilities for how work items are represented.
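
For context, a rough sketch of what the interface in question looked like before this PR (member names are from memory and may not match the tree exactly; MarkAborted is the abort-related method mentioned later in this thread):

using System.Threading;

// Approximate shape of the interface before this PR; the abort-only member is
// the one being removed, leaving a single-method interface.
internal interface IThreadPoolWorkItem
{
    void ExecuteWorkItem();

    // Used only for thread-abort bookkeeping; removing it is the change under discussion.
    void MarkAborted(ThreadAbortException tae);
}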

@stephentoub
Member Author

I have a Threadpool testing matrix if that's any use?

It is. Thanks, @benaadams.

@jkotas
Member

jkotas commented Feb 1, 2017

And if so, what's the plan for handling it in other places in coreclr/corefx where code has been written/updated that's not safe in the presence of thread aborts?

BTW: The majority of the full framework does not have hardening against thread aborts either. The thread abort hardening was only really done and tested for the fraction supported for SQL CLR in .NET Framework 2.0. A lot of newer code does not have it, or it just defensively fails fast when the state gets corrupted by an asynchronous exception (e.g. https://github.com/dotnet/coreclr/issues/8873).

@jkotas
Member

jkotas commented Feb 1, 2017

The catch block, just so that TAEs don't escape?

I would do ifdefs for these. I think we will want to have some ifdefs for MONO to make the sharing of code easier. This may be one of the cases.
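
A minimal sketch of the kind of ifdef being suggested (FEATURE_THREAD_ABORT is a hypothetical define name, and workItem/workQueue stand in for the dispatch loop's locals):

try
{
    workItem.ExecuteWorkItem();
}
#if FEATURE_THREAD_ABORT // hypothetical define: set for Mono builds, left unset for CoreCLR
catch (ThreadAbortException tae)
{
    // Keep the abort from escaping the dispatch loop and note it on the queue.
    workQueue.MarkAborted(tae);
}
#endif
finally
{
    // existing cleanup stays as-is
}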

- It's not exposed from CoreLib, but a) as long as the code is here it's good to have it in sync, and b) we can use it in the ThreadPool rather than having a separate concurrent queue implementation (which is very similar to the old ConcurrentQueue design).
- Use the same ConcurrentQueue<T> now being used in corefx in the ThreadPool instead of the ThreadPool's custom queue implementation (which was close in design to the old ConcurrentQueue<T> implementation).
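
Roughly, the second note above amounts to a swap like the following in the global queue (a sketch only; member names are illustrative, not the actual diff):

using System.Collections.Concurrent;

internal sealed class ThreadPoolWorkQueue
{
    // Global queue: previously a hand-rolled segmented lock-free queue,
    // now simply the ConcurrentQueue<T> ported back from corefx.
    internal readonly ConcurrentQueue<IThreadPoolWorkItem> workItems =
        new ConcurrentQueue<IThreadPoolWorkItem>();

    public void Enqueue(IThreadPoolWorkItem callback) => workItems.Enqueue(callback);

    public bool TryDequeue(out IThreadPoolWorkItem callback) => workItems.TryDequeue(out callback);
}
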
@marek-safar

I am no expert on your TP implementation, but all of the MarkAborted logic removal looks to me like functionality removal, so there is now an even smaller chance that TAE will work correctly with threads handled via the TP.

I am not sure MONO is the best define for this, as I'd prefer to use it for Mono-specific behaviour, but any FEATURE_xxx-like define is ok by me.

@stephentoub
Member Author

stephentoub commented Feb 1, 2017

On a 32-core box, I get these results (from the relevant test from @benaadams's suite):

Before:

                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)            305.647 k   304.657 k   267.508 k   142.030 k   159.971 k
- Depth    2                   257.311 k   255.905 k   196.120 k   228.258 k   223.412 k
- Depth   16                   249.092 k   229.903 k   235.552 k   221.771 k   184.343 k
- Depth   64                   655.092 k   925.397 k     1.024 M   963.160 k   525.194 k
- Depth  512                    11.960 M    29.558 M    31.997 M     8.901 M     8.944 M

QUWI No Queues                 139.736 k   305.175 k   275.723 k   184.958 k   207.581 k
- Depth    2                   259.010 k   254.372 k   226.757 k   196.575 k   162.695 k
- Depth   16                   192.865 k   264.679 k   612.228 k   396.366 k   136.811 k
- Depth   64                   861.451 k     1.224 M     2.256 M   413.970 k   145.162 k
- Depth  512                     2.967 M    55.032 M     8.361 M   702.845 k   177.688 k

After:

                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)            351.877 k   347.811 k   932.951 k     1.037 M     1.730 M
- Depth    2                   347.531 k   561.450 k   676.007 k   925.822 k   677.570 k
- Depth   16                     1.264 M     1.151 M     1.169 M     1.329 M     1.189 M
- Depth   64                     2.591 M     3.294 M     2.592 M     3.361 M     3.025 M
- Depth  512                    58.090 M    20.114 M    16.741 M    23.177 M    75.889 M

QUWI No Queues                 253.575 k   382.149 k   716.524 k   682.997 k   428.054 k
- Depth    2                   763.994 k   889.461 k   859.170 k   743.172 k   311.368 k
- Depth   16                     1.054 M     1.060 M     1.417 M     2.615 M   366.950 k
- Depth   64                     2.334 M     1.982 M     3.899 M     6.387 M   274.031 k
- Depth  512                    10.866 M     8.318 M    14.327 M    19.905 M   407.480 k

Out of the 50 measurements, 47 are improved, most by a lot. The remaining three (TP + Depth512 + 2x, TP + Depth512 + 16x, NoTP + Depth512 + 2x) regressed... I'm not sure why.

- Use the newly updated SpinLock.TryEnter(ref bool) instead of TryEnter(int, ref bool), as the former is ~3x faster when the lock is already held.
- Change the method to avoid the out reference argument and instead just return it
- Simplify the logic to consolidate the setting of missedSteal
Let these types be marked beforefieldinit
- Consolidate branches
- Pass resulting work item as a return value rather than an out
- Avoid an unnecessary write to the bool passed by-ref
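
A rough sketch of the SpinLock tweak from the first note above, on the stealing path (victimLock is a placeholder name for the per-queue lock; missedSteal is the flag mentioned above):

bool lockTaken = false;
try
{
    // Before: victimLock.TryEnter(0, ref lockTaken);
    // After: the TryEnter(ref bool) overload, ~3x faster when the lock is already held.
    victimLock.TryEnter(ref lockTaken);
    if (lockTaken)
    {
        // ... try to steal a work item from the victim's local queue ...
    }
    else
    {
        missedSteal = true; // lock was contended; record the missed steal and move on
    }
}
finally
{
    if (lockTaken)
    {
        victimLock.Exit(useMemoryBarrier: false);
    }
}
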
@benaadams
Member

Assume it's dual proc? Might be hitting massive false sharing at the high throughput (since they are mostly empty actions), with 2x threads queuing batches of 512 items in a tight loop and the other threads trying to dequeue.

What if you change the queue to

// requires using System.Runtime.InteropServices; for StructLayout
[StructLayout(LayoutKind.Sequential, Size = 64)]
struct WorkItem
{
    // pad each slot out to a cache line so adjacent entries don't false-share
    IThreadPoolWorkItem item;
}

internal readonly ConcurrentQueue<WorkItem> workItems = new ConcurrentQueue<WorkItem>();

Size of segment is less important since they are now reused?

- Remove some explicit ctors
- Remove some unnecessary casts
- Mark some fields readonly
- Follow style guidelines for visibility/static ordering in signatures
- Move usings to top of file
- Delete some stale comments
- Added names for bool args at call sites
- Use expression-bodied members in a few places
- Pass lower bounds to Array.Copy
- Remove unnecessary "success" local in QueueUserworkItemHelper
@stephentoub
Member Author

I've removed the changes related to thread aborts for now, as I don't want those to hold this up. We can revisit subsequently.

@stephentoub
Member Author

stephentoub commented Feb 1, 2017

Assume it's dual proc?

Yes, two sockets/numa nodes, each with 8 physical cores, each hyperthreaded.

Might be hitting massive false sharing

It's possible. All producers need to synchronize on one field and all consumers on a separate one, but if there are few enough items in the queue, that could be introducing sharing between producers and consumers.
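
For reference, the rewritten ConcurrentQueue<T> already tries to keep those two hot positions on separate cache lines; a rough sketch of that idea (the exact layout in the ported code may differ):

using System.Runtime.InteropServices;

// Head and Tail are padded apart so that consumers (who advance Head) and
// producers (who advance Tail) do not false-share a cache line.
[StructLayout(LayoutKind.Explicit, Size = 192)]   // 3 x 64-byte cache lines
internal struct PaddedHeadAndTail
{
    [FieldOffset(64)]  public int Head;   // next slot to dequeue from
    [FieldOffset(128)] public int Tail;   // next slot to enqueue into
}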

Size of segment is less important since they are now reused?

Maybe, but we're talking at least an 8x increase in memory consumption (a 64-byte padded slot instead of an 8-byte object reference). While that may help with false sharing, it'll also potentially cause more paging, and without modifying/diverging ConcurrentQueue to itself contain the padding on each element, it'll result in 8x more data movement in and out.

Regardless, I'll try it and see what kind of impact it has. Though I only mentioned the 3 regressions for completeness; I think the significant wins in the other 47 cases more than make up for it. It'd be great to see this be a pure win, though.

@karelz
Member

karelz commented Feb 1, 2017

@benaadams would it make sense to integrate your suite as perf tests in CoreFX repo?
cc: @danmosemsft

@stephentoub
Member Author

I'll try it and see what kind of impact it has

@benaadams, I tried changing the ConcurrentQueue<IThreadPoolWorkItem> to ConcurrentQueue<Padded64WorkItem>. Results are all over the board, significantly better in many cases, significantly worse in many others. But it didn't significantly change the three cases I previously called out.

Review comment on this code:

    //threads.
    EnsureVMInitialized();
    throw new ArgumentNullException(nameof(WaitCallback));

This should probably be nameof(callBack)
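
If so, the fix presumably amounts to no more than:

    throw new ArgumentNullException(nameof(callBack));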

Member Author


Certainly looks like it. This is what was there before, but I can change it. Need to check to see whether any existing CoreFx tests are verifying the name.

@benaadams
Member

@karelz would be happy for them to be. I don't understand how perf tests work in CoreFX (how they run, whether they are tracked, published/monitored, etc.), but I would be happy for someone to use/adapt them.

@stephentoub will have a dig. The "deep" tests are more like a Task spawning subtasks; I added an API suggestion to reduce contention on the global queue for these: https://github.com/dotnet/corefx/issues/12442

Might be worth running a Kestrel plaintext benchmark as it heavily uses threadpool? Don't think it will experience the regression as it goes wide rather than deep /cc @CesarBS @halter73

@karelz
Member

karelz commented Feb 1, 2017

@benaadams we're just reviving the CoreFX perf tests story; @danmosemsft is driving that, and he should know how/when we are ready to onboard more tests.
For the short/mid term, I don't expect us to publish the results in a CI way (we just don't have the public infra for that yet; we're struggling with the internal one first). Monitoring is being discussed.

The key value I see in having perf tests right now is to have a uniform way to create perf tests and pass them along / ask others to run them.

@danmoseley
Member

@DrewScoggins is getting the perf tests up again.

@DrewScoggins
Member

Yes. I have been working on this for a bit, and had to do a bit of work to make things work in the new dev-eng world, but it looks like I have got things where they need to be. I am doing some final local testing, and will then make some PRs to get this up and running.

- Fix ArgumentNullException parameter name
Review comment on these lines:

    using System.Diagnostics;
    using System.Diagnostics.Contracts;
    using System.Diagnostics.CodeAnalysis;
    using System.Diagnostics.Tracing;


nit: sort

Member Author


Thanks, @CesarBS. As I just moved them from where they were previously, I'll merge this as-is and sort them in a subsequent change. I have some other things to explore in improving throughput / reducing overheads.

@stephentoub
Member Author

@benaadams, @CesarBS, @vancem, with a 4-core server in Azure, I ran a bunch of pre- and post- tests on plaintext, and this change improved requests/sec on average by a little more than 5%.

@stephentoub stephentoub merged commit 353bd0b into dotnet:master Feb 2, 2017
@stephentoub stephentoub deleted the tp_perf branch February 2, 2017 04:42
@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Some ThreadPool performance and maintainability improvements

Commit migrated from dotnet/coreclr@353bd0b