
Conversation

stephentoub
Member

@stephentoub stephentoub commented Jan 31, 2017

Two main changes here:

  • There's a fair amount of cruft in the ThreadPool related to thread aborts. As they don't exist on coreclr, we can clean all of that up. This includes removing one of the two methods from the IThreadPoolWorkItem type, as it was entirely about thread aborts.
  • In Rewrite ConcurrentQueue<T> for better performance corefx#14254, ConcurrentQueue in CoreFX was rewritten for better performance. The ThreadPool's global queue used an algorithm very much like that of the old ConcurrentQueue. This change ports corefx's new ConcurrentQueue back to CoreCLR, and then uses it from the ThreadPool.

In a microbenchmark that just queues a bunch of work items from one (non-ThreadPool) thread and waits for them all to complete (each work item just signals a CountdownEvent), throughput improved by ~2x.

In a microbenchmark that has Environment.ProcessorCount non-ThreadPool threads all queueing, throughput improved by ~50%. (Note that this change does not affect the ThreadPool's local queues, only the global.) For me, ProcessorCount is 8, as I'm on a quad-core hyperthreaded machine.
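
For reference, a minimal sketch of the kind of single-producer microbenchmark described above (the actual harness isn't included in this PR; the item count and type names here are illustrative):

using System;
using System.Diagnostics;
using System.Threading;

class QueueThroughputBench
{
    static void Main()
    {
        const int Items = 10000000;                 // illustrative count
        var done = new CountdownEvent(Items);
        var sw = Stopwatch.StartNew();

        // A single non-ThreadPool thread queues all the work items;
        // each work item just signals the CountdownEvent.
        for (int i = 0; i < Items; i++)
        {
            ThreadPool.QueueUserWorkItem(state => ((CountdownEvent)state).Signal(), done);
        }

        done.Wait();
        sw.Stop();
        Console.WriteLine($"{Items / sw.Elapsed.TotalSeconds:N0} items/sec");
    }
}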

There's also a small tweak to take advantage of #9224.

cc: @jkotas, @benaadams, @vancem, @kouvel

Member

@jkotas jkotas left a comment


Nice!

@stephentoub
Member Author

Thanks. I'm going to run a few more rounds of tests tomorrow before merging.

@benaadams
Member

I have a Threadpool testing matrix if that's any use? https://github.com/benaadams/ThreadPoolTaskTesting

@marek-safar

This sort of change makes it much harder to integrate CoreFX sources into Mono/Xamarin, as it's a breaking change for Mono/Xamarin. We'll basically have to undo all your TAE logic removal and put it back. It'd be much more useful to keep it there (it's not really causing any harm), or at least move it to another partial class if you really want to have the TAE handling extracted.

cc @karelz

@stephentoub
Member Author

This sort of change makes it much harder to integrate CoreFX sources into Mono/Xamarin, as it's a breaking change for Mono/Xamarin. We'll basically have to undo all your TAE logic removal and put it back.

Thanks for the heads up. I don't want to make it a lot harder on Mono, but which logic in particular are you referring to?

The catch block, just so that TAEs don't escape?

Or the empty trys with code in finallys? Does Mono care about that level of reliability in the face of TAEs? And if so, what's the plan for handling it in other places in coreclr/corefx where code has been written/updated that's not safe in the presence of thread aborts?

Or the interface change? That one doesn't really have a perf benefit right now (other than smaller dispatch tables), but I expect that it could with additional work. Going from an interface with two methods to an interface with one opens up other possibilities for how work items are represented.
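
For context, a rough sketch of what the interface in question looked like before this PR (member names are from memory and may not match the tree exactly; MarkAborted is the abort-related method mentioned later in this thread):

using System.Threading;

// Approximate shape of the interface before this PR; the abort-only member is
// the one being removed, leaving a single-method interface.
internal interface IThreadPoolWorkItem
{
    void ExecuteWorkItem();

    // Used only for thread-abort bookkeeping; removing it is the change under discussion.
    void MarkAborted(ThreadAbortException tae);
}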

@stephentoub
Member Author

I have a Threadpool testing matrix if that's any use?

It is. Thanks, @benaadams.

@jkotas
Member

jkotas commented Feb 1, 2017

And if so, what's the plan for handling it in other places in coreclr/corefx where code has been written/updated that's not safe in the presence of thread aborts?

BTW: The majority of the full framework does not have hardening against thread aborts either. The thread abort hardening was only really done and tested for the fraction supported for SQL CLR in .NET Framework 2.0. A lot of newer code does not have it, or it just defensively fails fast when the state gets corrupted by an asynchronous exception (e.g. https://github.com/dotnet/coreclr/issues/8873).

@jkotas
Member

jkotas commented Feb 1, 2017

The catch block, just so that TAEs don't escape?

I would do ifdefs for these. I think we will want to have some ifdefs for MONO to make the sharing of code easier. This may be one of the cases.
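
A minimal sketch of the kind of ifdef being suggested (FEATURE_THREAD_ABORT is a hypothetical define name, and workItem/workQueue stand in for the dispatch loop's locals):

try
{
    workItem.ExecuteWorkItem();
}
#if FEATURE_THREAD_ABORT // hypothetical define: set for Mono builds, left unset for CoreCLR
catch (ThreadAbortException tae)
{
    // Keep the abort from escaping the dispatch loop and note it on the queue.
    workQueue.MarkAborted(tae);
}
#endif
finally
{
    // existing cleanup stays as-is
}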

- It's not exposed from CoreLib, but a) as long as the code is here it's good to have it in sync, and b) we can use it in the ThreadPool rather than having a separate concurrent queue implementation (which is very similar to the old ConcurrentQueue design).
- Use the same ConcurrentQueue<T> now being used in corefx in the ThreadPool instead of the ThreadPool's custom queue implementation (which was close in design to the old ConcurrentQueue<T> implementation).
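
Roughly, the second note above amounts to a swap like the following in the global queue (a sketch only; member names are illustrative, not the actual diff):

using System.Collections.Concurrent;

internal sealed class ThreadPoolWorkQueue
{
    // Global queue: previously a hand-rolled segmented lock-free queue,
    // now simply the ConcurrentQueue<T> ported back from corefx.
    internal readonly ConcurrentQueue<IThreadPoolWorkItem> workItems =
        new ConcurrentQueue<IThreadPoolWorkItem>();

    public void Enqueue(IThreadPoolWorkItem callback) => workItems.Enqueue(callback);

    public bool TryDequeue(out IThreadPoolWorkItem callback) => workItems.TryDequeue(out callback);
}
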
@marek-safar

I am no expert on your TP implementation, but all of the MarkAborted logic removal looks to me like functionality removal, so there is now an even smaller chance that TAE will work correctly with threads handled via the TP.

I am not sure MONO is the best define for this, as I'd prefer to use it for Mono-specific behaviour, but any FEATURE_xxx-like define is ok by me.

@stephentoub
Member Author

stephentoub commented Feb 1, 2017

On a 32-core box, I get these results (from the relevant test from @benaadams's suite):

Before:

                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)            305.647 k   304.657 k   267.508 k   142.030 k   159.971 k
- Depth    2                   257.311 k   255.905 k   196.120 k   228.258 k   223.412 k
- Depth   16                   249.092 k   229.903 k   235.552 k   221.771 k   184.343 k
- Depth   64                   655.092 k   925.397 k     1.024 M   963.160 k   525.194 k
- Depth  512                    11.960 M    29.558 M    31.997 M     8.901 M     8.944 M

QUWI No Queues                 139.736 k   305.175 k   275.723 k   184.958 k   207.581 k
- Depth    2                   259.010 k   254.372 k   226.757 k   196.575 k   162.695 k
- Depth   16                   192.865 k   264.679 k   612.228 k   396.366 k   136.811 k
- Depth   64                   861.451 k     1.224 M     2.256 M   413.970 k   145.162 k
- Depth  512                     2.967 M    55.032 M     8.361 M   702.845 k   177.688 k

After:

                                                                             Parallelism
                                  Serial          2x         16x         64x        512x
QUWI No Queues (TP)            351.877 k   347.811 k   932.951 k     1.037 M     1.730 M
- Depth    2                   347.531 k   561.450 k   676.007 k   925.822 k   677.570 k
- Depth   16                     1.264 M     1.151 M     1.169 M     1.329 M     1.189 M
- Depth   64                     2.591 M     3.294 M     2.592 M     3.361 M     3.025 M
- Depth  512                    58.090 M    20.114 M    16.741 M    23.177 M    75.889 M

QUWI No Queues                 253.575 k   382.149 k   716.524 k   682.997 k   428.054 k
- Depth    2                   763.994 k   889.461 k   859.170 k   743.172 k   311.368 k
- Depth   16                     1.054 M     1.060 M     1.417 M     2.615 M   366.950 k
- Depth   64                     2.334 M     1.982 M     3.899 M     6.387 M   274.031 k
- Depth  512                    10.866 M     8.318 M    14.327 M    19.905 M   407.480 k

Out of the 50 measurements, 47 are improved, most by a lot. The remaining three (TP + Depth512 + 2x, TP + Depth512 + 16x, NoTP + Depth512 + 2x) regressed... I'm not sure why.

- Use the newly updated SpinLock.TryEnter(ref bool) instead of TryEnter(int, ref bool), as the former is ~3x faster when the lock is already held.
- Change the method to avoid the out reference argument and instead just return it
- Simplify the logic to consolidate the setting of missedSteal
Let these types be marked beforefieldinit
- Consolidate branches
- Pass resulting work item as a return value rather than an out
- Avoid an unnecessary write to the bool passed by-ref
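
A rough sketch of the SpinLock tweak from the first note above, on the stealing path (victimLock is a placeholder name for the per-queue lock; missedSteal is the flag mentioned above):

bool lockTaken = false;
try
{
    // Before: victimLock.TryEnter(0, ref lockTaken);
    // After: the TryEnter(ref bool) overload, ~3x faster when the lock is already held.
    victimLock.TryEnter(ref lockTaken);
    if (lockTaken)
    {
        // ... try to steal a work item from the victim's local queue ...
    }
    else
    {
        missedSteal = true; // lock was contended; record the missed steal and move on
    }
}
finally
{
    if (lockTaken)
    {
        victimLock.Exit(useMemoryBarrier: false);
    }
}
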
@benaadams
Member

Assume it's dual proc? Might be hitting massive false sharing at the high throughput (since they are mostly empty actions), with 2x threads queuing batches of 512 items in a tight loop and the other threads trying to dequeue.

What if you change the queue to

// requires using System.Runtime.InteropServices; for StructLayout
[StructLayout(LayoutKind.Sequential, Size = 64)]
struct WorkItem
{
    // pad each slot out to a cache line so adjacent entries don't false-share
    IThreadPoolWorkItem item;
}

internal readonly ConcurrentQueue<WorkItem> workItems = new ConcurrentQueue<WorkItem>();

Size of segment is less important since they are now reused?

- Remove some explicit ctors
- Remove some unnecessary casts
- Mark some fields readonly
- Follow style guidelines for visibility/static ordering in signatures
- Move usings to top of file
- Delete some stale comments
- Added names for bool args at call sites
- Use expression-bodied members in a few places
- Pass lower bounds to Array.Copy
- Remove unnecessary "success" local in QueueUserworkItemHelper
@stephentoub
Member Author

I've removed the changes related to thread aborts for now, as I don't want those to hold this up. We can revisit subsequently.

@stephentoub
Member Author

stephentoub commented Feb 1, 2017

Assume it's dual proc?

Yes, two sockets/numa nodes, each with 8 physical cores, each hyperthreaded.

Might be hitting massive false sharing

It's possible. All producers need to synchronize on one field and all consumers on a separate one, but if there are few enough items in the queue, that could be introducing sharing between producers and consumers.
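
For reference, the rewritten ConcurrentQueue<T> already tries to keep those two hot positions on separate cache lines; a rough sketch of that idea (the exact layout in the ported code may differ):

using System.Runtime.InteropServices;

// Head and Tail are padded apart so that consumers (who advance Head) and
// producers (who advance Tail) do not false-share a cache line.
[StructLayout(LayoutKind.Explicit, Size = 192)]   // 3 x 64-byte cache lines
internal struct PaddedHeadAndTail
{
    [FieldOffset(64)]  public int Head;   // next slot to dequeue from
    [FieldOffset(128)] public int Tail;   // next slot to enqueue into
}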

Size of segment is less important since they are now reused?

Maybe, but we're talking at least an 8x increase in memory consumption (a 64-byte padded slot instead of an 8-byte object reference). While that may help with false sharing, it'll also potentially cause more paging, and without modifying/diverging ConcurrentQueue to itself contain the padding on each element, it'll result in 8x more data movement in and out.

Regardless, I'll try it and see what kind of impact it has. Though I only mentioned the 3 regressions for completeness; I think the significant wins in the other 47 cases more than make up for it. It'd be great to see this be a pure win, though.

@karelz
Member

karelz commented Feb 1, 2017

@benaadams would it make sense to integrate your suite as perf tests in CoreFX repo?
cc: @danmosemsft

@stephentoub
Member Author

I'll try it and see what kind of impact it has

@benaadams, I tried changing the ConcurrentQueue<IThreadPoolWorkItem> to ConcurrentQueue<Padded64WorkItem>. Results are all over the board, significantly better in many cases, significantly worse in many others. But it didn't significantly change the three cases I previously called out.

Review comment on this code:

    //threads.
    EnsureVMInitialized();
    throw new ArgumentNullException(nameof(WaitCallback));

This should probably be nameof(callBack)
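
If so, the fix presumably amounts to no more than:

    throw new ArgumentNullException(nameof(callBack));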

Member Author


Certainly looks like it. This is what was there before, but I can change it. Need to check to see whether any existing CoreFx tests are verifying the name.

@benaadams
Member

@karelz would be happy for them to be. I don't understand how perf tests work in CoreFX (how they run, whether they are tracked, published/monitored, etc.), but I would be happy for someone to use/adapt them.

@stephentoub will have a dig. The "deep" tests are more like a Task spawning subtasks; I added an API suggestion to reduce contention on the global queue for these: https://github.com/dotnet/corefx/issues/12442

Might be worth running a Kestrel plaintext benchmark as it heavily uses threadpool? Don't think it will experience the regression as it goes wide rather than deep /cc @CesarBS @halter73

@karelz
Member

karelz commented Feb 1, 2017

@benaadams we're just reviving the CoreFX perf tests story; @danmosemsft is driving that, and he should know how/when we are ready to onboard more tests.
For the short/mid term, I don't expect us to publish the results in a CI way (we just don't have the public infra for that yet; we're struggling with the internal one first). Monitoring is being discussed.

The key value I see in having perf tests right now is to have a uniform way to create perf tests and pass them along / ask others to run them.

@danmoseley
Member

@DrewScoggins is getting the perf tests up again.

@DrewScoggins
Member

Yes. I have been working on this for a bit, and had to do a bit of work to make things work in the new dev-eng world, but it looks like I have got things where they need to be. I am doing some final local testing, and will then make some PRs to get this up and running.

- Fix ArgumentNullException parameter name
Review comment on these lines:

    using System.Diagnostics;
    using System.Diagnostics.Contracts;
    using System.Diagnostics.CodeAnalysis;
    using System.Diagnostics.Tracing;


nit: sort

Member Author


Thanks, @CesarBS. As I just moved them from where they were previously, I'll merge this as-is and sort them in a subsequent change. I have some other things to explore in improving throughput / reducing overheads.

@stephentoub
Member Author

@benaadams, @CesarBS, @vancem, with a 4-core server in Azure, I ran a bunch of pre- and post- tests on plaintext, and this change improved requests/sec on average by a little more than 5%.

@stephentoub stephentoub merged commit 353bd0b into dotnet:master Feb 2, 2017
@stephentoub stephentoub deleted the tp_perf branch February 2, 2017 04:42
@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Some ThreadPool performance and maintainability improvements

Commit migrated from dotnet/coreclr@353bd0b