Conversation

@kouvel kouvel commented Jun 22, 2020

  • Enables using the portable thread pool with coreclr as an opt-in. The change is off by default for now and can be enabled with COMPlus_ThreadPool_UsePortableThreadPool=1 (see the example after this list). Once it has had bake time and is seen to be stable, at a reasonable time in the future the config flag would ideally be removed and the relevant parts of the native implementation deleted.
  • The IO thread pool is not being migrated in this change, and remains on the native side
  • My goal was to get behavior compatible with the native implementation in coreclr and with diagnostics tools, and similar perf. I tried to avoid changing scheduling behavior, the behavior of heuristics, etc., compared with that implementation.
  • The eventual goal is to have one mostly managed thread pool implementation that can be shared between runtimes, to ease maintenance going forward
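
For illustration, a minimal sketch of opting a test app into the portable thread pool by setting the environment variable for a launched process. The app name is hypothetical; the only detail taken from this change is the COMPlus_ThreadPool_UsePortableThreadPool=1 opt-in variable.

```csharp
using System.Diagnostics;

class EnablePortableThreadPoolSample
{
    static void Main()
    {
        // "MyApp.dll" is a hypothetical app to launch with the opt-in enabled.
        var psi = new ProcessStartInfo("dotnet", "MyApp.dll") { UseShellExecute = false };

        // Opt in to the portable thread pool (off by default in this change).
        psi.Environment["COMPlus_ThreadPool_UsePortableThreadPool"] = "1";

        using var process = Process.Start(psi);
        process.WaitForExit();
    }
}
```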

Commit descriptions:

  • "Add dependencies"
    • Ported LowLevelLock from CoreRT and moved LowLevelSpinWaiter to shared. Since we support Thread.Interrupt(), they were necessary in the wait subsystem in CoreRT partly to support that, and were also used in the portable thread pool implementation, where a pending thread interrupt on a thread pool thread would otherwise crash the process. Interruptible waits are already used in the managed side of the thread pool in the queue implementations. It may be reasonable to ignore the thread interrupt problem and suggest that Thread.Interrupt() not be used on thread pool threads, but for now I just brought in the dependencies to keep behavior consistent with the native implementation.
  • "Add config var"
    • Added config var COMPlus_ThreadPool_UsePortableThreadPool (disabled by default for now)
    • Flowed the new config var to the managed side and set up a mechanism to flow all of the thread pool config vars
    • Removed debug-only config var COMPlus_ThreadpoolTickCountAdjustment, which didn't seem to be too useful
    • Specialized native and managed thread pool paths based on the config var. Added assertions to paths that should not be reached depending on the config var.
  • "Move portable RegisteredWaitHandle implementation to shared ThreadPool.cs"
    • Just moved the portable implementation, no functional changes. In preparation for merging the two implementations.
  • "Merge RegisteredWaitHandle implementations"
    • Merged implementations of RegisteredWaitHandle using the portable version as the primary and specializing small parts of it for coreclr
    • Fixed PortableThreadPool's registered waits to track SafeWaitHandles instead of WaitHandles similarly to the native implementation. The SafeWaitHandle in a WaitHandle can be modified, so it is retrieved once and reused thereafter. Also added/removed refs for the SafeWaitHandles that are registered.
  • "Separate portable-only portion of RegisteredWaitHandle"
    • Separated RegisteredWaitHandle.UnregisterPortable into a different file, no functional changes. Those paths reference PortableThreadPool, which is conditionally included unlike ThreadPool.cs. Just for consistency such that the new file can be conditionally included similarly to PortableThreadPool.
  • "Fix timers, tiered compilation, introduced time-sensitive work item queue to simulate coreclr behavior"
    • Wired work items queued from the native side (appdomain timer callback, tiered compilation background work callback) to queue them into the managed side
    • The timer thread calls into managed code to queue the callback
    • Some tiered compilation work item queuing paths cannot call managed code, so used a timer with zero due time instead
    • Added a queue of "time-sensitive" work items to the managed side to mimic how work items queued from the native side ran previously. In particular, when using the native thread pool, the native work items still run ahead of a backed-up global queue periodically (based on the Dispatch quantum). They could instead be queued into the global queue, but if it is backed up that could significantly, and perhaps artificially, delay the appdomain timer callback and the tiering background jitting. I didn't want to change the behavior in an observable (and potentially bad) way here for now; a good time to revisit this would be when IO completion handling is added to the portable thread pool, at which point the native work items could be handled somewhat similarly.
  • "Implement ResetThreadPoolThread, set thread names for diagnostics"
    • Aside from implementing ResetThreadPoolThread, setting the thread names (at the OS level) allows debuggers to identify the threads better, as before. For threads that may run user code, the thread Name property is kept null as before, such that it may be set without exception.
  • "Cache-line-separate PortableThreadPool._numRequestedWorkers similarly to coreclr"
    • Was missed before, separated it for consistency
  • "Post wait completions to the IO completion port on Windows for coreclr, similarly to before"
    • On Windows, wait completions are queued to the IO thread pool, which is still implemented on the native side. On Unixes, they are queued to the global queue.
  • "Reroute managed gate thread into unmanaged side to perform gate activities, don't use unmanaged gate thread"
    • When the config var is enabled, removed the gate thread from the native side. Instead, the gate thread on the managed side calls into the native side to perform gate activities for the IO completion thread pool, and returns a value to indicate whether the gate thread is still necessary.
    • Also added a native-to-managed entry point to request the gate thread to run for the IO completion thread pool
  • "Flow config values from CoreCLR to the portable thread pool for compat"
    • Flowed the rest of the thread pool config vars to the managed side, such that COMPlus variables continue to work with the portable thread pool
    • Config var values are stored in AppContext, made the names consistent for supported and unsupported values
  • "Port - ..." * 3
    • Ported a few fixes that did not make it into the portable thread pool implementation
  • "Fix ETW events"
    • Fixed the EventSource used by the portable thread pool, added missing events
    • For now, the event source uses the same name and GUID as the native side. It seems to work for now for ETW, we may switch to a separate provider (along with updating tools) before enabling the portable thread pool by default.
    • For enqueue/dequeue events, changed to use the object's hash code as the work item identifier instead of the pointer since the pointer may change between enqueue and dequeue
  • "Fix perf of counts structs"
    • Structs used for multiple counts with interlocked operations were implemented with explicit struct layout and field offsets. The JIT seems to generate stack-based code for such structs and it was showing up as higher overhead in perf profiles compared to the equivalent native implementation. Slower code in compare-exchange loops can cause a larger gap of time between the read and the compare-exchange, which can also cause higher contention.
    • Changed the structs to use manual bit manipulation instead, and micro-optimized some paths (a rough sketch of the pattern follows the commit list). The code is still not as good as that generated by C++, but it seems to perform similarly based on perf profiles.
    • Code size also improved in many cases, for example one of the larger differences was in MaybeAddWorkingWorker(), which decreased from 585 bytes to 382 bytes and with far fewer stack memory operations
  • "Fix perf of dispatch loop"
    • Just some minor tweaks as I was looking at perf profiles and code of Dispatch()
  • "Fix perf of ThreadInt64PersistentCounter"
    • The implementation used to count completed work items was using ThreadLocal<T>, which turned out to be too slow for that purpose according to perf profiles
    • Changed it to rely on the user of the component to provide an object that tracks the count, which the user obtains from a ThreadStatic field (see the sketch after the commit list)
    • Also removed the thread-local lookup per iteration in one of the hot paths in Dispatch() and improved inlining
  • "Miscellaneous perf fixes"
    • A few more small tweaks as I was looking at perf profiles and code
    • In ConcurrentQueue, added check for empty into the fast path
    • For the portable thread pool, updated to trigger the worker thread Wait event after the short spin-wait completes and before actually waiting; the event is otherwise too verbose when profiling and changes performance characteristics
    • Cache-line-separated the gate thread running state as is done in the native implementation
    • Accessing PortableThreadPool.ThreadPoolInstance multiple times was generating less-than-ideal code that was noticeable in perf profiles. Tried to avoid it, especially in hot paths, and in some cases where it wasn't strictly necessary, for consistency if nothing else.
    • Removed an extra call to Environment.TickCount in Dispatch() per iteration
    • Noticed that a field that was intended to be cache-line-separated was not actually being separated (see #38215, "ThreadPoolWorkQueue.numOutstandingThreadRequests is not being padded as requested, despite the explicit sequential layout"); fixed. A sketch of the padding pattern follows the commit list.
  • "Fix starvation heuristic"
    • Described in comment
  • "Implement worker tracking"
    • Implemented the equivalent in the portable thread pool along with raising the relevant event
  • "Use smaller stack size for threads that don't run user code"
    • Using the same stack size as in the native side for those threads
  • "Note some SOS dependencies, small fixes in hill climbing to make equivalent to coreclr"
  • "Port some tests from CoreRT"
    • Also improved some of the tests
  • "Fail-fast in thread pool native entry points specific to thread pool implementations based on config"
    • Scanned all of the managed-to-native entry points from the thread pool and thread-entry functions, and promoted some assertions to be verified in all builds with fail-fast. May help to know in release builds when a path that should not be taken is taken and to avoid running further along that path.
  • "Fix SetMinThreads() and SetMaxThreads() to return true only when both changes are successful with synchronization"
    • These are a bit awkward when the portable thread pool is enabled: they should return true only when both changes are valid and return false without making any changes otherwise, yet the worker thread pool is on the managed side while the IO thread pool is on the native side
    • Added some managed-to-native entry points to allow checking validity before making the changes, all under a lock taken by the managed side
  • "Fix registered wait removals for fairness since there can be duplicate system wait objects in the wait array"
    • Described in comment
  • "Allow multiple DotNETRuntime event providers/sources in EventPipe"
    • Temporary change to EventPipe to be able to get events from dotnet-trace
    • For now, the event source uses the same name and GUID as the native side. It seems to work for now for ETW, and with this change it seems to work with EventPipe for getting events. Subscribing to the NativeRuntimeEventSource does not get thread pool events yet, that is left for later. We may switch to a separate provider (along with updating tools) before enabling the portable thread pool by default, as a long-term solution.
  • "Fix registered wait handle timeout logic in the wait thread"
    • The timeout logic was comparing against how long the last wait took and sometimes was not timing out waits; fixed to consider the total time since the last reset of the timeout instead (see the sketch after the commit list)
  • "Fix Browser build"
    • Updated the Browser-specific thread pool variant based on the other changes
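
Regarding "Fix perf of counts structs": below is a rough sketch of the manual bit-manipulation pattern with a typical interlocked compare-exchange loop. The type, field names, and bit widths are illustrative only, not the runtime's actual counts structs.

```csharp
using System.Threading;

internal static class WorkerCountsSketch
{
    // Two 16-bit counts packed into one int so both can be read and updated atomically.
    private static int s_counts; // low 16 bits: processing work, high 16 bits: existing threads

    private static int GetNumProcessingWork(int counts) => counts & 0xFFFF;

    private static int WithNumProcessingWork(int counts, int value) =>
        (counts & ~0xFFFF) | value;

    // Typical compare-exchange loop: read once, compute the new packed value, and
    // retry only if another thread changed the counts in between.
    public static void IncrementNumProcessingWork()
    {
        int counts = Volatile.Read(ref s_counts);
        while (true)
        {
            int newCounts = WithNumProcessingWork(counts, GetNumProcessingWork(counts) + 1);
            int countsBeforeUpdate = Interlocked.CompareExchange(ref s_counts, newCounts, counts);
            if (countsBeforeUpdate == counts)
            {
                break;
            }

            counts = countsBeforeUpdate;
        }
    }
}
```

Compared with an explicit-layout struct, the packed-int form keeps the counts in a register, which shortens the window between the read and the compare-exchange.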
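
Regarding "Fix perf of ThreadInt64PersistentCounter": a minimal sketch, with hypothetical names, of the pattern where the user of the counter supplies a per-thread count object obtained from a [ThreadStatic] field, so the hot path avoids a ThreadLocal<T> lookup.

```csharp
using System;
using System.Runtime.CompilerServices;

internal sealed class PerThreadCount
{
    public long Count;
}

internal static class CompletedWorkItemCounterSketch
{
    [ThreadStatic]
    private static PerThreadCount t_count;

    // The dispatch loop would call this once, cache the returned object in a local,
    // and then increment Count directly per completed work item.
    public static PerThreadCount GetOrCreateForCurrentThread() =>
        t_count ?? CreateForCurrentThread();

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static PerThreadCount CreateForCurrentThread()
    {
        var count = new PerThreadCount();
        // A real implementation would also register the object with the counter so that
        // an aggregate total can be computed across threads; omitted here.
        t_count = count;
        return count;
    }
}
```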
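
Regarding the cache-line separation items ("Cache-line-separate PortableThreadPool._numRequestedWorkers", the gate thread running state, and the #38215 fix): a sketch of the padding pattern, assuming a 64-byte cache line and using illustrative names (the runtime has its own padding helpers and platform-specific cache line sizes).

```csharp
using System.Runtime.InteropServices;
using System.Threading;

internal static class CacheLinePaddingSketch
{
    private const int CacheLineSize = 64; // assumption for illustration

    // The contended field is placed a full cache line away from whatever precedes and
    // follows the struct, so frequent interlocked writes don't cause false sharing.
    [StructLayout(LayoutKind.Explicit, Size = CacheLineSize * 3)]
    private struct CacheLineSeparated
    {
        [FieldOffset(CacheLineSize)]
        public int numRequestedWorkers;
    }

    private static CacheLineSeparated s_separated;

    public static void RequestWorker() =>
        Interlocked.Increment(ref s_separated.numRequestedWorkers);
}
```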
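
Regarding "Fix registered wait handle timeout logic in the wait thread": a sketch, with illustrative names, of deriving the remaining timeout from the total time since the timeout was last reset rather than from the duration of the last wait.

```csharp
using System;

internal sealed class RegisteredWaitTimeoutSketch
{
    private readonly int _timeoutMs;   // user-requested timeout for the registered wait
    private int _timeoutStartTimeMs;   // Environment.TickCount at the last timeout reset

    public RegisteredWaitTimeoutSketch(int timeoutMs)
    {
        _timeoutMs = timeoutMs;
        RestartTimeout();
    }

    // Called when the timeout is reset (e.g. when the wait is registered),
    // not after every individual wait.
    public void RestartTimeout() => _timeoutStartTimeMs = Environment.TickCount;

    // The wait thread uses this before each wait; a wait that returns early no longer
    // prevents the registered wait from eventually timing out.
    public int GetRemainingTimeoutMs()
    {
        int elapsedMs = Environment.TickCount - _timeoutStartTimeMs;
        return elapsedMs >= _timeoutMs ? 0 : _timeoutMs - elapsedMs;
    }
}
```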

Corresponding PR to update SOS: dotnet/diagnostics#1274
Fixes #32020

@kouvel kouvel added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-System.Threading labels Jun 22, 2020
@kouvel kouvel added this to the 5.0.0 milestone Jun 22, 2020
@kouvel kouvel self-assigned this Jun 22, 2020
@kouvel kouvel added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) and removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) labels Jun 22, 2020

kouvel commented Jun 22, 2020

Corresponding PR to update SOS: dotnet/diagnostics#1274

Change is ready for review, but flagging as NO-MERGE for now, until both PRs are ready.


kouvel commented Jun 22, 2020

Perf data and testing info to follow


kouvel commented Jun 22, 2020

ASP.NET RPS perf results

| Benchmark | Machine | OS | Connections | Clr before | Clr after | Diff | Clr after with PTP | Diff from before | Mono JIT before | Mono JIT after | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextPlatform | 28-proc x64 | Linux | 512 | 14467899 | 14456871 | -0.1% | 14819954 | 2.4% | 9626436 | 9679046 | 0.5% |
| Plaintext | 28-proc x64 | Linux | 512 | 5265058 | 5286626 | 0.4% | 5344409 | 1.5% | 2855862 | 2890546 | 1.2% |
| JsonPlatform | 28-proc x64 | Linux | 512 | 1164882 | 1169727 | 0.4% | 1209226 | 3.8% | 830938 | 842005 | 1.3% |
| Json | 28-proc x64 | Linux | 512 | 954094 | 958050 | 0.4% | 987212 | 3.5% | 619826 | 624145 | 0.7% |

No change when the portable thread pool is disabled, and some small improvements when it is enabled (some are within the error margin). I wasn't seeing a regression or an improvement before; many things have changed since, and hopefully the improvement will stay.

FortunesPlatform/Fortunes are currently not working. On Windows I was seeing large swings in the numbers without full CPU usage, resulting in very different numbers from Linux; I will try again later on a different machine. I wasn't able to test on arm64, as updating the build with my locally cross-built native binaries doesn't seem to work (even without any changes); I can verify after the change is merged and an SDK containing it is produced.

@kouvel kouvel changed the title from "[WIP] Migrate coreclr's worker thread pool to be able to use the portable thread pool in opt-in fashion" to "Migrate coreclr's worker thread pool to be able to use the portable thread pool in opt-in fashion" Jun 22, 2020

kouvel commented Jun 22, 2020

Microbenchmark perf. The benchmark measures throughput with a very short CPU-intensive delay in each work item, trying to measure mostly the overhead of the thread pool for very short work items, in either sustained or bursty fashion. The burst length is the given number multiplied by the processor count. The benchmark is mostly useful for finding larger regressions; sometimes even large differences don't translate into reality, especially as work items do more work.

Windows x64 8-proc

| Scenario | Clr before | Clr after | Diff | Clr after with PTP | Diff from before |
| --- | --- | --- | --- | --- | --- |
| Global sustained | 9334 | 9301 | -0.3% | 9313 | -0.2% |
| Global 1*proc burst | 1181 | 1183 | 0.2% | 1175 | -0.5% |
| Global 4*proc burst | 3363 | 3420 | 1.7% | 3350 | -0.4% |
| Global 16*proc burst | 5809 | 5898 | 1.5% | 5859 | 0.9% |
| Global 64*proc burst | 6973 | 7019 | 0.7% | 6978 | 0.1% |
| Global 256*proc burst | 7148 | 7264 | 1.6% | 7191 | 0.6% |
| Local sustained | 19968 | 20711 | 3.7% | 21423 | 7.3% |
| Local 1*proc burst | 1174 | 1177 | 0.2% | 1168 | -0.5% |
| Local 4*proc burst | 3684 | 3725 | 1.1% | 3767 | 2.2% |
| Local 16*proc burst | 8705 | 8818 | 1.3% | 9036 | 3.8% |
| Local 64*proc burst | 13433 | 13605 | 1.3% | 14184 | 5.6% |
| Local 256*proc burst | 15435 | 15560 | 0.8% | 16188 | 4.9% |

Linux x64 8-proc VM

| Scenario | Clr before | Clr after | Diff | Clr after with PTP | Diff from before | Mono JIT before | Mono JIT after | Diff | Notes for Mono JIT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Global sustained | 8954 | 8827 | -1.4% | 8747 | -2.3% | 7098 | 7508 | 5.8% | |
| Global 1*proc burst | 121 | 114 | -5.5% | 132 | 9.4% | 35 | 41 | 18.7% | |
| Global 4*proc burst | 1670 | 1696 | 1.5% | 1694 | 1.4% | 104 | 122 | 17.1% | |
| Global 16*proc burst | 4125 | 4111 | -0.3% | 4230 | 2.5% | 3056 | 3195 | 4.6% | |
| Global 64*proc burst | 6005 | 6025 | 0.3% | 6017 | 0.2% | 5054 | 5256 | 4.0% | |
| Global 256*proc burst | 6719 | 6695 | -0.4% | 6765 | 0.7% | 5846 | 6068 | 3.8% | |
| Local sustained | 16499 | 17801 | 7.9% | 19305 | 17.0% | 9328 | 10261 | 10.0% | |
| Local 1*proc burst | 114 | 110 | -4.1% | 125 | 9.5% | 34 | 46 | 36.5% | High error |
| Local 4*proc burst | 1634 | 1698 | 3.9% | 1805 | 10.4% | 153 | 235 | 53.3% | Very high error |
| Local 16*proc burst | 5219 | 5275 | 1.1% | 5518 | 5.7% | 2170 | 3485 | 60.7% | Very high error |
| Local 64*proc burst | 10008 | 10283 | 2.7% | 10845 | 8.4% | 6193 | 6573 | 6.1% | |
| Local 256*proc burst | 12883 | 13395 | 4.0% | 14335 | 11.3% | 7708 | 8362 | 8.5% | |

For the Clr results, the tests with regressions appear to be multimodal; I'm not sure why. I don't think it's significant.

The three tests flagged above seem to have very high error margins under Mono both before and after the change, so they can be ignored. I had collected the Mono perf numbers earlier, when my machine was reporting lower numbers on all runtimes; my machine does that sometimes.

Windows arm64 8-proc

| Scenario | Clr before | Clr after | Diff | Clr after with PTP | Diff from before |
| --- | --- | --- | --- | --- | --- |
| Global sustained | 4600 | 4611 | 0.2% | 4547 | -1.2% |
| Global 1*proc burst | 562 | 575 | 2.3% | 645 | 14.8% |
| Global 4*proc burst | 1458 | 1475 | 1.2% | 1645 | 12.9% |
| Global 16*proc burst | 2684 | 2725 | 1.5% | 2813 | 4.8% |
| Global 64*proc burst | 3298 | 3340 | 1.3% | 3343 | 1.4% |
| Global 256*proc burst | 3468 | 3506 | 1.1% | 3489 | 0.6% |
| Local sustained | 8662 | 8667 | 0.1% | 8858 | 2.3% |
| Local 1*proc burst | 557 | 574 | 2.9% | 644 | 15.7% |
| Local 4*proc burst | 1540 | 1545 | 0.3% | 1792 | 16.4% |
| Local 16*proc burst | 3464 | 3565 | 2.9% | 3748 | 8.2% |
| Local 64*proc burst | 5102 | 5271 | 3.3% | 5234 | 2.6% |
| Local 256*proc burst | 5371 | 5412 | 0.8% | 5509 | 2.6% |

Code:

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;

namespace ThreadPoolWorkThroughput
{
    class Program
    {
        private static void Main(string[] args)
        {
            if (args.Length <= 0)
            {
                Console.WriteLine("Usage: ThreadPoolWorkThroughput <global|local> [burstLengthProcCountMultiplier]");
                return;
            }

            bool preferLocal;
            if ("global".Equals(args[0], StringComparison.OrdinalIgnoreCase))
                preferLocal = false;
            else if ("local".Equals(args[0], StringComparison.OrdinalIgnoreCase))
                preferLocal = true;
            else
            {
                Console.WriteLine("Invalid first parameter");
                return;
            }

            int processorCount = Environment.ProcessorCount;
            int burstLength = 0;
            if (args.Length > 1)
            {
                if (int.TryParse(args[1], out int burstLengthProcCountMultiplier) && burstLengthProcCountMultiplier > 0)
                    burstLength = burstLengthProcCountMultiplier * processorCount;
                else
                {
                    Console.WriteLine("Invalid second parameter");
                    return;
                }
            }

            ThreadPoolWorkThroughput(processorCount, preferLocal, burstLength);
        }

        private static void ThreadPoolWorkThroughput(int threadCount, bool preferLocal, int burstLength)
        {
            if (burstLength > 0 && burstLength < threadCount)
                burstLength = threadCount;

            var startTest = new ManualResetEvent(false);
            var threadOperationCounts = new int[(threadCount + 1) * 16];
            var workItemsScheduled = 0;
            var workItemsCompleted = new AutoResetEvent(false);
            ThreadPool.SetMinThreads(threadCount, threadCount);
            ThreadPool.SetMaxThreads(threadCount, threadCount);

            Action<int> workItem = null;
            workItem = toQueue =>
            {
                bool isSustained = burstLength <= 0;
                do
                {
                    if (isSustained)
                        ++toQueue;
                    else if (toQueue <= 0)
                        break;

                    bool localPreferLocal = preferLocal;
                    do
                    {
                        ThreadPool.UnsafeQueueUserWorkItem(workItem, 0, localPreferLocal);
                    } while (--toQueue > 0);
                } while (false);

                var tld = t_data ?? CreateThreadLocalData();
                Delay(tld);
                ++threadOperationCounts[tld.threadArrayIndex];
                if (!isSustained && Interlocked.Decrement(ref workItemsScheduled) == 0)
                    workItemsCompleted.Set();
            };

            var threadReady = new AutoResetEvent(false);
            Thread producerThread;
            if (burstLength <= 0)
            {
                producerThread = new Thread(() =>
                {
                    bool localPreferLocal = preferLocal;
                    int initialWorkItemCount = threadCount * 8;
                    threadReady.Set();
                    startTest.WaitOne();
                    for (int i = 0; i < initialWorkItemCount; ++i)
                        ThreadPool.UnsafeQueueUserWorkItem(workItem, 1, localPreferLocal);
                });
            }
            else
            {
                producerThread = new Thread(() =>
                {
                    var localThreadCount = threadCount;
                    bool localPreferLocal = preferLocal;
                    var localBurstLength = burstLength;
                    threadReady.Set();
                    startTest.WaitOne();
                    while (true)
                    {
                        Interlocked.Exchange(ref workItemsScheduled, localBurstLength);

                        int toQueueTotal = localBurstLength - localThreadCount;
                        int toQueuePerWorkItem = toQueueTotal <= 0 ? 0 : toQueueTotal / localThreadCount;
                        int toQueueExtra = toQueueTotal <= 0 ? 0 : toQueueTotal - toQueuePerWorkItem * localThreadCount;
                        for (int i = 0; i < localThreadCount; ++i)
                        {
                            int toQueue = toQueuePerWorkItem;
                            if (toQueueExtra > 0)
                            {
                                --toQueueExtra;
                                ++toQueue;
                            }
                            ThreadPool.UnsafeQueueUserWorkItem(workItem, toQueue, localPreferLocal);
                        }

                        workItemsCompleted.WaitOne();
                    }
                });
            }
            producerThread.IsBackground = true;
            producerThread.Start();
            threadReady.WaitOne();

            Run(startTest, threadOperationCounts);
        }

        private static void Run(ManualResetEvent startTest, int[] threadOperationCounts)
        {
            var sw = new Stopwatch();
            int threadCount = threadOperationCounts.Length / 16 - 1;
            var afterWarmupOperationCounts = new long[threadCount];
            var operationCounts = new long[threadCount];
            startTest.Set();

            // Warmup

            Thread.Sleep(1000);

            for (int j = 0; j < 4; ++j)
            {
                for (int i = 0; i < threadCount; ++i)
                    afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];

                // Measure

                sw.Restart();
                Thread.Sleep(500);
                sw.Stop();

                for (int i = 0; i < threadCount; ++i)
                    operationCounts[i] = threadOperationCounts[(i + 1) * 16];
                for (int i = 0; i < threadCount; ++i)
                    operationCounts[i] -= afterWarmupOperationCounts[i];

                double score = operationCounts.Sum() / sw.Elapsed.TotalMilliseconds;
                Console.WriteLine($"Score: {score,15:0.000000}");
            }
        }

        private sealed class ThreadLocalData
        {
            private static int s_previousThreadArrayIndex;

            public int threadArrayIndex = Interlocked.Increment(ref s_previousThreadArrayIndex) * 16;
            public Random rng = new Random();
            public int delayFibSum;
        }

        [ThreadStatic]
        private static ThreadLocalData t_data;

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static ThreadLocalData CreateThreadLocalData()
        {
            var tld = new ThreadLocalData();
            t_data = tld;
            return tld;
        }

        private static void Delay(ThreadLocalData tld) => tld.delayFibSum += Fib(tld.rng.Next(4, 10));

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static int Fib(int n) => n <= 1 ? n : Fib(n - 2) + Fib(n - 1);
    }
}


kouvel commented Jun 22, 2020

General testing done:

  • ThreadPool, Timer, and Thread tests
  • Thread pool features - starvation on worker threads, starvation on IO completion threads, hill climbing
  • Anything that was added like each reroute based on config, worker tracking, etc.
  • SOS ThreadPool -ti -wi, VS/WinDbg/lldb thread views
  • PerfView, perfcollect, dotnet-trace for events (currently doesn't seem to be working for mono), thread type identification from events
  • dotnet-counters, profile comparisons with events and some benchmarks to see that thread pool is behaving similarly
    • perfcollect was not showing EventSource events including existing ones, so when the portable thread pool is enabled those would not show up currently
  • CscRoslynSource


kouvel commented Jun 22, 2020

Looks like the Mono folks are already added; CC'ing some more people.

Koundinya Veluri added 16 commits October 20, 2020 08:04
- For a registered wait that is automatically unregistered (due to `executeOnlyOnce: true`), the registered wait handle gets added to the array of pending removals, and this automatic unregister does not wait for the removal to actually happen
- If shortly after that a user calls `Unregister(null)` on the same registered wait handle, it is supposed to wait for the removal to actually happen, but was not because the handle is already in the array of pending removals
- A `Dispose` on the wait handle shortly after `Unregister` returns would delete the safe handle and `DangerousRelease` upon removal would throw and crash the process
- Fixed by waiting when a registered wait handle is pending removal, regardless of whether the caller of `Unregister` added the handle to the array of pending removals or if it was added by anyone else

kouvel commented Oct 20, 2020

Rebased to fix conflict

Successfully merging this pull request may close these issues.

[mono] Test failed: System.Threading.ThreadPools.Tests.ThreadPoolTests.SetMinMaxThreadsTest_ChangedInDotNetCore