Conversation

@kouvel kouvel commented Jun 22, 2020

  • Enables using the portable thread pool with coreclr as an opt-in. The change is off by default for now and can be enabled with COMPlus_ThreadPool_UsePortableThreadPool=1 (see the example after this list). Once it has had bake time and is seen to be stable, at a reasonable time in the future the config flag would ideally be removed and the relevant parts of the native implementation deleted.
  • The IO thread pool is not being migrated in this change, and remains on the native side
  • My goal was to get behavior compatible with the native implementation in coreclr and with diagnostics tools, and similar perf. I tried to avoid changing scheduling behavior, the behavior of heuristics, etc., compared with that implementation.
  • The eventual goal is to have one mostly managed thread pool implementation that can be shared between runtimes, to ease maintenance going forward
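
For illustration, a minimal sketch of opting a test app into the portable thread pool by setting the environment variable for a launched process. The app name is hypothetical; the only detail taken from this change is the COMPlus_ThreadPool_UsePortableThreadPool=1 opt-in variable.

```csharp
using System.Diagnostics;

class EnablePortableThreadPoolSample
{
    static void Main()
    {
        // "MyApp.dll" is a hypothetical app to launch with the opt-in enabled.
        var psi = new ProcessStartInfo("dotnet", "MyApp.dll") { UseShellExecute = false };

        // Opt in to the portable thread pool (off by default in this change).
        psi.Environment["COMPlus_ThreadPool_UsePortableThreadPool"] = "1";

        using var process = Process.Start(psi);
        process.WaitForExit();
    }
}
```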

Commit descriptions:

  • "Add dependencies"
    • Ported LowLevelLock from CoreRT and moved LowLevelSpinWaiter to shared. Since we support Thread.Interrupt(), they were necessary in the wait subsystem in CoreRT partly to support that, and were also used in the portable thread pool implementation, where a pending thread interrupt on a thread pool thread would otherwise crash the process. Interruptible waits are already used in the managed side of the thread pool in the queue implementations. It may be reasonable to ignore the thread interrupt problem and suggest that Thread.Interrupt() not be used on thread pool threads, but for now I just brought in the dependencies to keep behavior consistent with the native implementation.
  • "Add config var"
    • Added config var COMPlus_ThreadPool_UsePortableThreadPool (disabled by default for now)
    • Flowed the new config var to the managed side and set up a mechanism to flow all of the thread pool config vars
    • Removed debug-only config var COMPlus_ThreadpoolTickCountAdjustment, which didn't seem to be too useful
    • Specialized native and managed thread pool paths based on the config var. Added assertions to paths that should not be reached depending on the config var.
  • "Move portable RegisteredWaitHandle implementation to shared ThreadPool.cs"
    • Just moved the portable implementation, no functional changes. In preparation for merging the two implementations.
  • "Merge RegisteredWaitHandle implementations"
    • Merged implementations of RegisteredWaitHandle using the portable version as the primary and specializing small parts of it for coreclr
    • Fixed PortableThreadPool's registered waits to track SafeWaitHandles instead of WaitHandles similarly to the native implementation. The SafeWaitHandle in a WaitHandle can be modified, so it is retrieved once and reused thereafter. Also added/removed refs for the SafeWaitHandles that are registered.
  • "Separate portable-only portion of RegisteredWaitHandle"
    • Separated RegisteredWaitHandle.UnregisterPortable into a different file, no functional changes. Those paths reference PortableThreadPool, which is conditionally included unlike ThreadPool.cs. Just for consistency such that the new file can be conditionally included similarly to PortableThreadPool.
  • "Fix timers, tiered compilation, introduced time-sensitive work item queue to simulate coreclr behavior"
    • Wired work items queued from the native side (appdomain timer callback, tiered compilation background work callback) to queue them into the managed side
    • The timer thread calls into managed code to queue the callback
    • Some tiered compilation work item queuing paths cannot call managed code, so used a timer with zero due time instead
    • Added a queue of "time-sensitive" work items to the managed side to mimic how work items queued from the native side ran previously. In particular, when using the native thread pool, the native work items still run ahead of a backed-up global queue periodically (based on the Dispatch quantum). They could instead be queued into the global queue, but if it is backed up that could significantly, and perhaps artificially, delay the appdomain timer callback and the tiering background jitting. I didn't want to change the behavior in an observable (and potentially bad) way here for now; a good time to revisit this would be when IO completion handling is added to the portable thread pool, at which point the native work items could be handled somewhat similarly.
  • "Implement ResetThreadPoolThread, set thread names for diagnostics"
    • Aside from implementing ResetThreadPoolThread, setting the thread names (at the OS level) allows debuggers to identify the threads better, as before. For threads that may run user code, the thread Name property is kept null as before, such that it may be set without exception.
  • "Cache-line-separate PortableThreadPool._numRequestedWorkers similarly to coreclr"
    • Was missed before, separated it for consistency
  • "Post wait completions to the IO completion port on Windows for coreclr, similarly to before"
    • On Windows, wait completions are queued to the IO thread pool, which is still implemented on the native side. On Unixes, they are queued to the global queue.
  • "Reroute managed gate thread into unmanaged side to perform gate activities, don't use unmanaged gate thread"
    • When the config var is enabled, removed the gate thread from the native side. Instead, the gate thread on the managed side calls into the native side to perform gate activities for the IO completion thread pool, and returns a value to indicate whether the gate thread is still necessary.
    • Also added a native-to-managed entry point to request the gate thread to run for the IO completion thread pool
  • "Flow config values from CoreCLR to the portable thread pool for compat"
    • Flowed the rest of the thread pool config vars to the managed side, such that COMPlus variables continue to work with the portable thread pool
    • Config var values are stored in AppContext, made the names consistent for supported and unsupported values
  • "Port - ..." * 3
    • Ported a few fixes that did not make it into the portable thread pool implementation
  • "Fix ETW events"
    • Fixed the EventSource used by the portable thread pool, added missing events
    • For now, the event source uses the same name and GUID as the native side. It seems to work for now for ETW, we may switch to a separate provider (along with updating tools) before enabling the portable thread pool by default.
    • For enqueue/dequeue events, changed to use the object's hash code as the work item identifier instead of the pointer since the pointer may change between enqueue and dequeue
  • "Fix perf of counts structs"
    • Structs used for multiple counts with interlocked operations were implemented with explicit struct layout and field offsets. The JIT seems to generate stack-based code for such structs and it was showing up as higher overhead in perf profiles compared to the equivalent native implementation. Slower code in compare-exchange loops can cause a larger gap of time between the read and the compare-exchange, which can also cause higher contention.
    • Changed the structs to use manual bit manipulation instead, and micro-optimized some paths (a rough sketch of the pattern follows the commit list). The code is still not as good as that generated by C++, but it seems to perform similarly based on perf profiles.
    • Code size also improved in many cases, for example one of the larger differences was in MaybeAddWorkingWorker(), which decreased from 585 bytes to 382 bytes and with far fewer stack memory operations
  • "Fix perf of dispatch loop"
    • Just some minor tweaks as I was looking at perf profiles and code of Dispatch()
  • "Fix perf of ThreadInt64PersistentCounter"
    • The implementation used to count completed work items was using ThreadLocal<T>, which turned out to be too slow for that purpose according to perf profiles
    • Changed it to rely on the user of the component to provide an object that tracks the count, which the user obtains from a ThreadStatic field (see the sketch after the commit list)
    • Also removed the thread-local lookup per iteration in one of the hot paths in Dispatch() and improved inlining
  • "Miscellaneous perf fixes"
    • A few more small tweaks as I was looking at perf profiles and code
    • In ConcurrentQueue, added check for empty into the fast path
    • For the portable thread pool, updated to trigger the worker thread Wait event after the short spin-wait completes and before actually waiting; the event is otherwise too verbose when profiling and changes performance characteristics
    • Cache-line-separated the gate thread running state as is done in the native implementation
    • Accessing PortableThreadPool.ThreadPoolInstance multiple times was generating less-than-ideal code that was noticeable in perf profiles. Tried to avoid it, especially in hot paths, and in some cases where it wasn't strictly necessary, for consistency if nothing else.
    • Removed an extra call to Environment.TickCount in Dispatch() per iteration
    • Noticed that a field that was intended to be cache-line-separated was not actually being separated (see #38215, "ThreadPoolWorkQueue.numOutstandingThreadRequests is not being padded as requested, despite the explicit sequential layout"); fixed. A sketch of the padding pattern follows the commit list.
  • "Fix starvation heuristic"
    • Described in comment
  • "Implement worker tracking"
    • Implemented the equivalent in the portable thread pool along with raising the relevant event
  • "Use smaller stack size for threads that don't run user code"
    • Using the same stack size as in the native side for those threads
  • "Note some SOS dependencies, small fixes in hill climbing to make equivalent to coreclr"
  • "Port some tests from CoreRT"
    • Also improved some of the tests
  • "Fail-fast in thread pool native entry points specific to thread pool implementations based on config"
    • Scanned all of the managed-to-native entry points from the thread pool and thread-entry functions, and promoted some assertions to be verified in all builds with fail-fast. May help to know in release builds when a path that should not be taken is taken and to avoid running further along that path.
  • "Fix SetMinThreads() and SetMaxThreads() to return true only when both changes are successful with synchronization"
    • These are a bit awkward when the portable thread pool is enabled: they should return true only when both changes are valid and return false without making any changes otherwise, yet the worker thread pool is on the managed side while the IO thread pool is on the native side
    • Added some managed-to-native entry points to allow checking validity before making the changes, all under a lock taken by the managed side
  • "Fix registered wait removals for fairness since there can be duplicate system wait objects in the wait array"
    • Described in comment
  • "Allow multiple DotNETRuntime event providers/sources in EventPipe"
    • Temporary change to EventPipe to be able to get events from dotnet-trace
    • For now, the event source uses the same name and GUID as the native side. It seems to work for now for ETW, and with this change it seems to work with EventPipe for getting events. Subscribing to the NativeRuntimeEventSource does not get thread pool events yet, that is left for later. We may switch to a separate provider (along with updating tools) before enabling the portable thread pool by default, as a long-term solution.
  • "Fix registered wait handle timeout logic in the wait thread"
    • The timeout logic was comparing against how long the last wait took and sometimes was not timing out waits; fixed to consider the total time since the last reset of the timeout instead (see the sketch after the commit list)
  • "Fix Browser build"
    • Updated the Browser-specific thread pool variant based on the other changes
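
Regarding "Fix perf of counts structs": below is a rough sketch of the manual bit-manipulation pattern with a typical interlocked compare-exchange loop. The type, field names, and bit widths are illustrative only, not the runtime's actual counts structs.

```csharp
using System.Threading;

internal static class WorkerCountsSketch
{
    // Two 16-bit counts packed into one int so both can be read and updated atomically.
    private static int s_counts; // low 16 bits: processing work, high 16 bits: existing threads

    private static int GetNumProcessingWork(int counts) => counts & 0xFFFF;

    private static int WithNumProcessingWork(int counts, int value) =>
        (counts & ~0xFFFF) | value;

    // Typical compare-exchange loop: read once, compute the new packed value, and
    // retry only if another thread changed the counts in between.
    public static void IncrementNumProcessingWork()
    {
        int counts = Volatile.Read(ref s_counts);
        while (true)
        {
            int newCounts = WithNumProcessingWork(counts, GetNumProcessingWork(counts) + 1);
            int countsBeforeUpdate = Interlocked.CompareExchange(ref s_counts, newCounts, counts);
            if (countsBeforeUpdate == counts)
            {
                break;
            }

            counts = countsBeforeUpdate;
        }
    }
}
```

Compared with an explicit-layout struct, the packed-int form keeps the counts in a register, which shortens the window between the read and the compare-exchange.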
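
Regarding "Fix perf of ThreadInt64PersistentCounter": a minimal sketch, with hypothetical names, of the pattern where the user of the counter supplies a per-thread count object obtained from a [ThreadStatic] field, so the hot path avoids a ThreadLocal<T> lookup.

```csharp
using System;
using System.Runtime.CompilerServices;

internal sealed class PerThreadCount
{
    public long Count;
}

internal static class CompletedWorkItemCounterSketch
{
    [ThreadStatic]
    private static PerThreadCount t_count;

    // The dispatch loop would call this once, cache the returned object in a local,
    // and then increment Count directly per completed work item.
    public static PerThreadCount GetOrCreateForCurrentThread() =>
        t_count ?? CreateForCurrentThread();

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static PerThreadCount CreateForCurrentThread()
    {
        var count = new PerThreadCount();
        // A real implementation would also register the object with the counter so that
        // an aggregate total can be computed across threads; omitted here.
        t_count = count;
        return count;
    }
}
```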
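
Regarding the cache-line separation items ("Cache-line-separate PortableThreadPool._numRequestedWorkers", the gate thread running state, and the #38215 fix): a sketch of the padding pattern, assuming a 64-byte cache line and using illustrative names (the runtime has its own padding helpers and platform-specific cache line sizes).

```csharp
using System.Runtime.InteropServices;
using System.Threading;

internal static class CacheLinePaddingSketch
{
    private const int CacheLineSize = 64; // assumption for illustration

    // The contended field is placed a full cache line away from whatever precedes and
    // follows the struct, so frequent interlocked writes don't cause false sharing.
    [StructLayout(LayoutKind.Explicit, Size = CacheLineSize * 3)]
    private struct CacheLineSeparated
    {
        [FieldOffset(CacheLineSize)]
        public int numRequestedWorkers;
    }

    private static CacheLineSeparated s_separated;

    public static void RequestWorker() =>
        Interlocked.Increment(ref s_separated.numRequestedWorkers);
}
```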
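
Regarding "Fix registered wait handle timeout logic in the wait thread": a sketch, with illustrative names, of deriving the remaining timeout from the total time since the timeout was last reset rather than from the duration of the last wait.

```csharp
using System;

internal sealed class RegisteredWaitTimeoutSketch
{
    private readonly int _timeoutMs;   // user-requested timeout for the registered wait
    private int _timeoutStartTimeMs;   // Environment.TickCount at the last timeout reset

    public RegisteredWaitTimeoutSketch(int timeoutMs)
    {
        _timeoutMs = timeoutMs;
        RestartTimeout();
    }

    // Called when the timeout is reset (e.g. when the wait is registered),
    // not after every individual wait.
    public void RestartTimeout() => _timeoutStartTimeMs = Environment.TickCount;

    // The wait thread uses this before each wait; a wait that returns early no longer
    // prevents the registered wait from eventually timing out.
    public int GetRemainingTimeoutMs()
    {
        int elapsedMs = Environment.TickCount - _timeoutStartTimeMs;
        return elapsedMs >= _timeoutMs ? 0 : _timeoutMs - elapsedMs;
    }
}
```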

Corresponding PR to update SOS: dotnet/diagnostics#1274
Fixes #32020

@kouvel kouvel added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) area-System.Threading labels Jun 22, 2020
@kouvel kouvel added this to the 5.0.0 milestone Jun 22, 2020
@kouvel kouvel self-assigned this Jun 22, 2020
@kouvel kouvel added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) and removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) labels Jun 22, 2020

kouvel commented Jun 22, 2020

Corresponding PR to update SOS: dotnet/diagnostics#1274

Change is ready for review, but flagging as NO-MERGE for now, until both PRs are ready.


kouvel commented Jun 22, 2020

Perf data and testing info to follow


kouvel commented Jun 22, 2020

ASP.NET RPS perf results

| Benchmark | Machine | OS | Connections | Clr before | Clr after | Diff | Clr after with PTP | Diff from before | Mono JIT before | Mono JIT after | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextPlatform | 28-proc x64 | Linux | 512 | 14467899 | 14456871 | -0.1% | 14819954 | 2.4% | 9626436 | 9679046 | 0.5% |
| Plaintext | 28-proc x64 | Linux | 512 | 5265058 | 5286626 | 0.4% | 5344409 | 1.5% | 2855862 | 2890546 | 1.2% |
| JsonPlatform | 28-proc x64 | Linux | 512 | 1164882 | 1169727 | 0.4% | 1209226 | 3.8% | 830938 | 842005 | 1.3% |
| Json | 28-proc x64 | Linux | 512 | 954094 | 958050 | 0.4% | 987212 | 3.5% | 619826 | 624145 | 0.7% |

No change when the portable thread pool is disabled, and some small improvements when it is enabled (some are within the error margin). I wasn't seeing a regression or an improvement before; many things have changed since, and hopefully the improvement will stay.

FortunesPlatform/Fortunes are currently not working. On Windows I was seeing large swings in the numbers without full CPU usage, resulting in very different numbers from Linux; I will try again later on a different machine. I wasn't able to test on arm64, as updating the build with my locally cross-built native binaries doesn't seem to work (even without any changes); I can verify after the change is merged and an SDK containing it is produced.

@kouvel kouvel changed the title from "[WIP] Migrate coreclr's worker thread pool to be able to use the portable thread pool in opt-in fashion" to "Migrate coreclr's worker thread pool to be able to use the portable thread pool in opt-in fashion" Jun 22, 2020

kouvel commented Jun 22, 2020

Microbenchmark perf. The benchmark measures throughput with a very short CPU-intensive delay in each work item, trying to measure mostly the overhead of the thread pool for very short work items, in either sustained or bursty fashion. The burst length is the given number multiplied by the processor count. The benchmark is mostly useful for finding larger regressions; sometimes even large differences don't translate into reality, especially as work items do more work.

Windows x64 8-proc

| Scenario | Clr before | Clr after | Diff | Clr after with PTP | Diff from before |
| --- | --- | --- | --- | --- | --- |
| Global sustained | 9334 | 9301 | -0.3% | 9313 | -0.2% |
| Global 1*proc burst | 1181 | 1183 | 0.2% | 1175 | -0.5% |
| Global 4*proc burst | 3363 | 3420 | 1.7% | 3350 | -0.4% |
| Global 16*proc burst | 5809 | 5898 | 1.5% | 5859 | 0.9% |
| Global 64*proc burst | 6973 | 7019 | 0.7% | 6978 | 0.1% |
| Global 256*proc burst | 7148 | 7264 | 1.6% | 7191 | 0.6% |
| Local sustained | 19968 | 20711 | 3.7% | 21423 | 7.3% |
| Local 1*proc burst | 1174 | 1177 | 0.2% | 1168 | -0.5% |
| Local 4*proc burst | 3684 | 3725 | 1.1% | 3767 | 2.2% |
| Local 16*proc burst | 8705 | 8818 | 1.3% | 9036 | 3.8% |
| Local 64*proc burst | 13433 | 13605 | 1.3% | 14184 | 5.6% |
| Local 256*proc burst | 15435 | 15560 | 0.8% | 16188 | 4.9% |

Linux x64 8-proc VM

| Scenario | Clr before | Clr after | Diff | Clr after with PTP | Diff from before | Mono JIT before | Mono JIT after | Diff | Notes for Mono JIT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Global sustained | 8954 | 8827 | -1.4% | 8747 | -2.3% | 7098 | 7508 | 5.8% | |
| Global 1*proc burst | 121 | 114 | -5.5% | 132 | 9.4% | 35 | 41 | 18.7% | |
| Global 4*proc burst | 1670 | 1696 | 1.5% | 1694 | 1.4% | 104 | 122 | 17.1% | |
| Global 16*proc burst | 4125 | 4111 | -0.3% | 4230 | 2.5% | 3056 | 3195 | 4.6% | |
| Global 64*proc burst | 6005 | 6025 | 0.3% | 6017 | 0.2% | 5054 | 5256 | 4.0% | |
| Global 256*proc burst | 6719 | 6695 | -0.4% | 6765 | 0.7% | 5846 | 6068 | 3.8% | |
| Local sustained | 16499 | 17801 | 7.9% | 19305 | 17.0% | 9328 | 10261 | 10.0% | |
| Local 1*proc burst | 114 | 110 | -4.1% | 125 | 9.5% | 34 | 46 | 36.5% | High error |
| Local 4*proc burst | 1634 | 1698 | 3.9% | 1805 | 10.4% | 153 | 235 | 53.3% | Very high error |
| Local 16*proc burst | 5219 | 5275 | 1.1% | 5518 | 5.7% | 2170 | 3485 | 60.7% | Very high error |
| Local 64*proc burst | 10008 | 10283 | 2.7% | 10845 | 8.4% | 6193 | 6573 | 6.1% | |
| Local 256*proc burst | 12883 | 13395 | 4.0% | 14335 | 11.3% | 7708 | 8362 | 8.5% | |

For the Clr results, the tests with regressions appear to be multimodal; I'm not sure why. I don't think it's significant.

The three tests flagged above seem to have very high error margins under Mono both before and after the change, so they can be ignored. I had collected the Mono perf numbers earlier, when my machine was reporting lower numbers on all runtimes; my machine does that sometimes.

Windows arm64 8-proc

| Scenario | Clr before | Clr after | Diff | Clr after with PTP | Diff from before |
| --- | --- | --- | --- | --- | --- |
| Global sustained | 4600 | 4611 | 0.2% | 4547 | -1.2% |
| Global 1*proc burst | 562 | 575 | 2.3% | 645 | 14.8% |
| Global 4*proc burst | 1458 | 1475 | 1.2% | 1645 | 12.9% |
| Global 16*proc burst | 2684 | 2725 | 1.5% | 2813 | 4.8% |
| Global 64*proc burst | 3298 | 3340 | 1.3% | 3343 | 1.4% |
| Global 256*proc burst | 3468 | 3506 | 1.1% | 3489 | 0.6% |
| Local sustained | 8662 | 8667 | 0.1% | 8858 | 2.3% |
| Local 1*proc burst | 557 | 574 | 2.9% | 644 | 15.7% |
| Local 4*proc burst | 1540 | 1545 | 0.3% | 1792 | 16.4% |
| Local 16*proc burst | 3464 | 3565 | 2.9% | 3748 | 8.2% |
| Local 64*proc burst | 5102 | 5271 | 3.3% | 5234 | 2.6% |
| Local 256*proc burst | 5371 | 5412 | 0.8% | 5509 | 2.6% |

Code:

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;

namespace ThreadPoolWorkThroughput
{
    class Program
    {
        private static void Main(string[] args)
        {
            if (args.Length <= 0)
            {
                Console.WriteLine("Usage: ThreadPoolWorkThroughput <global|local> [burstLengthProcCountMultiplier]");
                return;
            }

            bool preferLocal;
            if ("global".Equals(args[0], StringComparison.OrdinalIgnoreCase))
                preferLocal = false;
            else if ("local".Equals(args[0], StringComparison.OrdinalIgnoreCase))
                preferLocal = true;
            else
            {
                Console.WriteLine("Invalid first parameter");
                return;
            }

            int processorCount = Environment.ProcessorCount;
            int burstLength = 0;
            if (args.Length > 1)
            {
                if (int.TryParse(args[1], out int burstLengthProcCountMultiplier) && burstLengthProcCountMultiplier > 0)
                    burstLength = burstLengthProcCountMultiplier * processorCount;
                else
                {
                    Console.WriteLine("Invalid second parameter");
                    return;
                }
            }

            ThreadPoolWorkThroughput(processorCount, preferLocal, burstLength);
        }

        private static void ThreadPoolWorkThroughput(int threadCount, bool preferLocal, int burstLength)
        {
            if (burstLength > 0 && burstLength < threadCount)
                burstLength = threadCount;

            var startTest = new ManualResetEvent(false);
            var threadOperationCounts = new int[(threadCount + 1) * 16];
            var workItemsScheduled = 0;
            var workItemsCompleted = new AutoResetEvent(false);
            ThreadPool.SetMinThreads(threadCount, threadCount);
            ThreadPool.SetMaxThreads(threadCount, threadCount);

            Action<int> workItem = null;
            workItem = toQueue =>
            {
                bool isSustained = burstLength <= 0;
                do
                {
                    if (isSustained)
                        ++toQueue;
                    else if (toQueue <= 0)
                        break;

                    bool localPreferLocal = preferLocal;
                    do
                    {
                        ThreadPool.UnsafeQueueUserWorkItem(workItem, 0, localPreferLocal);
                    } while (--toQueue > 0);
                } while (false);

                var tld = t_data ?? CreateThreadLocalData();
                Delay(tld);
                ++threadOperationCounts[tld.threadArrayIndex];
                if (!isSustained && Interlocked.Decrement(ref workItemsScheduled) == 0)
                    workItemsCompleted.Set();
            };

            var threadReady = new AutoResetEvent(false);
            Thread producerThread;
            if (burstLength <= 0)
            {
                producerThread = new Thread(() =>
                {
                    bool localPreferLocal = preferLocal;
                    int initialWorkItemCount = threadCount * 8;
                    threadReady.Set();
                    startTest.WaitOne();
                    for (int i = 0; i < initialWorkItemCount; ++i)
                        ThreadPool.UnsafeQueueUserWorkItem(workItem, 1, localPreferLocal);
                });
            }
            else
            {
                producerThread = new Thread(() =>
                {
                    var localThreadCount = threadCount;
                    bool localPreferLocal = preferLocal;
                    var localBurstLength = burstLength;
                    threadReady.Set();
                    startTest.WaitOne();
                    while (true)
                    {
                        Interlocked.Exchange(ref workItemsScheduled, localBurstLength);

                        int toQueueTotal = localBurstLength - localThreadCount;
                        int toQueuePerWorkItem = toQueueTotal <= 0 ? 0 : toQueueTotal / localThreadCount;
                        int toQueueExtra = toQueueTotal <= 0 ? 0 : toQueueTotal - toQueuePerWorkItem * localThreadCount;
                        for (int i = 0; i < localThreadCount; ++i)
                        {
                            int toQueue = toQueuePerWorkItem;
                            if (toQueueExtra > 0)
                            {
                                --toQueueExtra;
                                ++toQueue;
                            }
                            ThreadPool.UnsafeQueueUserWorkItem(workItem, toQueue, localPreferLocal);
                        }

                        workItemsCompleted.WaitOne();
                    }
                });
            }
            producerThread.IsBackground = true;
            producerThread.Start();
            threadReady.WaitOne();

            Run(startTest, threadOperationCounts);
        }

        private static void Run(ManualResetEvent startTest, int[] threadOperationCounts)
        {
            var sw = new Stopwatch();
            int threadCount = threadOperationCounts.Length / 16 - 1;
            var afterWarmupOperationCounts = new long[threadCount];
            var operationCounts = new long[threadCount];
            startTest.Set();

            // Warmup

            Thread.Sleep(1000);

            for (int j = 0; j < 4; ++j)
            {
                for (int i = 0; i < threadCount; ++i)
                    afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];

                // Measure

                sw.Restart();
                Thread.Sleep(500);
                sw.Stop();

                for (int i = 0; i < threadCount; ++i)
                    operationCounts[i] = threadOperationCounts[(i + 1) * 16];
                for (int i = 0; i < threadCount; ++i)
                    operationCounts[i] -= afterWarmupOperationCounts[i];

                double score = operationCounts.Sum() / sw.Elapsed.TotalMilliseconds;
                Console.WriteLine($"Score: {score,15:0.000000}");
            }
        }

        private sealed class ThreadLocalData
        {
            private static int s_previousThreadArrayIndex;

            public int threadArrayIndex = Interlocked.Increment(ref s_previousThreadArrayIndex) * 16;
            public Random rng = new Random();
            public int delayFibSum;
        }

        [ThreadStatic]
        private static ThreadLocalData t_data;

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static ThreadLocalData CreateThreadLocalData()
        {
            var tld = new ThreadLocalData();
            t_data = tld;
            return tld;
        }

        private static void Delay(ThreadLocalData tld) => tld.delayFibSum += Fib(tld.rng.Next(4, 10));

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static int Fib(int n) => n <= 1 ? n : Fib(n - 2) + Fib(n - 1);
    }
}


kouvel commented Jun 22, 2020

General testing done:

  • ThreadPool, Timer, and Thread tests
  • Thread pool features - starvation on worker threads, starvation on IO completion threads, hill climbing
  • Anything that was added like each reroute based on config, worker tracking, etc.
  • SOS ThreadPool -ti -wi, VS/WinDbg/lldb thread views
  • PerfView, perfcollect, dotnet-trace for events (currently doesn't seem to be working for mono), thread type identification from events
  • dotnet-counters, profile comparisons with events and some benchmarks to see that thread pool is behaving similarly
    • perfcollect was not showing EventSource events including existing ones, so when the portable thread pool is enabled those would not show up currently
  • CscRoslynSource


kouvel commented Jun 22, 2020

Looks like the Mono folks are already added; CC'ing some more people.

Koundinya Veluri added 16 commits October 20, 2020 08:04
- For a registered wait that is automatically unregistered (due to `executeOnlyOnce: true`), the registered wait handle gets added to the array of pending removals, and this automatic unregister does not wait for the removal to actually happen
- If shortly after that a user calls `Unregister(null)` on the same registered wait handle, it is supposed to wait for the removal to actually happen, but was not because the handle is already in the array of pending removals
- A `Dispose` on the wait handle shortly after `Unregister` returns would delete the safe handle and `DangerousRelease` upon removal would throw and crash the process
- Fixed by waiting when a registered wait handle is pending removal, regardless of whether the caller of `Unregister` added the handle to the array of pending removals or if it was added by anyone else

kouvel commented Oct 20, 2020

Rebased to fix conflict

Successfully merging this pull request may close these issues.

[mono] Test failed: System.Threading.ThreadPools.Tests.ThreadPoolTests.SetMinMaxThreadsTest_ChangedInDotNetCore