Migrate coreclr's worker thread pool to be able to use the portable thread pool in opt-in fashion #38225
Conversation
Corresponding PR to update SOS: dotnet/diagnostics#1274. The change is ready for review, but flagging as NO-MERGE for now, until both PRs are ready.
Perf data and testing info to follow.
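Since the title describes the portable thread pool as opt-in, here is a minimal sketch of how an app might check the opt-in at run time. The config property name `System.Threading.ThreadPool.UsePortableThreadPool` is an assumption for illustration only; the exact switch name is not stated in this thread.

using System;
using System.Threading;

class PortableThreadPoolOptInCheck
{
    static void Main()
    {
        // Assumption: the opt-in is surfaced as a runtimeconfig.json property, e.g.
        //   "configProperties": { "System.Threading.ThreadPool.UsePortableThreadPool": true }
        // AppContext.GetData reads such properties at run time.
        object value = AppContext.GetData("System.Threading.ThreadPool.UsePortableThreadPool");
        Console.WriteLine($"UsePortableThreadPool = {value ?? "(not set)"}");

        // Work is queued through the same public ThreadPool API either way; the opt-in
        // only changes which implementation services the queue.
        using var done = new ManualResetEventSlim();
        ThreadPool.QueueUserWorkItem(_ => done.Set());
        done.Wait();
    }
}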
ASP.NET RPS perf results
No change when the portable thread pool is disabled, and some small improvements when it is enabled (some are within the error margin). I wasn't seeing a regression or an improvement before, but many things have changed since, and hopefully it stays that way. FortunesPlatform/Fortunes are currently not working. On Windows I was seeing large swings in the numbers without full CPU usage, resulting in very different numbers from Linux; I'll try again later on a different machine. I wasn't able to test on arm64, as updating the build with my locally cross-built native binaries doesn't seem to work (even without any changes); I can verify after this is merged and an SDK with the change is produced.
Resolved review threads (marked outdated) on:
src/mono/netcore/System.Private.CoreLib/src/System/Threading/ThreadPool.Browser.Mono.cs (3 threads)
src/mono/netcore/System.Private.CoreLib/System.Private.CoreLib.csproj
Microbenchmark perf. The benchmark measures throughput with a very short CPU-intensive delay in each work item, aiming to measure mostly the thread pool's overhead for very short work items, in either sustained or bursty fashion. The burst length is the given multiplier times the processor count. The benchmark is mostly useful for finding larger regressions; sometimes even large differences don't translate into real-world impact, especially as work items do more work.
Windows x64 8-proc
Linux x64 8-proc VM
For the Clr results, the tests with regressions appear to be multimodal; I'm not sure why, but I don't think it's significant. Those three tests, when running under Mono, seem to have very high error margins before and after the change, so they can be ignored. I had collected the Mono perf numbers earlier, when my machine was reporting lower numbers on all runtimes (it does that sometimes).
Windows arm64 8-proc
Code:

using System;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;

namespace ThreadPoolWorkThroughput
{
    class Program
    {
        private static void Main(string[] args)
        {
            if (args.Length <= 0)
            {
                Console.WriteLine("Usage: ThreadPoolWorkThroughput <global|local> [burstLengthProcCountMultiplier]");
                return;
            }

            bool preferLocal;
            if ("global".Equals(args[0], StringComparison.OrdinalIgnoreCase))
                preferLocal = false;
            else if ("local".Equals(args[0], StringComparison.OrdinalIgnoreCase))
                preferLocal = true;
            else
            {
                Console.WriteLine("Invalid first parameter");
                return;
            }

            int processorCount = Environment.ProcessorCount;
            int burstLength = 0;
            if (args.Length > 1)
            {
                if (int.TryParse(args[1], out int burstLengthProcCountMultiplier) && burstLengthProcCountMultiplier > 0)
                    burstLength = burstLengthProcCountMultiplier * processorCount;
                else
                {
                    Console.WriteLine("Invalid second parameter");
                    return;
                }
            }

            ThreadPoolWorkThroughput(processorCount, preferLocal, burstLength);
        }

        private static void ThreadPoolWorkThroughput(int threadCount, bool preferLocal, int burstLength)
        {
            if (burstLength > 0 && burstLength < threadCount)
                burstLength = threadCount;

            var startTest = new ManualResetEvent(false);
            var threadOperationCounts = new int[(threadCount + 1) * 16]; // stride of 16 ints to avoid false sharing
            var workItemsScheduled = 0;
            var workItemsCompleted = new AutoResetEvent(false);

            ThreadPool.SetMinThreads(threadCount, threadCount);
            ThreadPool.SetMaxThreads(threadCount, threadCount);

            // Each work item optionally queues more work items, runs a short CPU-bound delay,
            // and records an operation count for its thread.
            Action<int> workItem = null;
            workItem = toQueue =>
            {
                bool isSustained = burstLength <= 0;
                do
                {
                    if (isSustained)
                        ++toQueue;
                    else if (toQueue <= 0)
                        break;

                    bool localPreferLocal = preferLocal;
                    do
                    {
                        ThreadPool.UnsafeQueueUserWorkItem(workItem, 0, localPreferLocal);
                    } while (--toQueue > 0);
                } while (false);

                var tld = t_data ?? CreateThreadLocalData();
                Delay(tld);
                ++threadOperationCounts[tld.threadArrayIndex];

                if (!isSustained && Interlocked.Decrement(ref workItemsScheduled) == 0)
                    workItemsCompleted.Set();
            };

            var threadReady = new AutoResetEvent(false);
            Thread producerThread;
            if (burstLength <= 0)
            {
                // Sustained mode: seed the pool once; work items keep requeuing themselves.
                producerThread = new Thread(() =>
                {
                    bool localPreferLocal = preferLocal;
                    int initialWorkItemCount = threadCount * 8;
                    threadReady.Set();
                    startTest.WaitOne();
                    for (int i = 0; i < initialWorkItemCount; ++i)
                        ThreadPool.UnsafeQueueUserWorkItem(workItem, 1, localPreferLocal);
                });
            }
            else
            {
                // Bursty mode: queue a burst of work items, wait for the burst to complete, repeat.
                producerThread = new Thread(() =>
                {
                    var localThreadCount = threadCount;
                    bool localPreferLocal = preferLocal;
                    var localBurstLength = burstLength;
                    threadReady.Set();
                    startTest.WaitOne();
                    while (true)
                    {
                        Interlocked.Exchange(ref workItemsScheduled, localBurstLength);
                        int toQueueTotal = localBurstLength - localThreadCount;
                        int toQueuePerWorkItem = toQueueTotal <= 0 ? 0 : toQueueTotal / localThreadCount;
                        int toQueueExtra = toQueueTotal <= 0 ? 0 : toQueueTotal - toQueuePerWorkItem * localThreadCount;
                        for (int i = 0; i < localThreadCount; ++i)
                        {
                            int toQueue = toQueuePerWorkItem;
                            if (toQueueExtra > 0)
                            {
                                --toQueueExtra;
                                ++toQueue;
                            }
                            ThreadPool.UnsafeQueueUserWorkItem(workItem, toQueue, localPreferLocal);
                        }
                        workItemsCompleted.WaitOne();
                    }
                });
            }

            producerThread.IsBackground = true;
            producerThread.Start();
            threadReady.WaitOne();
            Run(startTest, threadOperationCounts);
        }

        private static void Run(ManualResetEvent startTest, int[] threadOperationCounts)
        {
            var sw = new Stopwatch();
            int threadCount = threadOperationCounts.Length / 16 - 1;
            var afterWarmupOperationCounts = new long[threadCount];
            var operationCounts = new long[threadCount];

            startTest.Set();

            // Warmup
            Thread.Sleep(1000);

            for (int j = 0; j < 4; ++j)
            {
                for (int i = 0; i < threadCount; ++i)
                    afterWarmupOperationCounts[i] = threadOperationCounts[(i + 1) * 16];

                // Measure
                sw.Restart();
                Thread.Sleep(500);
                sw.Stop();

                for (int i = 0; i < threadCount; ++i)
                    operationCounts[i] = threadOperationCounts[(i + 1) * 16];
                for (int i = 0; i < threadCount; ++i)
                    operationCounts[i] -= afterWarmupOperationCounts[i];

                double score = operationCounts.Sum() / sw.Elapsed.TotalMilliseconds;
                Console.WriteLine($"Score: {score,15:0.000000}");
            }
        }

        private sealed class ThreadLocalData
        {
            private static int s_previousThreadArrayIndex;
            public int threadArrayIndex = Interlocked.Increment(ref s_previousThreadArrayIndex) * 16;
            public Random rng = new Random();
            public int delayFibSum;
        }

        [ThreadStatic]
        private static ThreadLocalData t_data;

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static ThreadLocalData CreateThreadLocalData()
        {
            var tld = new ThreadLocalData();
            t_data = tld;
            return tld;
        }

        private static void Delay(ThreadLocalData tld) => tld.delayFibSum += Fib(tld.rng.Next(4, 10));

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static int Fib(int n) => n <= 1 ? n : Fib(n - 2) + Fib(n - 1);
    }
}
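For reference, assuming the code above is built as a standalone console project (the project setup is not spelled out in this thread), `dotnet run -c Release -- local 16` would run the bursty mode with work items preferring local queues and a burst length of 16 × processor count, while `dotnet run -c Release -- global` would run the sustained mode with work items queued to the global queue.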
General testing done:
Looks like the Mono folks are already added; CC'ing some more people.
Resolved review threads (marked outdated) on:
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLock.cs
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadInt64PersistentCounter.cs
src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPoolEventSource.cs (4 threads)
src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/FrameworkEventSource.cs
…ent source name in mono, add comments
- For a registered wait that is automatically unregistered (due to `executeOnlyOnce: true`), the registered wait handle gets added to the array of pending removals, and this automatic unregister does not wait for the removal to actually happen.
- If shortly after that a user calls `Unregister(null)` on the same registered wait handle, it is supposed to wait for the removal to actually happen, but it was not doing so, because the handle is already in the array of pending removals.
- A `Dispose` on the wait handle shortly after `Unregister` returns would delete the safe handle, and the `DangerousRelease` upon removal would throw and crash the process.
- Fixed by waiting when a registered wait handle is pending removal, regardless of whether the caller of `Unregister` added the handle to the array of pending removals or it was added by anyone else.
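For illustration, a minimal sketch of the usage pattern described above (the timing window is narrow, so this shows the sequence of calls rather than a deterministic reproducer; the names are only for the example):

using System.Threading;

class RegisteredWaitUnregisterRace
{
    static void Main()
    {
        var signal = new AutoResetEvent(false);

        // executeOnlyOnce: true means the wait is automatically unregistered after the
        // callback fires; that automatic unregister queues the handle for removal but
        // does not wait for the removal to actually happen.
        RegisteredWaitHandle registration = ThreadPool.RegisterWaitForSingleObject(
            signal,
            (state, timedOut) => { /* work */ },
            state: null,
            millisecondsTimeOutInterval: -1,
            executeOnlyOnce: true);

        signal.Set();     // callback fires, automatic unregister begins
        Thread.Sleep(10); // small delay so the handle is likely already pending removal

        // Unregister(null) is supposed to wait for the removal to complete...
        registration.Unregister(null);

        // ...so that disposing the wait handle here is safe. Per the description above,
        // before the fix Unregister could return early and this Dispose could race with
        // the pending DangerousRelease, crashing the process.
        signal.Dispose();
    }
}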
… supported, fix assert for browser builds
…hem as parameters
Rebased to fix conflict.
Commit descriptions:
Was previously using `ThreadLocal<T>`, which turned out to be too slow for that purpose according to perf profiles.
Corresponding PR to update SOS: dotnet/diagnostics#1274
Fixes #32020
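As a rough illustration of why `ThreadLocal<T>` can be too slow for a hot per-thread counter, here is a sketch of the general `[ThreadStatic]`-based pattern (the same pattern the benchmark above uses with `t_data`); this is not the actual `ThreadInt64PersistentCounter` implementation, which additionally aggregates counts across threads.

using System.Runtime.CompilerServices;

static class PerThreadCounter
{
    [ThreadStatic]
    private static Counter t_counter;

    private sealed class Counter
    {
        public long Count;
    }

    public static void Increment()
    {
        // Hot path: one thread-static field read plus an increment on an already-created
        // object. ThreadLocal<long>.Value adds extra indirection and checks on every call.
        var counter = t_counter ?? CreateCounter();
        counter.Count++;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Counter CreateCounter() => t_counter = new Counter();
}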