New exception handling on win-x86 #113985
Initial assessment of the generated code suggests that we will now benefit from a large range of optimizations which result in smaller code size (on average) and reduced stack space usage. However, these gains may be offset by the GC info size since more methods are now generated as fully interruptible. This needs to be carefully inspected. Largest regression in size is for the methods that use |
@filipnavara Looking forward to this work. It will move us to a place where we can remove the Helper Method Frames in the See #95695. |
Performance testing as of c440a50 shows that we still have bottlenecks in processing of deep stacks. This is not necessarily surprising because the code in

Test code

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

internal class Program
{
    static int exceptionsHandled = 0;

    // Recurse to the requested depth, then throw from the deepest frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i)
    {
        if (i == 0)
        {
            throw new NotImplementedException();
        }
        CallMe(i - 1);
    }

    // Catch the exception at the top of the stack and count it.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe()
    {
        try
        {
            CallMe(10); // DEPTH: 10, 100, etc.
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            CatchMe();
        }
    }

    private static void Main(string[] args)
    {
        int savedExceptionsHandled = 0;
        for (int i = 0; i < 10; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        // Let the worker threads run for one second, then report the count.
        Thread.Sleep(1000);
        savedExceptionsHandled = exceptionsHandled;
        Console.WriteLine($"Exceptions handled: {savedExceptionsHandled}");
        Environment.Exit(0);
    }
}

The test code throws an exception at various stack depths (depending on configuration) and then catches it. This runs in 10 parallel threads and we measure how many exceptions are processed within a second. It does not use a benchmark framework, so the numbers include some warm-up and code tiering. I used a Ryzen 9950X machine to run the test; the runtime and libraries were built in Release configuration. The table below includes the best value out of 5 consecutive runs; the variation between runs was minimal.
(Update: Added the numbers with |
My current plan is to try adding a cache for decoded GC info into
Update: On its own adding the cache to
Update 2: Profiles show that another 10%-15% of the regression is caused by this line: runtime/src/coreclr/vm/exceptionhandling.cpp, line 4023 in 1587221.
We can probably use a cheaper detection at this specific spot (cached GC info would be already decoded, so we know if it's reverse P/Invoke, return addresses in CallDescrWorkerInternal / CallEHFilterFunclet / CallEHFunclet are known).
Update 3: Finally getting to the bottom of this. Seems like the |
I am still working on analyzing the performance so I can paint a complete picture. I've done a couple of micro-optimizations which help reduce the stack walk overhead by around 30% with
Analysis with a sampling profiler turned out to be difficult. Firstly, the new EH is faster at shallow call stacks and slower at deep call stacks. The implication is that the extra overhead is likely in the stack walking (and
However, here are a few observations:
|
Finally I got a profiler tool appropriate for the job. With Intel PIN it's possible to get relatively accurate instruction counts spent in different methods and diff them against the baseline: perf_diff.txt. It needs a bit of manual analysis since the code paths between the two runtime builds are quite different. It looks like the best bang-for-the-buck optimizations to try are the following:
|
I've managed to get 2 PRs out addressing performance issues.

PR #114580 adds a cache around the behavior of EECodeInfo::IsFunclet. I'm not entirely convinced adding ad hoc caches like this is worth it, but it is a nice performance win that should accrue to all platforms, especially the Windows platforms.

PR #114582 adds a set of micro-optimizations found by examining the code generated for the Windows X86 platform. Some of these optimizations are isolated to the funclets work, but more of them are general performance improvements to this logic. Again, the optimizations are most important for Windows X86, but should have positive impacts on other platforms as well.

With each of these optimizations on its own, I see a performance improvement of roughly 15%, and with them combined I see performance getting within about 10% of the baseline .NET 9 scenario on Windows X86. If we are able to take both of them, I don't see a reason to hold this feature back based on performance. One detail to be aware of is that since we are changing logic significantly, the current set of profile guided optimization data for Windows X86 will be out of date when we turn this feature on, and it will take weeks to months before we have new data to fix that issue.

The issues that I plan to look at next are the debugging scenarios, which unfortunately are almost all testable only via internal test suites. I hope to find some time next week to take a look at them. Notably, I have significant concerns around the behavior of mixed mode stacks, throwing and catching across managed/native boundaries, and similar cases. The contract between the runtime and the debugger has historically been fairly fragile in this area, and changing EH models entirely may cause a large problem.

In addition, we'll need to bump the R2R version numbers so that old, non-funclet code won't be accepted by the new runtime, and the issues around the exception handler registrations need to be understood. |
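The general idea of putting a cache in front of an expensive per-method lookup, as mentioned above for EECodeInfo::IsFunclet, can be illustrated with a small direct-mapped cache. This is only a sketch under assumptions, not the code from PR #114580: the real implementation is C++ inside the runtime, and `DecodedInfo`, `DecodedInfoCache`, and the `decode` callback are hypothetical names made up for this example.

```csharp
using System;

// Hypothetical stand-in for the result of an expensive per-method lookup.
public readonly record struct DecodedInfo(bool IsFunclet, bool IsReversePInvoke);

// Minimal direct-mapped cache: each key maps to exactly one slot, and a
// colliding key simply overwrites the previous entry.
// Not thread-safe; this sketch assumes one cache instance per thread.
public sealed class DecodedInfoCache
{
    private readonly (nuint Key, DecodedInfo Value, bool Valid)[] _slots;
    private readonly Func<nuint, DecodedInfo> _decode;

    // Number of lookups that had to call the expensive decoder.
    public int Misses { get; private set; }

    public DecodedInfoCache(int slotCount, Func<nuint, DecodedInfo> decode)
    {
        if (slotCount <= 0 || (slotCount & (slotCount - 1)) != 0)
            throw new ArgumentException("Slot count must be a power of two.", nameof(slotCount));
        _slots = new (nuint, DecodedInfo, bool)[slotCount];
        _decode = decode ?? throw new ArgumentNullException(nameof(decode));
    }

    public DecodedInfo Get(nuint codeAddress)
    {
        // Drop the low 4 bits (assumed to carry little information for
        // aligned code) and mask the rest into the table size.
        int index = (int)((codeAddress >> 4) & (nuint)(_slots.Length - 1));
        ref var slot = ref _slots[index];
        if (slot.Valid && slot.Key == codeAddress)
            return slot.Value; // cache hit, no decoding needed

        Misses++;
        slot = (codeAddress, _decode(codeAddress), true);
        return slot.Value;
    }
}
```

A stack walk that repeatedly visits the same frames, as in the deep recursion benchmark, would hit the same slot over and over, which is where such a cache pays off. The trade-off is the one raised in the comment: each ad hoc cache adds state that has to stay consistent with the data it shadows.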
Firstly, huge thanks for looking into the performance issues. There's some small overlap with the optimizations I was testing locally but the two PRs really move the needle. Along with #114496, which was merged yesterday, we're looking at roughly +15% instruction count on funclets vs baseline (.NET 10 main) for deep stack exception propagation. There are some smaller micro-optimizations which can reduce that even further, probably worth pursuing but not essential. Notably, the performance actually increases significantly for exception propagation in shallow stack traces. Once all the PRs are reviewed and merged I can run the benchmarks and share some graphs / tables. Notably, #14170 also improved the scenario by a small bit on the non-funclet runtime, so compared to .NET 9 we are now within ~10%, as stated above. I agree with the sentiment that this seems like an acceptable trade-off since we get improved performance and code quality in many other areas and scenarios, including more common exception handling use cases.
Help on this is much appreciated! I only did very limited tests with netcoredbg to ensure I'm not completely breaking the world, but that doesn't cover any mixed native/managed exceptions.
Right, I'll bolt that on to the commit in #113576 so we don't forget about it. |
Microbenchmarks from dotnet/performance

Tested with commit 199ae88 (main) and the x86funclets branch rebased on top. Funclets are faster on all of them. None of the benchmarks exercise the pathological paths with deep stack traces. The "deep" stack trace benchmarks are only 10 levels deep.
|
Deep stack trace benchmark

Source code

static int exceptionsHandled = 0;

[MethodImpl(MethodImplOptions.NoInlining)]
private static void CallMe(int i, NotImplementedException inEx)
{
    if (i == 0)
    {
        throw inEx;
    }
    CallMe(i - 1, inEx);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void CatchMe(NotImplementedException inEx, int depth = 100)
{
    try
    {
        CallMe(depth, inEx);
    }
    catch (NotImplementedException)
    {
        Interlocked.Increment(ref exceptionsHandled);
    }
}

[Params(1, 10, 100, 1000)]
public int Depth;

[Benchmark]
public void CatchMe1000x100()
{
    CatchMe(new NotImplementedException(), Depth);
}

Main:
Funclets:
|
Handled exceptions/second metric

This is essentially the same non-scientific benchmark shared in #113985 (comment). We run 10 threads, each of which throws an exception deep in the call stack and catches it at the top of the stack. This shows pretty much the worst case scenario (at depth 100) and also the improvements for the more common scenarios. It also shows that we don't introduce any significant bottleneck by taking some additional lock.

Source code

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

public class Program
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i, NotImplementedException inEx)
    {
        if (i == 0)
        {
            throw inEx;
        }
        CallMe(i - 1, inEx);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe(NotImplementedException inEx, int depth = 100)
    {
        try
        {
            CallMe(depth, inEx);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            NotImplementedException inEx = new NotImplementedException();
            CatchMe(inEx, 100);
        }
    }

    public static void Main(string[] args)
    {
        // "action" was not defined in the posted snippet; reading it from the
        // first command-line argument is an assumption ("st" = single thread).
        string action = args.Length > 0 ? args[0] : "mt";
        exceptionsHandled = 0;
        Stopwatch sw = Stopwatch.StartNew();
        int savedExceptionsHandled = 0;
        int maxThreads = action == "st" ? 1 : 10;
        for (int i = 0; i < maxThreads; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        Thread.Sleep(1000);
        savedExceptionsHandled = Volatile.Read(ref exceptionsHandled);
        sw.Stop();
        Console.WriteLine($"Exceptions handled: {savedExceptionsHandled}");
        Console.WriteLine($"Time: {sw.Elapsed}");
        // Scale the count to exactly one second; 1000L avoids int overflow.
        Console.WriteLine($"Normalized: {(int)(savedExceptionsHandled * 1000L / sw.ElapsedMilliseconds)}");
        Environment.Exit(0);
    }
}

Depth: 100
Main: 386,344 ex/sec

Depth: 50
Main: 599,575 ex/sec

Depth: 10
Main: 931,671 ex/sec
|
@davidwrighton, would it be possible to get this validated in the current preview time-frame? It will allow us to remove a few more HMFs from jithelpers. |
.NET 9 shipped with new exception handling, based on the NativeAOT model, and with NativeAOT runtime support for the win-x86 platform. It left us in a position where CoreCLR on win-x86 is the only remaining platform that uses the non-funclet exception model, and also the only platform that didn't get the new exception handling. @jkotas did some initial work updating CoreCLR to use funclets and new exception handling on x86. Based on this initial work I updated the prototype in #113576 to successfully pass the CoreCLR Pri0 and Libraries tests.
The ultimate goal would be to remove all non-funclet code from the JIT and VM in order to simplify the code base. It may be aligned with the removal of the legacy exception handling as part of a huge cleanup. That will, however, require commitment from all the stakeholders and a solid plan for how to make the feature reach the product.
Until such a plan and commitment exist, I propose to upstream the relevant changes from the prototype to the point where enabling the feature is just a matter of flipping a compile-time switch.
I'll use this issue in the next couple of weeks to post observations based on the prototype.
Implementation check list (incomplete):
- FEATURE_EH_FUNCLET builds to the runtime (Add support for building runtime with FEATURE_EH_FUNCLETS on win-x86 #114157)
- VSD_ResolveWorker to mirror ThePreStub (Ref: Add support for building runtime with FEATURE_EH_FUNCLETS on win-x86 #114157)