New exception handling on win-x86 #113985

filipnavara · 2025-03-27T21:08:17Z

.NET 9 shipped with new exception handling, based on the NativeAOT model, and with NativeAOT runtime support for the win-x86 platform. That left CoreCLR on win-x86 as the only platform that still uses the non-funclet exception model, and also the only platform that didn't get the new exception handling. @jkotas did some initial work updating CoreCLR to use funclets and new exception handling on x86. Based on this initial work I updated the prototype in #113576 to successfully pass the CoreCLR Pri0 and Libraries tests.

The ultimate goal would be to remove all non-funclet code from the JIT and VM in order to simplify the code base. It may be aligned with the removal of the legacy exception handling as part of a larger cleanup. That will, however, require commitment from all the stakeholders and a solid plan for how to get the feature into the product.

Until such a plan and commitment exist, I propose to upstream the relevant changes from the prototype to the point where enabling it only requires flipping a compile-time switch.

I'll use this issue in the next couple of weeks to post observations based on the prototype.

Implementation check list (incomplete):

Add win-x86 code paths for FEATURE_EH_FUNCLETS builds to the runtime (Add support for building runtime with FEATURE_EH_FUNCLETS on win-x86 #114157)
Check whether we need to update exception handler registrations around VSD_ResolveWorker to mirror ThePreStub (Ref: Add support for building runtime with FEATURE_EH_FUNCLETS on win-x86 #114157)
Investigate and document performance issues
Update documentation

filipnavara · 2025-03-27T21:10:54Z

Initial assessment of the generated code suggests that we will now benefit from a large range of optimizations, which result in smaller code size (on average) and reduced stack space usage. However, these gains may be offset by the GC info size, since more methods are now generated as fully interruptible. This needs to be carefully inspected.

The largest size regression is in methods that use MethodImplOptions.Synchronized, because they now get expanded to use finally blocks with Monitor.Exit, like on other platforms.
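To make the size cost concrete, here is a minimal sketch of what a synchronized instance method is semantically equivalent to. The `Counter` class and its method names are hypothetical, and the exact code the JIT emits is not shown in this thread; the sketch only illustrates why an extra try/finally region (and thus a finally funclet) appears.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical example type, used only for illustration.
public class Counter
{
    private int _value;

    // The runtime holds the monitor on `this` for the whole method body.
    [MethodImpl(MethodImplOptions.Synchronized)]
    public int IncrementSynchronized()
    {
        return ++_value;
    }

    // Hand-written equivalent of the same semantics: acquire the monitor,
    // run the body inside a try block, and release it in a finally block.
    public int IncrementExpanded()
    {
        bool lockTaken = false;
        try
        {
            Monitor.Enter(this, ref lockTaken);
            return ++_value;
        }
        finally
        {
            if (lockTaken)
            {
                Monitor.Exit(this);
            }
        }
    }
}

public static class SynchronizedDemo
{
    public static void Main()
    {
        var counter = new Counter();
        counter.IncrementSynchronized();
        Console.WriteLine(counter.IncrementExpanded()); // prints 2
    }
}
```

Both methods behave the same from the caller's point of view; the second form simply spells out the protected region that a funclet-based code generator has to materialize.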

filipnavara · 2025-03-27T21:11:29Z

cc @dotnet/jit-contrib @jkotas @janvorli

AaronRobinsonMSFT · 2025-03-27T22:33:42Z

@filipnavara Looking forward to this work. It will move us to a place where we can remove the Helper Method Frames in the IL_Throw and IL_Rethrow jithelpers.

See #95695.

filipnavara · 2025-04-01T11:28:21Z

Performance testing as of c440a50 shows that we still have bottlenecks in processing of deep stacks. This is not necessarily surprising because the code in EECodeManager for x86 was going to great length to optimize stack unwinding and reduce unnecessary GC info decoding.

Test code

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

internal class Program
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i)
    {
        if (i == 0)
        {
            throw new NotImplementedException();
        }

        CallMe(i - 1);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe()
    {
        try
        {
            CallMe(10); // DEPTH: 10, 100, etc.
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            CatchMe();
        }
    }

    private static void Main(string[] args)
    {
        int savedExceptionsHandled = 0;
        for (int i = 0; i < 10; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        Thread.Sleep(1000);
        savedExceptionsHandled = exceptionsHandled;
        Console.WriteLine($"Exceptions handled: {savedExceptionsHandled}");
        Environment.Exit(0);
    }
}

The test code throws an exception at various stack depths (depending on configuration) and then catches it. This runs on 10 parallel threads, and we measure how many exceptions are processed within one second. It doesn't use a benchmark framework, so the measurement includes some warm-up and code tiering. I used a Ryzen 9950X machine to run the test; the runtime and libraries were built in Release configuration. The table below shows the best value out of 5 consecutive runs; the variation between runs was minimal.

Stack depth	10	100
Old EH	961,717	360,500
New EH	1,198,822	180,385
New EH (w/o `PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME`)	1,404,827	223,177

(Update: Added the numbers with PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME disabled which is closer to the legacy x86 code paths. This is just meant to measure the impact of always unwinding an extra frame and taking some less optimized code paths. It doesn't work or pass the runtime tests as is.)

filipnavara · 2025-04-02T09:44:20Z

> This is not necessarily surprising because the code in EECodeManager for x86 was going to great length to optimize stack unwinding and reduce unnecessary GC info decoding.

My current plan is to try adding a cache for decoded GC info to EECodeInfo (instead of the existing CodeManState structure; for x86 only).

Update: On its own, adding the cache to EECodeInfo yields a very modest improvement (~5%). It would yield much bigger improvements if it was paired with PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME, but making that configuration work is a substantial effort. One thing that did stand out in the profiles is frequent calls to EECodeManager::GetFunctionSize, particularly from JitTokenToMethodRegionInfo, which in turn was called from a call chain originating in EECodeInfo::Init. This essentially means that the GC info for each method was almost always decoded at least twice. We can either refactor this to populate the cache early or rewrite GetFunctionSize to decode just the very first word in the x86 GC info. The latter option yields another 10% savings.
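The "decoded at least twice" problem can be illustrated with a small C# sketch. This is not the runtime's C++ code: `CodeInfoSketch`, `DecodedHeader`, and the decode delegate are hypothetical stand-ins, and the only point is to show how a per-lookup cache turns two decodes into one.

```csharp
using System;

// Hypothetical stand-in for decoded GC info; not a runtime type.
public readonly struct DecodedHeader
{
    public DecodedHeader(uint methodSize) { MethodSize = methodSize; }
    public uint MethodSize { get; }
}

// Hypothetical per-method lookup object that optionally caches the decoded header.
public sealed class CodeInfoSketch
{
    private readonly Func<DecodedHeader> _decode;
    private readonly bool _useCache;
    private DecodedHeader _cached;
    private bool _hasCached;

    public CodeInfoSketch(Func<DecodedHeader> decode, bool useCache)
    {
        _decode = decode;
        _useCache = useCache;
    }

    // Number of times the (expensive) decode step actually ran.
    public int DecodeCount { get; private set; }

    public uint GetFunctionSize()
    {
        if (_useCache && _hasCached)
        {
            return _cached.MethodSize;
        }

        DecodeCount++;
        _cached = _decode();
        _hasCached = true;
        return _cached.MethodSize;
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        Func<DecodedHeader> decode = () => new DecodedHeader(128);

        var uncached = new CodeInfoSketch(decode, useCache: false);
        uncached.GetFunctionSize(); // first consumer, e.g. initialization
        uncached.GetFunctionSize(); // second consumer, e.g. unwinding

        var cached = new CodeInfoSketch(decode, useCache: true);
        cached.GetFunctionSize();
        cached.GetFunctionSize();

        // prints "uncached decodes: 2, cached decodes: 1"
        Console.WriteLine($"uncached decodes: {uncached.DecodeCount}, cached decodes: {cached.DecodeCount}");
    }
}
```

The alternative mentioned above, decoding only the first word to answer the size query, avoids the cache entirely for that one call site; which of the two is preferable depends on how many other consumers need the full decoded data.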

Update 2: Profiles show that another 10%-15% of the regression is caused by this line:

runtime/src/coreclr/vm/exceptionhandling.cpp

Line 4023 in 1587221

    
           if (!ExecutionManager::IsManagedCode(GetIP(pThis->m_crawl.GetRegisterSet()->pCallerContext)))

We can probably use a cheaper detection at this specific spot: the cached GC info would already be decoded, so we know whether it's a reverse P/Invoke, and the return addresses in CallDescrWorkerInternal / CallEHFilterFunclet / CallEHFunclet are known.

Update 3: Finally getting to the bottom of this. It seems that the StackTraceArray reallocations cause the GC to kick in, which stops the world. This is particularly visible in the multi-threaded benchmark. However, at least part of the speed regression exists in the single-threaded test as well, but it's difficult to capture with a profiler since the ETW events start to skew the timings.

filipnavara · 2025-04-08T17:01:45Z

I am still working on analyzing the performance so I can paint a complete picture.

I've done a couple of micro-optimizations which help reduce the stack walk overhead by around 30% with FEATURE_EH_FUNCLETS builds. The biggest one is submitted as PR #114170. I've also tried to micro-optimize a few hot code paths (EECodeManager::IsFunclet 466dd7d, context copies d033b7b).

Analysis with a sampling profiler turned out to be difficult. Firstly, the new EH is faster at shallow call stacks and slower at deep call stacks. The implication is that the extra overhead is likely in the stack walking (and AppendExceptionStackFrame). I focused on profiling code that throws an exception in a 100-level-deep call stack and then catches it. At the high level I see that some things take 30% more time, but at the low level the building blocks are usually so fast that their running time is much smaller than the period of the sampling profiler, so it's hard to measure accurately.

However, here are a few observations:

Copying the CONTEXT structure from pCalleeContext to pCallerContext has measurable overhead. The structure is huge but we only track the integer+control registers so this is quite wasteful.
A runtime compiled WITHOUT FEATURE_EH_FUNCLETS but with REGDISPLAY structure changed to match the version with FEATURE_EH_FUNCLETS incurs significant overhead in some scenarios. While I didn't see much difference in the single threaded scenarios, in the multithreaded ones the performance was nearly twice as slow (using the "exceptions handled / second" metric for 10 threads continually throwing and catching exceptions over 100-deep call stack). I didn't study it in detail but I know that StackTraceInfo::AppendElement triggers a garbage collection quite often in this test case and overhead in stack walking contributes significantly to the stop-the-world time, so it's plausible that this could explain the regression.
With PR Replace CodeManState with cache in EECodeInfo for x86. #114170 we do roughly the same number of GC info decodings on FEATURE_EH_FUNCLETS runtime as the baseline one. Same goes for the number of stack unwind calls. This means we didn't introduce any structural error but unfortunately the individual calls are still so fast that I have trouble accurately measuring whether they are slower.
Making SfiNext + AppendExceptionStackFrame an FCall makes a minuscule difference (~1% give or take), so it's unlikely that P/Invoke transitions are a significant contributor to the cost.

filipnavara · 2025-04-09T21:41:27Z

I finally got a profiling tool appropriate for the job. With Intel PIN it's possible to get relatively accurate instruction counts spent in different methods and diff them against the baseline: perf_diff.txt. It needs a bit of manual analysis since the code paths between the two runtime builds are quite different.

Looks like the best bang for the buck optimizations to try are the following:

Use NOTIFY_ON_U2M_TRANSITIONS flag in SfiInit and invert the logic in SfiNext to detect unwinding into unmanaged frames. (Use NOTIFY_ON_U2M_TRANSITIONS in SfiInit/SfiNext #114496)
Get PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME working with new EH. There are quite a few places that just assume the caller context in REGDISPLAY is valid without checking.

davidwrighton · 2025-04-11T23:28:35Z

I've managed to get 2 PRs out addressing performance issues.

PR #114580 adds a cache around the behavior of EECodeInfo::IsFunclet. I'm not entirely convinced adding ad hoc caches like this is worth it, but it is a nice performance win that should accrue to all platforms, although especially the Windows platforms.

PR #114582 adds a set of micro-optimizations found by examining the code generated for the Windows X86 platform. Some of these optimizations are isolated to the funclets work, but more of them are general performance improvements to this logic. Again, the optimizations are most important for Windows X86, but should have positive impacts on other platforms as well.

With each of these optimizations on its own, I see roughly 15% performance improvements, and with them combined I see performance getting within about 10% of the baseline .NET 9 scenario on Windows X86. If we are able to take both of them, I don't see a reason to hold this feature back based on performance. One detail to be aware of is that since we are changing logic significantly, the current set of profile guided optimization data for Windows X86 will be out of date when we turn this feature on, and it will take weeks to months before we have new data to fix that issue.

The issues that I plan to look at next week are the debugging scenarios, which unfortunately are mostly all only testable via internal test suites. I hope to find some time next week to take a look at them. Notably, I have significant concerns around the behavior of mixed mode stacks/throwing and catching across managed/native boundaries, and such. The contract between the runtime/debugger has historically been fairly fragile in this area, and changing EH models entirely may cause a large problem.

In addition, we'll need to bump the R2R version numbers so that old, non-funclet code won't be accepted by the new runtime, and the issues around the exception handler registrations need to be understood.

filipnavara · 2025-04-12T08:21:48Z

> I've managed to get 2 PRs out addressing performance issues.

Firstly, huge thanks for looking into the performance issues. There's some small overlap with the optimizations I was testing locally, but the two PRs really move the needle. Along with #114496, which was merged yesterday, we're looking at roughly +15% instruction count on funclets vs baseline (.NET 10 main) for deep stack exception propagation. There are some smaller micro-optimizations which can reduce that even further, probably worth pursuing but not essential. Notably, the performance actually increases significantly for exception propagation in shallow stack traces. Once all the PRs are reviewed and merged I can run the benchmarks and share some graphs / tables. Notably, #114170 also slightly improved the scenario on the non-funclet runtime, so compared to .NET 9 we are now within ~10%, as stated above. I agree with the sentiment that this seems like an acceptable trade-off since we get improved performance and code quality in many other areas and scenarios, including more common exception handling use cases.

> The issues that I plan to look at next week are the debugging scenarios, which unfortunately are mostly all only testable via internal test suites.

Help on this is much appreciated! I only did very limited tests with netcoredbg to ensure I'm not completely breaking the world, but that doesn't cover any mixed native/managed exceptions.

> In addition, we'll need to bump the R2R version numbers...

Right, I'll bolt that on to the commit in #113576 so we don't forget about it.

filipnavara · 2025-04-24T05:16:43Z

Microbenchmarks from dotnet/performance

Tested with commit 199ae88 (main) and the x86funclets branch rebased on top.

Funclets are faster on all of them. None of the benchmarks exercise the pathological paths with deep stack traces. The "deep" stack trace benchmarks are only 10 levels deep.

Method	Job	Branch	kind	Mean	Error	StdDev	Median	Min	Max	Ratio	MannWhitney(1%)	Gen0	Allocated	Alloc Ratio
ThrowAndCatch	Job-ICNAGN	x86funclets	Software	1.180 us	0.0077 us	0.0072 us	1.177 us	1.171 us	1.193 us	0.21	Faster	0.0328	172 B	1.00
ThrowAndCatch	Job-JLGMJO	main	Software	5.500 us	0.0378 us	0.0353 us	5.494 us	5.441 us	5.570 us	1.00	Baseline	0.0216	172 B	1.00

ThrowAndCatch_ManyCatchBlocks	Job-ICNAGN	x86funclets	Software	1.339 us	0.0037 us	0.0032 us	1.340 us	1.331 us	1.343 us	0.23	Faster	0.0322	172 B	1.00
ThrowAndCatch_ManyCatchBlocks	Job-JLGMJO	main	Software	5.706 us	0.0277 us	0.0231 us	5.713 us	5.643 us	5.738 us	1.00	Baseline	0.0226	172 B	1.00

ThrowAndCatchFinally	Job-ICNAGN	x86funclets	Software	1.183 us	0.0043 us	0.0040 us	1.182 us	1.178 us	1.191 us	0.22	Faster	0.0283	172 B	1.00
ThrowAndCatchFinally	Job-JLGMJO	main	Software	5.442 us	0.0317 us	0.0296 us	5.436 us	5.391 us	5.495 us	1.00	Baseline	0.0215	172 B	1.00

ThrowAndCatchWhen	Job-ICNAGN	x86funclets	Software	1.256 us	0.0064 us	0.0060 us	1.257 us	1.245 us	1.265 us	0.23	Faster	0.0301	172 B	1.00
ThrowAndCatchWhen	Job-JLGMJO	main	Software	5.520 us	0.0274 us	0.0256 us	5.519 us	5.476 us	5.558 us	1.00	Baseline	0.0218	172 B	1.00

ThrowAndCatchWhenFinally	Job-ICNAGN	x86funclets	Software	1.256 us	0.0067 us	0.0063 us	1.256 us	1.245 us	1.266 us	0.23	Faster	0.0298	172 B	1.00
ThrowAndCatchWhenFinally	Job-JLGMJO	main	Software	5.556 us	0.0343 us	0.0321 us	5.548 us	5.505 us	5.619 us	1.00	Baseline	0.0218	172 B	1.00

ThrowAndCatchDeep	Job-ICNAGN	x86funclets	Software	3.602 us	0.0131 us	0.0116 us	3.599 us	3.586 us	3.628 us	0.51	Faster	0.1282	692 B	1.00
ThrowAndCatchDeep	Job-JLGMJO	main	Software	7.123 us	0.0362 us	0.0338 us	7.118 us	7.075 us	7.194 us	1.00	Baseline	0.1128	692 B	1.00

ThrowAndCatchDeepRecursive	Job-ICNAGN	x86funclets	Software	3.831 us	0.0128 us	0.0120 us	3.830 us	3.814 us	3.853 us	0.53	Faster	0.1224	692 B	1.00
ThrowAndCatchDeepRecursive	Job-JLGMJO	main	Software	7.283 us	0.0296 us	0.0277 us	7.276 us	7.235 us	7.326 us	1.00	Baseline	0.1160	692 B	1.00

MultipleNestedTryCatch_FirstCatches	Job-ICNAGN	x86funclets	Software	1.178 us	0.0032 us	0.0028 us	1.179 us	1.173 us	1.184 us	0.22	Faster	0.0328	172 B	1.00
MultipleNestedTryCatch_FirstCatches	Job-JLGMJO	main	Software	5.448 us	0.0175 us	0.0146 us	5.447 us	5.421 us	5.476 us	1.00	Baseline	0.0216	172 B	1.00

MultipleNestedTryCatch_LastCatches	Job-ICNAGN	x86funclets	Software	1.294 us	0.0057 us	0.0053 us	1.294 us	1.286 us	1.303 us	0.23	Faster	0.0310	172 B	1.00
MultipleNestedTryCatch_LastCatches	Job-JLGMJO	main	Software	5.648 us	0.0315 us	0.0294 us	5.659 us	5.604 us	5.688 us	1.00	Baseline	0.0222	172 B	1.00

MultipleNestedTryFinally	Job-ICNAGN	x86funclets	Software	1.336 us	0.0066 us	0.0061 us	1.337 us	1.328 us	1.346 us	0.23	Faster	0.0324	172 B	1.00
MultipleNestedTryFinally	Job-JLGMJO	main	Software	5.726 us	0.0433 us	0.0405 us	5.730 us	5.647 us	5.794 us	1.00	Baseline	0.0225	172 B	1.00

CatchAndRethrowDeep	Job-ICNAGN	x86funclets	Software	15.519 us	0.0499 us	0.0417 us	15.522 us	15.433 us	15.583 us	0.26	Faster	0.1236	812 B	1.00
CatchAndRethrowDeep	Job-JLGMJO	main	Software	59.260 us	0.4628 us	0.4329 us	59.243 us	58.495 us	59.876 us	1.00	Baseline	-	812 B	1.00

CatchAndThrowOtherDeep	Job-ICNAGN	x86funclets	Software	19.625 us	0.1129 us	0.1056 us	19.621 us	19.466 us	19.837 us	0.32	Faster	0.3129	1892 B	1.00
CatchAndThrowOtherDeep	Job-JLGMJO	main	Software	61.255 us	0.3391 us	0.3172 us	61.209 us	60.603 us	61.619 us	1.00	Baseline	0.2413	1892 B	1.00

TryAndFinallyDeep	Job-ICNAGN	x86funclets	Software	4.918 us	0.0231 us	0.0216 us	4.919 us	4.875 us	4.964 us	0.60	Faster	0.1180	692 B	1.00
TryAndFinallyDeep	Job-JLGMJO	main	Software	8.200 us	0.0668 us	0.0592 us	8.185 us	8.142 us	8.341 us	1.00	Baseline	0.1303	692 B	1.00

TryAndCatchDeep_CaugtAtTheTop	Job-ICNAGN	x86funclets	Software	4.729 us	0.0242 us	0.0189 us	4.730 us	4.699 us	4.757 us	0.61	Faster	0.1306	692 B	1.00
TryAndCatchDeep_CaugtAtTheTop	Job-JLGMJO	main	Software	7.791 us	0.0450 us	0.0399 us	7.777 us	7.736 us	7.882 us	1.00	Baseline	0.1225	692 B	1.00

ThrowAndCatch	Job-ICNAGN	x86funclets	Hardware	3.585 us	0.0180 us	0.0168 us	3.584 us	3.560 us	3.618 us	0.63	Faster	0.0282	172 B	1.00
ThrowAndCatch	Job-JLGMJO	main	Hardware	5.662 us	0.0184 us	0.0172 us	5.658 us	5.639 us	5.701 us	1.00	Baseline	0.0222	172 B	1.00

ThrowAndCatch_ManyCatchBlocks	Job-ICNAGN	x86funclets	Hardware	3.730 us	0.0140 us	0.0124 us	3.728 us	3.707 us	3.756 us	0.63	Faster	0.0295	172 B	1.00
ThrowAndCatch_ManyCatchBlocks	Job-JLGMJO	main	Hardware	5.939 us	0.0181 us	0.0151 us	5.935 us	5.924 us	5.972 us	1.00	Baseline	0.0235	172 B	1.00

ThrowAndCatchFinally	Job-ICNAGN	x86funclets	Hardware	3.552 us	0.0134 us	0.0125 us	3.552 us	3.532 us	3.580 us	0.63	Faster	0.0279	172 B	1.00
ThrowAndCatchFinally	Job-JLGMJO	main	Hardware	5.630 us	0.0364 us	0.0341 us	5.628 us	5.587 us	5.698 us	1.00	Baseline	0.0221	172 B	1.00

ThrowAndCatchWhen	Job-ICNAGN	x86funclets	Hardware	3.623 us	0.0187 us	0.0166 us	3.625 us	3.587 us	3.647 us	0.63	Faster	0.0286	172 B	1.00
ThrowAndCatchWhen	Job-JLGMJO	main	Hardware	5.712 us	0.0395 us	0.0370 us	5.733 us	5.657 us	5.762 us	1.00	Baseline	0.0224	172 B	1.00

ThrowAndCatchWhenFinally	Job-ICNAGN	x86funclets	Hardware	3.639 us	0.0178 us	0.0166 us	3.634 us	3.622 us	3.670 us	0.63	Faster	0.0288	172 B	1.00
ThrowAndCatchWhenFinally	Job-JLGMJO	main	Hardware	5.735 us	0.0187 us	0.0175 us	5.733 us	5.703 us	5.765 us	1.00	Baseline	0.0225	172 B	1.00

ThrowAndCatchDeep	Job-ICNAGN	x86funclets	Hardware	6.005 us	0.0236 us	0.0220 us	5.999 us	5.976 us	6.045 us	0.82	Faster	0.1199	692 B	1.00
ThrowAndCatchDeep	Job-JLGMJO	main	Hardware	7.301 us	0.0300 us	0.0280 us	7.306 us	7.245 us	7.334 us	1.00	Baseline	0.1150	692 B	1.00

ThrowAndCatchDeepRecursive	Job-ICNAGN	x86funclets	Hardware	6.260 us	0.0227 us	0.0212 us	6.259 us	6.219 us	6.305 us	0.84	Faster	0.1235	692 B	1.00
ThrowAndCatchDeepRecursive	Job-JLGMJO	main	Hardware	7.428 us	0.0245 us	0.0229 us	7.433 us	7.388 us	7.464 us	1.00	Baseline	0.1175	692 B	1.00

MultipleNestedTryCatch_FirstCatches	Job-ICNAGN	x86funclets	Hardware	3.556 us	0.0186 us	0.0174 us	3.555 us	3.528 us	3.582 us	0.63	Faster	0.0280	172 B	1.00
MultipleNestedTryCatch_FirstCatches	Job-JLGMJO	main	Hardware	5.646 us	0.0182 us	0.0170 us	5.650 us	5.617 us	5.673 us	1.00	Baseline	0.0221	172 B	1.00

MultipleNestedTryCatch_LastCatches	Job-ICNAGN	x86funclets	Hardware	3.687 us	0.0172 us	0.0160 us	3.679 us	3.670 us	3.714 us	0.63	Faster	0.0291	172 B	1.00
MultipleNestedTryCatch_LastCatches	Job-JLGMJO	main	Hardware	5.879 us	0.0225 us	0.0210 us	5.880 us	5.833 us	5.910 us	1.00	Baseline	0.0231	172 B	1.00

MultipleNestedTryFinally	Job-ICNAGN	x86funclets	Hardware	3.683 us	0.0197 us	0.0184 us	3.680 us	3.652 us	3.712 us	0.62	Faster	0.0291	172 B	1.00
MultipleNestedTryFinally	Job-JLGMJO	main	Hardware	5.920 us	0.0228 us	0.0202 us	5.918 us	5.884 us	5.961 us	1.00	Baseline	0.0232	172 B	1.00

CatchAndRethrowDeep	Job-ICNAGN	x86funclets	Hardware	18.051 us	0.0764 us	0.0715 us	18.053 us	17.930 us	18.172 us	0.30	Faster	0.1435	812 B	1.00
CatchAndRethrowDeep	Job-JLGMJO	main	Hardware	60.568 us	0.1999 us	0.1772 us	60.599 us	60.311 us	60.777 us	1.00	Baseline	-	812 B	1.00

CatchAndThrowOtherDeep	Job-ICNAGN	x86funclets	Hardware	21.770 us	0.0543 us	0.0508 us	21.767 us	21.690 us	21.865 us	0.35	Faster	0.3463	1892 B	1.00
CatchAndThrowOtherDeep	Job-JLGMJO	main	Hardware	61.553 us	0.3008 us	0.2511 us	61.467 us	61.262 us	62.120 us	1.00	Baseline	0.2441	1892 B	1.00

TryAndFinallyDeep	Job-ICNAGN	x86funclets	Hardware	7.182 us	0.0257 us	0.0228 us	7.181 us	7.133 us	7.218 us	0.85	Faster	0.1141	692 B	1.00
TryAndFinallyDeep	Job-JLGMJO	main	Hardware	8.489 us	0.0238 us	0.0223 us	8.490 us	8.447 us	8.529 us	1.00	Baseline	0.1005	692 B	1.00

TryAndCatchDeep_CaugtAtTheTop	Job-ICNAGN	x86funclets	Hardware	7.105 us	0.0306 us	0.0271 us	7.103 us	7.072 us	7.158 us	0.88	Faster	0.1123	692 B	1.00
TryAndCatchDeep_CaugtAtTheTop	Job-JLGMJO	main	Hardware	8.079 us	0.0295 us	0.0276 us	8.079 us	8.036 us	8.127 us	1.00	Baseline	0.1270	692 B	1.00

ThrowAndCatch	Job-ICNAGN	x86funclets	ReflectionSoftware	6.111 us	0.0294 us	0.0261 us	6.115 us	6.047 us	6.155 us	0.52	Faster	0.0969	540 B	1.00
ThrowAndCatch	Job-JLGMJO	main	ReflectionSoftware	11.742 us	0.0505 us	0.0448 us	11.739 us	11.646 us	11.827 us	1.00	Baseline	0.0923	540 B	1.00

ThrowAndCatch_ManyCatchBlocks	Job-ICNAGN	x86funclets	ReflectionSoftware	6.191 us	0.0355 us	0.0332 us	6.177 us	6.128 us	6.239 us	0.52	Faster	0.0989	540 B	1.00
ThrowAndCatch_ManyCatchBlocks	Job-JLGMJO	main	ReflectionSoftware	11.958 us	0.0711 us	0.0630 us	11.932 us	11.886 us	12.102 us	1.00	Baseline	0.0946	540 B	1.00

ThrowAndCatchDeep	Job-ICNAGN	x86funclets	ReflectionSoftware	8.399 us	0.0775 us	0.0725 us	8.366 us	8.315 us	8.589 us	0.63	Faster	0.1677	972 B	1.00
ThrowAndCatchDeep	Job-JLGMJO	main	ReflectionSoftware	13.331 us	0.0576 us	0.0538 us	13.326 us	13.231 us	13.412 us	1.00	Baseline	0.1585	972 B	1.00

ThrowAndCatchDeepRecursive	Job-ICNAGN	x86funclets	ReflectionSoftware	8.664 us	0.0288 us	0.0270 us	8.671 us	8.598 us	8.694 us	0.64	Faster	0.1738	972 B	1.00
ThrowAndCatchDeepRecursive	Job-JLGMJO	main	ReflectionSoftware	13.483 us	0.0682 us	0.0638 us	13.472 us	13.394 us	13.611 us	1.00	Baseline	0.1593	972 B	1.00

ThrowAndCatch	Job-ICNAGN	x86funclets	ReflectionHardware	8.433 us	0.0378 us	0.0335 us	8.432 us	8.373 us	8.496 us	0.71	Faster	0.1005	540 B	1.00
ThrowAndCatch	Job-JLGMJO	main	ReflectionHardware	11.957 us	0.0591 us	0.0553 us	11.938 us	11.892 us	12.069 us	1.00	Baseline	0.0945	540 B	1.00

ThrowAndCatch_ManyCatchBlocks	Job-ICNAGN	x86funclets	ReflectionHardware	8.597 us	0.0356 us	0.0333 us	8.588 us	8.557 us	8.664 us	0.70	Faster	0.1020	540 B	1.00
ThrowAndCatch_ManyCatchBlocks	Job-JLGMJO	main	ReflectionHardware	12.307 us	0.0683 us	0.0639 us	12.303 us	12.213 us	12.422 us	1.00	Baseline	0.0968	540 B	1.00

ThrowAndCatchDeep	Job-ICNAGN	x86funclets	ReflectionHardware	10.828 us	0.0385 us	0.0360 us	10.824 us	10.737 us	10.881 us	0.80	Faster	0.1728	972 B	1.00
ThrowAndCatchDeep	Job-JLGMJO	main	ReflectionHardware	13.512 us	0.0582 us	0.0544 us	13.521 us	13.404 us	13.587 us	1.00	Baseline	0.1612	972 B	1.00

ThrowAndCatchDeepRecursive	Job-ICNAGN	x86funclets	ReflectionHardware	11.031 us	0.0435 us	0.0363 us	11.036 us	10.965 us	11.097 us	0.80	Faster	0.1743	972 B	1.00
ThrowAndCatchDeepRecursive	Job-JLGMJO	main	ReflectionHardware	13.704 us	0.0720 us	0.0673 us	13.713 us	13.596 us	13.807 us	1.00	Baseline	0.1623	972 B	1.00

filipnavara · 2025-04-24T05:31:12Z

Deep stack trace benchmark

Source code

using System;
using System.Runtime.CompilerServices;
using System.Threading;
using BenchmarkDotNet.Attributes;

// The wrapper class name is illustrative; the original snippet shows only the members.
public class DeepStackTraceBenchmark
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i, NotImplementedException inEx)
    {
        if (i == 0)
        {
            throw inEx;
        }

        CallMe(i - 1, inEx);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe(NotImplementedException inEx, int depth = 100)
    {
        try
        {
            CallMe(depth, inEx);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    [Params(1, 10, 100, 1000)]
    public int Depth;

    [Benchmark]
    public void CatchMe1000x100()
    {
        CatchMe(new NotImplementedException(), Depth);
    }
}

Main:

Method	Depth	Mean	Error	StdDev
CatchMe1000x100	1	5.636 us	0.0139 us	0.0130 us
CatchMe1000x100	10	7.029 us	0.0163 us	0.0145 us
CatchMe1000x100	100	20.581 us	0.0601 us	0.0562 us
CatchMe1000x100	1000	153.568 us	0.4248 us	0.3974 us

Funclets:

Method	Depth	Mean	Error	StdDev
CatchMe1000x100	1	1.488 us	0.0035 us	0.0033 us
CatchMe1000x100	10	3.657 us	0.0115 us	0.0102 us
CatchMe1000x100	100	23.606 us	0.1366 us	0.1211 us
CatchMe1000x100	1000	220.143 us	0.7091 us	0.6633 us
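To read the two tables side by side, the following sketch computes the funclets/main time ratio at each depth and a rough marginal cost per additional frame between depth 100 and depth 1000. The numbers are copied from the tables above; the `DepthScaling` class and the "marginal cost per frame" framing are my own illustration, not something measured directly.

```csharp
using System;
using System.Globalization;

public static class DepthScaling
{
    // Mean times in microseconds, copied from the Main and Funclets tables.
    public static readonly int[] Depths = { 1, 10, 100, 1000 };
    public static readonly double[] MainUs = { 5.636, 7.029, 20.581, 153.568 };
    public static readonly double[] FuncletsUs = { 1.488, 3.657, 23.606, 220.143 };

    // Ratio below 1.0 means funclets are faster at that depth.
    public static double Ratio(int i) => FuncletsUs[i] / MainUs[i];

    // Average extra time per additional frame between two measured depths.
    public static double PerFrameUs(double[] meansUs, int from, int to) =>
        (meansUs[to] - meansUs[from]) / (Depths[to] - Depths[from]);

    public static void Main()
    {
        var inv = CultureInfo.InvariantCulture;

        // Ratios come out at roughly 0.26, 0.52, 1.15 and 1.43.
        for (int i = 0; i < Depths.Length; i++)
        {
            Console.WriteLine(string.Format(inv, "depth {0,4}: funclets/main = {1:F2}", Depths[i], Ratio(i)));
        }

        // Roughly 0.148 us per frame on main and 0.218 us per frame with funclets.
        Console.WriteLine(string.Format(inv, "main per frame: {0:F3} us", PerFrameUs(MainUs, 2, 3)));
        Console.WriteLine(string.Format(inv, "funclets per frame: {0:F3} us", PerFrameUs(FuncletsUs, 2, 3)));
    }
}
```

Read this way, the funclet build has a much lower fixed cost per throw but a higher cost per unwound frame, which is consistent with the earlier observation that the remaining overhead sits in stack walking. The crossover falls somewhere between depth 10 and depth 100 in this data.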

filipnavara · 2025-04-24T05:43:08Z

Handled exceptions/second metric

This is essentially the same non-scientific benchmark shared in #113985 (comment). We run 10 threads, each of which throws an exception deep in the call stack and catches it at the top.

This is showing pretty much the worst case scenario (at depth 100) and also the improvements for the more common scenarios. It also shows that we don't introduce any significant bottleneck by taking some additional lock.

Source code

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

public class Program
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i, NotImplementedException inEx)
    {
        if (i == 0)
        {
            throw inEx;
        }

        CallMe(i - 1, inEx);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe(NotImplementedException inEx, int depth = 100)
    {
        try
        {
            CallMe(depth, inEx);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            NotImplementedException inEx = new NotImplementedException();
            CatchMe(inEx, 100);
        }
    }

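    public static void Main(string[] args)
    {
        exceptionsHandled = 0;

        // The original snippet referenced an undefined `action`; reading it from the
        // first command-line argument ("st" = single-threaded) is an assumption.
        string action = args.Length > 0 ? args[0] : "mt";

        Stopwatch sw = Stopwatch.StartNew();
        int savedExceptionsHandled = 0;
        int maxThreads = action == "st" ? 1 : 10;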
        for (int i = 0; i < maxThreads; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        Thread.Sleep(1000);
        savedExceptionsHandled = Volatile.Read(ref exceptionsHandled);
        sw.Stop();
        Console.WriteLine($"Exceptions handled: {savedExceptionsHandled}");
        Console.WriteLine($"Time: {sw.Elapsed}");
        // Multiply as long so counts above ~2.1 million do not overflow int.
        Console.WriteLine($"Normalized: {(int)(savedExceptionsHandled * 1000L / sw.ElapsedMilliseconds)}");
        Environment.Exit(0);
    }
}

Depth: 100

Main: 386,344 ex/sec
.NET 9: 378,816 ex/sec
Funclets: 343,814 ex/sec

Depth: 50

Main: 599,575 ex/sec
Funclets: 646,922 ex/sec

Depth: 10

Main: 931,671 ex/sec
Funclets: 1,645,489 ex/sec

am11 · 2025-05-02T22:02:05Z

> The issues that I plan to look at next week are the debugging scenarios, which unfortunately are mostly all only testable via internal test suites. I hope to find some time next week to take a look at them. Notably, I have significant concerns around the behavior of mixed mode stacks/throwing and catching across managed/native boundaries, and such. The contract between the runtime/debugger has historically been fairly fragile in this area, and changing EH models entirely may cause a large problem.

@davidwrighton, would it be possible to get this validated in the current preview time-frame? It will allow us to remove a few more HMFs from jithelpers.


filipnavara mentioned this issue Apr 2, 2025

Add support for building runtime with FEATURE_EH_FUNCLETS on win-x86 #114157

Merged

filipnavara mentioned this issue Apr 13, 2025

Use [UN]INSTALL_MANAGED_EXCEPTION_DISPATCHER_EX to backpatch CallDescrWorkerInternal SEH record on x86/funclets #114600

Merged

am11 mentioned this issue Apr 23, 2025

[RuntimeAsync] Feedback from the merging PR in the runtime repo. dotnet/runtimelab#3095

Merged
