New exception handling on win-x86 #113985

Open · 3 of 4 tasks

filipnavara opened this issue Mar 27, 2025 · 13 comments
Labels: area-ExceptionHandling-coreclr, untriaged (New issue has not been triaged by the area owner)

Comments

@filipnavara
Member

filipnavara commented Mar 27, 2025

.NET 9 shipped with new exception handling based on the NativeAOT model, and with NativeAOT runtime support for the win-x86 platform. That left CoreCLR on win-x86 as the only remaining platform that uses the non-funclet exception model, and also the only platform that didn't get the new exception handling. @jkotas did some initial work updating CoreCLR to use funclets and the new exception handling on x86. Based on this initial work, I updated the prototype in #113576 to successfully pass the CoreCLR Pri0 and Libraries tests.

The ultimate goal would be to remove all non-funclet code from the JIT and the VM in order to simplify the code base. It may be aligned with the removal of the legacy exception handling as part of a larger cleanup. That will, however, require commitment from all the stakeholders and a solid plan for how to get the feature into the product.

Until such a plan and commitment exist, I propose upstreaming the relevant changes from the prototype to the point where enabling the feature is just a matter of flipping a compile-time switch.

I'll use this issue in the next couple of weeks to post observations based on the prototype.


Implementation checklist (incomplete):

@dotnet-issue-labeler bot added the needs-area-label label on Mar 27, 2025
@dotnet-policy-service bot added the untriaged label on Mar 27, 2025
@filipnavara added the area-ExceptionHandling-coreclr label and removed the needs-area-label label on Mar 27, 2025
@filipnavara
Member Author

Initial assessment of the generated code suggests that we will now benefit from a wide range of optimizations which result in smaller code size (on average) and reduced stack space usage. However, these gains may be offset by the GC info size, since more methods are now generated as fully interruptible. This needs to be inspected carefully.

The largest size regression is in methods that use MethodImplOptions.Synchronized, because they now get expanded to use finally blocks with Monitor.Exit, as on other platforms.
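For illustration, the Synchronized expansion is roughly equivalent to writing the lock manually. This is only a sketch of the pattern, not the exact code the JIT emits (for example, a static synchronized method locks on the declaring type rather than on this):

using System.Runtime.CompilerServices;
using System.Threading;

internal class SynchronizedExample
{
    // Original form: the runtime takes a lock around the whole method body.
    [MethodImpl(MethodImplOptions.Synchronized)]
    public void DoWork()
    {
        // ... method body ...
    }

    // Rough equivalent of the expansion: an ordinary try/finally region
    // with Monitor.Enter/Monitor.Exit around the body.
    public void DoWorkExpanded()
    {
        bool lockTaken = false;
        try
        {
            Monitor.Enter(this, ref lockTaken);
            // ... method body ...
        }
        finally
        {
            if (lockTaken)
            {
                Monitor.Exit(this);
            }
        }
    }
}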

@filipnavara
Member Author

cc @dotnet/jit-contrib @jkotas @janvorli

@AaronRobinsonMSFT
Member

@filipnavara Looking forward to this work. It will move us to a place where we can remove the Helper Method Frames in the IL_Throw and IL_Rethrow jithelpers.

See #95695.

@filipnavara
Member Author

filipnavara commented Apr 1, 2025

Performance testing as of c440a50 shows that we still have bottlenecks when processing deep stacks. This is not necessarily surprising because the code in EECodeManager for x86 went to great lengths to optimize stack unwinding and reduce unnecessary GC info decoding.

Test code
using System.Diagnostics;
using System.Runtime.CompilerServices;

internal class Program
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i)
    {
        if (i == 0)
        {
            throw new NotImplementedException();
        }

        CallMe(i - 1);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe()
    {
        try
        {
            CallMe(10); // DEPTH: 10, 100, etc.
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            CatchMe();
        }
    }

    private static void Main(string[] args)
    {
        int savedExceptionsHandled = 0;
        for (int i = 0; i < 10; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        Thread.Sleep(1000);
        savedExceptionsHandled = exceptionsHandled;
        Console.WriteLine($"Exceptions handled: {savedExceptionsHandled}");
        Environment.Exit(0);
    }
}

The test code throws an exception at various depths of the stack (depending on configuration) and then catches it. This is run on 10 parallel threads and we measure how many exceptions were processed within a second. It's not using a benchmark framework, so it includes some warm-up and code tiering. I used a Ryzen 9950X machine to run the test; the runtime and libraries were built in the Release configuration. The table below includes the best value out of 5 consecutive runs; the variation between runs was minimal.

| Stack depth | 10 | 100 |
|---|---|---|
| Old EH | 961,717 | 360,500 |
| New EH | 1,198,822 | 180,385 |
| New EH (w/o PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME) | 1,404,827 | 223,177 |

(Update: Added the numbers with PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME disabled, which is closer to the legacy x86 code paths. This is just meant to measure the impact of always unwinding an extra frame and taking some less optimized code paths; it doesn't work or pass the runtime tests as is.)

@filipnavara
Member Author

filipnavara commented Apr 2, 2025

This is not necessarily surprising because the code in EECodeManager for x86 went to great lengths to optimize stack unwinding and reduce unnecessary GC info decoding.

My current plan is to try adding a cache for decoded GC info to EECodeInfo (instead of the existing CodeManState structure; for x86 only).

Update: On its own, adding the cache to EECodeInfo yields a very modest improvement (~5%). It would yield much bigger improvements if it were paired with PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME, but making that configuration work is a substantial effort. One thing that did stand out in the profiles is frequent calls to EECodeManager::GetFunctionSize, particularly from JitTokenToMethodRegionInfo, which in turn was called from a call chain originating in EECodeInfo::Init. This essentially means that the GC info for each method was almost always decoded at least twice. We can either refactor this to populate the cache early or rewrite GetFunctionSize to decode just the very first word of the x86 GC info. The latter option yields another 10% savings.

Update 2: Profiles show that another 10%-15% of the regression is caused by this line:

if (!ExecutionManager::IsManagedCode(GetIP(pThis->m_crawl.GetRegisterSet()->pCallerContext)))

We can probably use a cheaper detection at this specific spot (the cached GC info would already be decoded, so we know whether it's a reverse P/Invoke, and the return addresses in CallDescrWorkerInternal / CallEHFilterFunclet / CallEHFunclet are known).

Update 3: Finally getting to the bottom of this. It seems that the StackTraceArray reallocations cause the GC to kick in, which stops the world. This is particularly visible in the multi-threaded benchmark. However, at least part of the speed regression exists in the single-threaded test as well, but it's difficult to capture with a profiler since the ETW events start to skew the timings.
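As a quick sanity check of that hypothesis, GC activity during exception propagation can be counted with GC.CollectionCount around the throw/catch loop. This is a minimal sketch (the class and method names are made up for illustration, not code from the prototype):

using System;

internal static class GcPressureCheck
{
    // Throws and catches an exception over a call stack of the given depth
    // and reports how many collections happened during the loop.
    public static void Run(int iterations, int depth)
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);

        for (int i = 0; i < iterations; i++)
        {
            try
            {
                Recurse(depth);
            }
            catch (NotImplementedException)
            {
                // Swallow; we only care about the GC counters.
            }
        }

        Console.WriteLine($"Gen0: {GC.CollectionCount(0) - gen0}, " +
                          $"Gen1: {GC.CollectionCount(1) - gen1}, " +
                          $"Gen2: {GC.CollectionCount(2) - gen2}");
    }

    private static void Recurse(int depth)
    {
        if (depth == 0)
        {
            throw new NotImplementedException();
        }

        Recurse(depth - 1);
    }
}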

@filipnavara
Member Author

filipnavara commented Apr 8, 2025

I am still working on analyzing the performance so I can paint a complete picture.

I've done a couple of micro-optimizations which help reduce the stack walking overhead by around 30% on FEATURE_EH_FUNCLETS builds. The biggest one is submitted as PR #114170. I've also tried to micro-optimize a few hot code paths (EECodeManager::IsFunclet 466dd7d, context copies d033b7b).

Analysis with a sampling profiler turned out to be difficult. Firstly, the new EH is faster on shallow call stacks and slower on deep call stacks, which implies that the extra overhead is likely in the stack walking (and AppendExceptionStackFrame). I focused on profiling code that throws an exception in a 100-level deep call stack and then catches it. At a high level I see that some things take 30% more time, but at a low level the building blocks are usually so fast that their running time is much smaller than the sampling period of the profiler, so they are hard to measure accurately.

However, here are a few observations:

  • Copying the CONTEXT structure from pCalleeContext to pCallerContext has measurable overhead. The structure is huge, but we only track the integer and control registers, so this is quite wasteful.
  • A runtime compiled WITHOUT FEATURE_EH_FUNCLETS but with the REGDISPLAY structure changed to match the FEATURE_EH_FUNCLETS version incurs significant overhead in some scenarios. While I didn't see much difference in the single-threaded scenarios, in the multi-threaded ones the performance was nearly twice as slow (using the "exceptions handled / second" metric for 10 threads continually throwing and catching exceptions over a 100-deep call stack). I didn't study it in detail, but I know that StackTraceInfo::AppendElement triggers a garbage collection quite often in this test case, and overhead in stack walking contributes significantly to the stop-the-world time, so it's plausible that this could explain the regression.
  • With PR Replace CodeManState with cache in EECodeInfo for x86 (#114170) we do roughly the same number of GC info decodings on a FEATURE_EH_FUNCLETS runtime as on the baseline one. The same goes for the number of stack unwind calls. This means we didn't introduce any structural error, but unfortunately the individual calls are still so fast that I have trouble accurately measuring whether they are slower.
  • Making SfiNext + AppendExceptionStackFrame an FCall makes a minuscule difference (~1%, give or take), so it's unlikely that P/Invoke transitions are a significant contributor to the cost.

@filipnavara
Member Author

filipnavara commented Apr 9, 2025

Finally, I got a profiling tool appropriate for the job. With Intel PIN it's possible to get relatively accurate instruction counts spent in different methods and diff them against the baseline: perf_diff.txt. It needs a bit of manual analysis since the code paths between the two runtime builds are quite different.

It looks like the best bang-for-the-buck optimizations to try are the following:

  • Use the NOTIFY_ON_U2M_TRANSITIONS flag in SfiInit and invert the logic in SfiNext to detect unwinding into unmanaged frames. (Use NOTIFY_ON_U2M_TRANSITIONS in SfiInit/SfiNext #114496)
  • Get PROCESS_EXPLICIT_FRAME_BEFORE_MANAGED_FRAME working with the new EH. There are quite a few places that just assume the caller context in REGDISPLAY is valid without checking.

@davidwrighton
Member

davidwrighton commented Apr 11, 2025

I've managed to get 2 PRs out addressing performance issues.

PR #114580 adds a cache around the behavior of EECodeInfo::IsFunclet. I'm not entirely convinced adding ad hoc caches like this is worth it, but it is a nice performance win that should accrue to all platforms, although especially the Windows platforms.

PR #114582 adds a set of micro-optimizations found by examining the code generated for the Windows X86 platform. Some of these optimizations are isolated to the funclets work, but more of them are general performance improvements to this logic. Again, the optimizations are most important for Windows X86, but should have positive impacts on other platforms as well.

With both of these optimizations I see performance improvements of around 15%, and with them combined I see performance getting within about 10% of the baseline .NET 9 scenario on Windows X86. If we are able to take both of them, I don't see a reason to hold this feature back based on performance. One detail to be aware of is that since we are changing the logic significantly, the current set of profile-guided optimization data for Windows X86 will be out of date when we turn this feature on, and it will take weeks to months before we have new data to fix that issue.

The issues that I plan to look at next week are the debugging scenarios, which unfortunately are mostly testable only via internal test suites. I hope to find some time next week to take a look at them. Notably, I have significant concerns around the behavior of mixed-mode stacks, throwing and catching across managed/native boundaries, and such. The contract between the runtime and the debugger has historically been fairly fragile in this area, and changing EH models entirely may cause a large problem.

In addition, we'll need to bump the R2R version numbers so that old, non-funclet code won't be accepted by the new runtime, and the issues around the exception handler registrations need to be understood.

@filipnavara
Member Author

I've managed to get 2 PRs out addressing performance issues.

Firstly, huge thanks for looking into the performance issues. There's some small overlap with the optimizations I was testing locally, but the two PRs really move the needle. Along with #114496, which was merged yesterday, we're looking at roughly +15% instruction count on funclets vs. the baseline (.NET 10 main) for deep-stack exception propagation. There are some smaller micro-optimizations which can reduce that even further, probably worth pursuing but not essential. Notably, performance actually increases significantly for exception propagation in shallow stack traces. Once all the PRs are reviewed and merged I can run the benchmarks and share some graphs / tables. Note that #114170 also improved the scenario by a small bit on the non-funclet runtime, so comparing to .NET 9 we are now within ~10%, as stated above. I agree with the sentiment that this seems like an acceptable trade-off since we get improved performance and code quality in many other areas and scenarios, including more common exception handling use cases.

The issues that I plan to look at next week are the debugging scenarios, which unfortunately are mostly testable only via internal test suites.

Help on this is much appreciated! I only did very limited tests with netcoredbg to ensure I'm not completely breaking the world, but that doesn't cover any mixed native/managed exceptions.

In addition, we'll need to bump the R2R version numbers...

Right, I'll bolt that on to the commit in #113576 so we don't forget about it.

@filipnavara
Member Author

Microbenchmarks from dotnet/performance

Tested with commit 199ae88 (main) and the x86funclets branch rebased on top.

Funclets are faster on all of them. None of the benchmarks exercise the pathological paths with deep stack traces; the "deep" stack trace benchmarks are only 10 levels deep.

| Method | Job | Branch | kind | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(1%) | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ThrowAndCatch | Job-ICNAGN | x86funclets | Software | 1.180 us | 0.0077 us | 0.0072 us | 1.177 us | 1.171 us | 1.193 us | 0.21 | Faster | 0.0328 | 172 B | 1.00 |
| ThrowAndCatch | Job-JLGMJO | main | Software | 5.500 us | 0.0378 us | 0.0353 us | 5.494 us | 5.441 us | 5.570 us | 1.00 | Baseline | 0.0216 | 172 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-ICNAGN | x86funclets | Software | 1.339 us | 0.0037 us | 0.0032 us | 1.340 us | 1.331 us | 1.343 us | 0.23 | Faster | 0.0322 | 172 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-JLGMJO | main | Software | 5.706 us | 0.0277 us | 0.0231 us | 5.713 us | 5.643 us | 5.738 us | 1.00 | Baseline | 0.0226 | 172 B | 1.00 |
| ThrowAndCatchFinally | Job-ICNAGN | x86funclets | Software | 1.183 us | 0.0043 us | 0.0040 us | 1.182 us | 1.178 us | 1.191 us | 0.22 | Faster | 0.0283 | 172 B | 1.00 |
| ThrowAndCatchFinally | Job-JLGMJO | main | Software | 5.442 us | 0.0317 us | 0.0296 us | 5.436 us | 5.391 us | 5.495 us | 1.00 | Baseline | 0.0215 | 172 B | 1.00 |
| ThrowAndCatchWhen | Job-ICNAGN | x86funclets | Software | 1.256 us | 0.0064 us | 0.0060 us | 1.257 us | 1.245 us | 1.265 us | 0.23 | Faster | 0.0301 | 172 B | 1.00 |
| ThrowAndCatchWhen | Job-JLGMJO | main | Software | 5.520 us | 0.0274 us | 0.0256 us | 5.519 us | 5.476 us | 5.558 us | 1.00 | Baseline | 0.0218 | 172 B | 1.00 |
| ThrowAndCatchWhenFinally | Job-ICNAGN | x86funclets | Software | 1.256 us | 0.0067 us | 0.0063 us | 1.256 us | 1.245 us | 1.266 us | 0.23 | Faster | 0.0298 | 172 B | 1.00 |
| ThrowAndCatchWhenFinally | Job-JLGMJO | main | Software | 5.556 us | 0.0343 us | 0.0321 us | 5.548 us | 5.505 us | 5.619 us | 1.00 | Baseline | 0.0218 | 172 B | 1.00 |
| ThrowAndCatchDeep | Job-ICNAGN | x86funclets | Software | 3.602 us | 0.0131 us | 0.0116 us | 3.599 us | 3.586 us | 3.628 us | 0.51 | Faster | 0.1282 | 692 B | 1.00 |
| ThrowAndCatchDeep | Job-JLGMJO | main | Software | 7.123 us | 0.0362 us | 0.0338 us | 7.118 us | 7.075 us | 7.194 us | 1.00 | Baseline | 0.1128 | 692 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-ICNAGN | x86funclets | Software | 3.831 us | 0.0128 us | 0.0120 us | 3.830 us | 3.814 us | 3.853 us | 0.53 | Faster | 0.1224 | 692 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-JLGMJO | main | Software | 7.283 us | 0.0296 us | 0.0277 us | 7.276 us | 7.235 us | 7.326 us | 1.00 | Baseline | 0.1160 | 692 B | 1.00 |
| MultipleNestedTryCatch_FirstCatches | Job-ICNAGN | x86funclets | Software | 1.178 us | 0.0032 us | 0.0028 us | 1.179 us | 1.173 us | 1.184 us | 0.22 | Faster | 0.0328 | 172 B | 1.00 |
| MultipleNestedTryCatch_FirstCatches | Job-JLGMJO | main | Software | 5.448 us | 0.0175 us | 0.0146 us | 5.447 us | 5.421 us | 5.476 us | 1.00 | Baseline | 0.0216 | 172 B | 1.00 |
| MultipleNestedTryCatch_LastCatches | Job-ICNAGN | x86funclets | Software | 1.294 us | 0.0057 us | 0.0053 us | 1.294 us | 1.286 us | 1.303 us | 0.23 | Faster | 0.0310 | 172 B | 1.00 |
| MultipleNestedTryCatch_LastCatches | Job-JLGMJO | main | Software | 5.648 us | 0.0315 us | 0.0294 us | 5.659 us | 5.604 us | 5.688 us | 1.00 | Baseline | 0.0222 | 172 B | 1.00 |
| MultipleNestedTryFinally | Job-ICNAGN | x86funclets | Software | 1.336 us | 0.0066 us | 0.0061 us | 1.337 us | 1.328 us | 1.346 us | 0.23 | Faster | 0.0324 | 172 B | 1.00 |
| MultipleNestedTryFinally | Job-JLGMJO | main | Software | 5.726 us | 0.0433 us | 0.0405 us | 5.730 us | 5.647 us | 5.794 us | 1.00 | Baseline | 0.0225 | 172 B | 1.00 |
| CatchAndRethrowDeep | Job-ICNAGN | x86funclets | Software | 15.519 us | 0.0499 us | 0.0417 us | 15.522 us | 15.433 us | 15.583 us | 0.26 | Faster | 0.1236 | 812 B | 1.00 |
| CatchAndRethrowDeep | Job-JLGMJO | main | Software | 59.260 us | 0.4628 us | 0.4329 us | 59.243 us | 58.495 us | 59.876 us | 1.00 | Baseline | - | 812 B | 1.00 |
| CatchAndThrowOtherDeep | Job-ICNAGN | x86funclets | Software | 19.625 us | 0.1129 us | 0.1056 us | 19.621 us | 19.466 us | 19.837 us | 0.32 | Faster | 0.3129 | 1892 B | 1.00 |
| CatchAndThrowOtherDeep | Job-JLGMJO | main | Software | 61.255 us | 0.3391 us | 0.3172 us | 61.209 us | 60.603 us | 61.619 us | 1.00 | Baseline | 0.2413 | 1892 B | 1.00 |
| TryAndFinallyDeep | Job-ICNAGN | x86funclets | Software | 4.918 us | 0.0231 us | 0.0216 us | 4.919 us | 4.875 us | 4.964 us | 0.60 | Faster | 0.1180 | 692 B | 1.00 |
| TryAndFinallyDeep | Job-JLGMJO | main | Software | 8.200 us | 0.0668 us | 0.0592 us | 8.185 us | 8.142 us | 8.341 us | 1.00 | Baseline | 0.1303 | 692 B | 1.00 |
| TryAndCatchDeep_CaugtAtTheTop | Job-ICNAGN | x86funclets | Software | 4.729 us | 0.0242 us | 0.0189 us | 4.730 us | 4.699 us | 4.757 us | 0.61 | Faster | 0.1306 | 692 B | 1.00 |
| TryAndCatchDeep_CaugtAtTheTop | Job-JLGMJO | main | Software | 7.791 us | 0.0450 us | 0.0399 us | 7.777 us | 7.736 us | 7.882 us | 1.00 | Baseline | 0.1225 | 692 B | 1.00 |
| ThrowAndCatch | Job-ICNAGN | x86funclets | Hardware | 3.585 us | 0.0180 us | 0.0168 us | 3.584 us | 3.560 us | 3.618 us | 0.63 | Faster | 0.0282 | 172 B | 1.00 |
| ThrowAndCatch | Job-JLGMJO | main | Hardware | 5.662 us | 0.0184 us | 0.0172 us | 5.658 us | 5.639 us | 5.701 us | 1.00 | Baseline | 0.0222 | 172 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-ICNAGN | x86funclets | Hardware | 3.730 us | 0.0140 us | 0.0124 us | 3.728 us | 3.707 us | 3.756 us | 0.63 | Faster | 0.0295 | 172 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-JLGMJO | main | Hardware | 5.939 us | 0.0181 us | 0.0151 us | 5.935 us | 5.924 us | 5.972 us | 1.00 | Baseline | 0.0235 | 172 B | 1.00 |
| ThrowAndCatchFinally | Job-ICNAGN | x86funclets | Hardware | 3.552 us | 0.0134 us | 0.0125 us | 3.552 us | 3.532 us | 3.580 us | 0.63 | Faster | 0.0279 | 172 B | 1.00 |
| ThrowAndCatchFinally | Job-JLGMJO | main | Hardware | 5.630 us | 0.0364 us | 0.0341 us | 5.628 us | 5.587 us | 5.698 us | 1.00 | Baseline | 0.0221 | 172 B | 1.00 |
| ThrowAndCatchWhen | Job-ICNAGN | x86funclets | Hardware | 3.623 us | 0.0187 us | 0.0166 us | 3.625 us | 3.587 us | 3.647 us | 0.63 | Faster | 0.0286 | 172 B | 1.00 |
| ThrowAndCatchWhen | Job-JLGMJO | main | Hardware | 5.712 us | 0.0395 us | 0.0370 us | 5.733 us | 5.657 us | 5.762 us | 1.00 | Baseline | 0.0224 | 172 B | 1.00 |
| ThrowAndCatchWhenFinally | Job-ICNAGN | x86funclets | Hardware | 3.639 us | 0.0178 us | 0.0166 us | 3.634 us | 3.622 us | 3.670 us | 0.63 | Faster | 0.0288 | 172 B | 1.00 |
| ThrowAndCatchWhenFinally | Job-JLGMJO | main | Hardware | 5.735 us | 0.0187 us | 0.0175 us | 5.733 us | 5.703 us | 5.765 us | 1.00 | Baseline | 0.0225 | 172 B | 1.00 |
| ThrowAndCatchDeep | Job-ICNAGN | x86funclets | Hardware | 6.005 us | 0.0236 us | 0.0220 us | 5.999 us | 5.976 us | 6.045 us | 0.82 | Faster | 0.1199 | 692 B | 1.00 |
| ThrowAndCatchDeep | Job-JLGMJO | main | Hardware | 7.301 us | 0.0300 us | 0.0280 us | 7.306 us | 7.245 us | 7.334 us | 1.00 | Baseline | 0.1150 | 692 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-ICNAGN | x86funclets | Hardware | 6.260 us | 0.0227 us | 0.0212 us | 6.259 us | 6.219 us | 6.305 us | 0.84 | Faster | 0.1235 | 692 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-JLGMJO | main | Hardware | 7.428 us | 0.0245 us | 0.0229 us | 7.433 us | 7.388 us | 7.464 us | 1.00 | Baseline | 0.1175 | 692 B | 1.00 |
| MultipleNestedTryCatch_FirstCatches | Job-ICNAGN | x86funclets | Hardware | 3.556 us | 0.0186 us | 0.0174 us | 3.555 us | 3.528 us | 3.582 us | 0.63 | Faster | 0.0280 | 172 B | 1.00 |
| MultipleNestedTryCatch_FirstCatches | Job-JLGMJO | main | Hardware | 5.646 us | 0.0182 us | 0.0170 us | 5.650 us | 5.617 us | 5.673 us | 1.00 | Baseline | 0.0221 | 172 B | 1.00 |
| MultipleNestedTryCatch_LastCatches | Job-ICNAGN | x86funclets | Hardware | 3.687 us | 0.0172 us | 0.0160 us | 3.679 us | 3.670 us | 3.714 us | 0.63 | Faster | 0.0291 | 172 B | 1.00 |
| MultipleNestedTryCatch_LastCatches | Job-JLGMJO | main | Hardware | 5.879 us | 0.0225 us | 0.0210 us | 5.880 us | 5.833 us | 5.910 us | 1.00 | Baseline | 0.0231 | 172 B | 1.00 |
| MultipleNestedTryFinally | Job-ICNAGN | x86funclets | Hardware | 3.683 us | 0.0197 us | 0.0184 us | 3.680 us | 3.652 us | 3.712 us | 0.62 | Faster | 0.0291 | 172 B | 1.00 |
| MultipleNestedTryFinally | Job-JLGMJO | main | Hardware | 5.920 us | 0.0228 us | 0.0202 us | 5.918 us | 5.884 us | 5.961 us | 1.00 | Baseline | 0.0232 | 172 B | 1.00 |
| CatchAndRethrowDeep | Job-ICNAGN | x86funclets | Hardware | 18.051 us | 0.0764 us | 0.0715 us | 18.053 us | 17.930 us | 18.172 us | 0.30 | Faster | 0.1435 | 812 B | 1.00 |
| CatchAndRethrowDeep | Job-JLGMJO | main | Hardware | 60.568 us | 0.1999 us | 0.1772 us | 60.599 us | 60.311 us | 60.777 us | 1.00 | Baseline | - | 812 B | 1.00 |
| CatchAndThrowOtherDeep | Job-ICNAGN | x86funclets | Hardware | 21.770 us | 0.0543 us | 0.0508 us | 21.767 us | 21.690 us | 21.865 us | 0.35 | Faster | 0.3463 | 1892 B | 1.00 |
| CatchAndThrowOtherDeep | Job-JLGMJO | main | Hardware | 61.553 us | 0.3008 us | 0.2511 us | 61.467 us | 61.262 us | 62.120 us | 1.00 | Baseline | 0.2441 | 1892 B | 1.00 |
| TryAndFinallyDeep | Job-ICNAGN | x86funclets | Hardware | 7.182 us | 0.0257 us | 0.0228 us | 7.181 us | 7.133 us | 7.218 us | 0.85 | Faster | 0.1141 | 692 B | 1.00 |
| TryAndFinallyDeep | Job-JLGMJO | main | Hardware | 8.489 us | 0.0238 us | 0.0223 us | 8.490 us | 8.447 us | 8.529 us | 1.00 | Baseline | 0.1005 | 692 B | 1.00 |
| TryAndCatchDeep_CaugtAtTheTop | Job-ICNAGN | x86funclets | Hardware | 7.105 us | 0.0306 us | 0.0271 us | 7.103 us | 7.072 us | 7.158 us | 0.88 | Faster | 0.1123 | 692 B | 1.00 |
| TryAndCatchDeep_CaugtAtTheTop | Job-JLGMJO | main | Hardware | 8.079 us | 0.0295 us | 0.0276 us | 8.079 us | 8.036 us | 8.127 us | 1.00 | Baseline | 0.1270 | 692 B | 1.00 |
| ThrowAndCatch | Job-ICNAGN | x86funclets | ReflectionSoftware | 6.111 us | 0.0294 us | 0.0261 us | 6.115 us | 6.047 us | 6.155 us | 0.52 | Faster | 0.0969 | 540 B | 1.00 |
| ThrowAndCatch | Job-JLGMJO | main | ReflectionSoftware | 11.742 us | 0.0505 us | 0.0448 us | 11.739 us | 11.646 us | 11.827 us | 1.00 | Baseline | 0.0923 | 540 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-ICNAGN | x86funclets | ReflectionSoftware | 6.191 us | 0.0355 us | 0.0332 us | 6.177 us | 6.128 us | 6.239 us | 0.52 | Faster | 0.0989 | 540 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-JLGMJO | main | ReflectionSoftware | 11.958 us | 0.0711 us | 0.0630 us | 11.932 us | 11.886 us | 12.102 us | 1.00 | Baseline | 0.0946 | 540 B | 1.00 |
| ThrowAndCatchDeep | Job-ICNAGN | x86funclets | ReflectionSoftware | 8.399 us | 0.0775 us | 0.0725 us | 8.366 us | 8.315 us | 8.589 us | 0.63 | Faster | 0.1677 | 972 B | 1.00 |
| ThrowAndCatchDeep | Job-JLGMJO | main | ReflectionSoftware | 13.331 us | 0.0576 us | 0.0538 us | 13.326 us | 13.231 us | 13.412 us | 1.00 | Baseline | 0.1585 | 972 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-ICNAGN | x86funclets | ReflectionSoftware | 8.664 us | 0.0288 us | 0.0270 us | 8.671 us | 8.598 us | 8.694 us | 0.64 | Faster | 0.1738 | 972 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-JLGMJO | main | ReflectionSoftware | 13.483 us | 0.0682 us | 0.0638 us | 13.472 us | 13.394 us | 13.611 us | 1.00 | Baseline | 0.1593 | 972 B | 1.00 |
| ThrowAndCatch | Job-ICNAGN | x86funclets | ReflectionHardware | 8.433 us | 0.0378 us | 0.0335 us | 8.432 us | 8.373 us | 8.496 us | 0.71 | Faster | 0.1005 | 540 B | 1.00 |
| ThrowAndCatch | Job-JLGMJO | main | ReflectionHardware | 11.957 us | 0.0591 us | 0.0553 us | 11.938 us | 11.892 us | 12.069 us | 1.00 | Baseline | 0.0945 | 540 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-ICNAGN | x86funclets | ReflectionHardware | 8.597 us | 0.0356 us | 0.0333 us | 8.588 us | 8.557 us | 8.664 us | 0.70 | Faster | 0.1020 | 540 B | 1.00 |
| ThrowAndCatch_ManyCatchBlocks | Job-JLGMJO | main | ReflectionHardware | 12.307 us | 0.0683 us | 0.0639 us | 12.303 us | 12.213 us | 12.422 us | 1.00 | Baseline | 0.0968 | 540 B | 1.00 |
| ThrowAndCatchDeep | Job-ICNAGN | x86funclets | ReflectionHardware | 10.828 us | 0.0385 us | 0.0360 us | 10.824 us | 10.737 us | 10.881 us | 0.80 | Faster | 0.1728 | 972 B | 1.00 |
| ThrowAndCatchDeep | Job-JLGMJO | main | ReflectionHardware | 13.512 us | 0.0582 us | 0.0544 us | 13.521 us | 13.404 us | 13.587 us | 1.00 | Baseline | 0.1612 | 972 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-ICNAGN | x86funclets | ReflectionHardware | 11.031 us | 0.0435 us | 0.0363 us | 11.036 us | 10.965 us | 11.097 us | 0.80 | Faster | 0.1743 | 972 B | 1.00 |
| ThrowAndCatchDeepRecursive | Job-JLGMJO | main | ReflectionHardware | 13.704 us | 0.0720 us | 0.0673 us | 13.713 us | 13.596 us | 13.807 us | 1.00 | Baseline | 0.1623 | 972 B | 1.00 |

@filipnavara
Member Author

Deep stack trace benchmark

Source code
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i, NotImplementedException inEx)
    {
        if (i == 0)
        {
            throw inEx;
        }

        CallMe(i - 1, inEx);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe(NotImplementedException inEx, int depth = 100)
    {
        try
        {
            CallMe(depth, inEx);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    [Params(1, 10, 100, 1000)]
    public int Depth;

    [Benchmark]
    public void CatchMe1000x100()
    {
        CatchMe(new NotImplementedException(), Depth);
    }    

Main:

| Method | Depth | Mean | Error | StdDev |
|---|---|---|---|---|
| CatchMe1000x100 | 1 | 5.636 us | 0.0139 us | 0.0130 us |
| CatchMe1000x100 | 10 | 7.029 us | 0.0163 us | 0.0145 us |
| CatchMe1000x100 | 100 | 20.581 us | 0.0601 us | 0.0562 us |
| CatchMe1000x100 | 1000 | 153.568 us | 0.4248 us | 0.3974 us |

Funclets:

| Method | Depth | Mean | Error | StdDev |
|---|---|---|---|---|
| CatchMe1000x100 | 1 | 1.488 us | 0.0035 us | 0.0033 us |
| CatchMe1000x100 | 10 | 3.657 us | 0.0115 us | 0.0102 us |
| CatchMe1000x100 | 100 | 23.606 us | 0.1366 us | 0.1211 us |
| CatchMe1000x100 | 1000 | 220.143 us | 0.7091 us | 0.6633 us |
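For reference, a minimal self-contained harness for the benchmark fragment above might look roughly like the following. The class names and entry point are assumptions for illustration only; the original comment shows just the benchmark body, not how it was hosted:

using System;
using System.Runtime.CompilerServices;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DeepThrowCatchBenchmarks
{
    static int exceptionsHandled = 0;

    [Params(1, 10, 100, 1000)]
    public int Depth;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i, NotImplementedException inEx)
    {
        if (i == 0)
        {
            throw inEx;
        }

        CallMe(i - 1, inEx);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe(NotImplementedException inEx, int depth)
    {
        try
        {
            CallMe(depth, inEx);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    [Benchmark]
    public void CatchMe1000x100()
    {
        CatchMe(new NotImplementedException(), Depth);
    }
}

internal static class BenchmarkEntryPoint
{
    // Run with: dotnet run -c Release
    private static void Main() => BenchmarkRunner.Run<DeepThrowCatchBenchmarks>();
}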

@filipnavara
Member Author

filipnavara commented Apr 24, 2025

Handled exceptions/second metric

This is essentially the same non-scientific benchmark as the one shared in #113985 (comment). We run 10 threads, each of them throwing an exception at a deep stack depth and catching it at the top of the stack.

This shows pretty much the worst-case scenario (at depth 100) as well as the improvements for the more common scenarios. It also shows that we don't introduce any significant bottleneck by taking an additional lock.

Source code
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class Program
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i, NotImplementedException inEx)
    {
        if (i == 0)
        {
            throw inEx;
        }

        CallMe(i - 1, inEx);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe(NotImplementedException inEx, int depth = 100)
    {
        try
        {
            CallMe(depth, inEx);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            NotImplementedException inEx = new NotImplementedException();
            CatchMe(inEx, 100); // DEPTH: 100, 50, 10 in the runs below
        }
    }

    public static void Main(string[] args)
    {
        exceptionsHandled = 0; 
    
        Stopwatch sw = Stopwatch.StartNew();
        int savedExceptionsHandled = 0;
        int maxThreads = args.Length > 0 && args[0] == "st" ? 1 : 10; // pass "st" for a single-threaded run
        for (int i = 0; i < maxThreads; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        Thread.Sleep(1000);
        savedExceptionsHandled = Volatile.Read(ref exceptionsHandled);
        sw.Stop();
        Console.WriteLine($"Exceptions handled: {savedExceptionsHandled}");
        Console.WriteLine($"Time: {sw.Elapsed}");
        Console.WriteLine($"Normalized: {(int)(savedExceptionsHandled * 1000 / sw.ElapsedMilliseconds)}");
        Environment.Exit(0);
    }
}

Depth: 100

Main: 386,344 ex/sec
.NET 9: 378,816 ex/sec
Funclets: 343,814 ex/sec

Depth: 50

Main: 599,575 ex/sec
Funclets: 646,922 ex/sec

Depth: 10

Main: 931,671 ex/sec
Funclets: 1,645,489 ex/sec

@am11
Member

am11 commented May 2, 2025

The issues that I plan to look at next week are the debugging scenarios, which unfortunately are mostly testable only via internal test suites. I hope to find some time next week to take a look at them. Notably, I have significant concerns around the behavior of mixed-mode stacks, throwing and catching across managed/native boundaries, and such. The contract between the runtime and the debugger has historically been fairly fragile in this area, and changing EH models entirely may cause a large problem.

@davidwrighton, would it be possible to get this validated in the current preview time-frame? It will allow us to remove a few more HMFs from jithelpers.
