Conversation

VSadov
Member

@VSadov VSadov commented May 24, 2023

Stack walks for GC purposes require that the EE is suspended and thus contribute directly to GC pauses, which is especially noticeable when workstation GC is used.

A few changes here to cache the looked-up unwind/method infos, as stack walks are fairly repetitive and often look up the same method infos.
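For illustration, here is a minimal sketch of the caching idea, assuming hypothetical names rather than the PR's actual code: a small fixed-size, direct-mapped cache keyed by instruction pointer, so a repetitive stack walk can reuse info it resolved on a previous walk. (A thread-safe variant of the slot handling is discussed further down in this thread.)

```cpp
// Hypothetical names only - a sketch of the approach, not the PR's code.
#include <cstdint>
#include <cstddef>

struct UnwindInfo { /* whatever the runtime resolves for a given pc */ };

struct CacheEntry
{
    uintptr_t  pc;     // key: instruction pointer the info was resolved for
    UnwindInfo info;
};

static const size_t c_cacheSize = 128;     // fixed footprint; no unbounded growth
static CacheEntry s_cache[c_cacheSize];    // zero-initialized; pc == 0 means empty

static size_t Slot(uintptr_t pc) { return (pc >> 2) % c_cacheSize; }

bool TryGetCached(uintptr_t pc, UnwindInfo* result)
{
    CacheEntry& e = s_cache[Slot(pc)];
    if (e.pc != pc)
        return false;                      // miss: caller does the slow lookup
    *result = e.info;
    return true;
}

void AddToCache(uintptr_t pc, const UnwindInfo& info)
{
    s_cache[Slot(pc)] = CacheEntry{ pc, info };  // lossy: a collision just evicts
}
```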

@VSadov VSadov requested a review from MichalStrehovsky as a code owner May 24, 2023 06:52
@ghost ghost assigned VSadov May 24, 2023
@ghost

ghost commented May 24, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Stack walks for GC purposes require that the EE is suspended and thus contribute directly to GC pauses, which is especially noticeable when workstation GC is used.

A few changes here to either

  • reduce redundant work, such as unwinding floating point registers, which never contain GC roots (see the sketch after these details), or
  • cache the looked-up unwind/method infos, as stack walks are fairly repetitive and often look up the same method infos.
Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr
Milestone: -
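A hedged illustration of the first bullet above, assuming hypothetical names rather than the actual runtime code: floating point registers can never hold GC references, so a GC-only stack walk can skip restoring them while unwinding a frame.

```cpp
// Hypothetical sketch - not the actual runtime code.
struct UnwindInfo;                       // opaque here; see the earlier sketch
struct RegisterSet { /* integer and floating point register values */ };

enum class UnwindPurpose { Full, GcRootsOnly };

static void RestoreIntegerRegisters(const UnwindInfo&, RegisterSet&) { /* stub */ }
static void RestoreFloatRegisters(const UnwindInfo&, RegisterSet&)   { /* stub */ }

void UnwindFrame(const UnwindInfo& info, RegisterSet& regs, UnwindPurpose purpose)
{
    // Integer registers may hold object references, so they are always restored.
    RestoreIntegerRegisters(info, regs);

    // Floating point registers never contain GC roots, so restoring them is
    // wasted work during a GC stack walk; do it only for full unwinds
    // (e.g. exception propagation).
    if (purpose == UnwindPurpose::Full)
        RestoreFloatRegisters(info, regs);
}
```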

}
else
{
    bool retVal = uc.getInfoFromDwarfSection(pc, uwInfoSections, dwarfOffsetHint);
Member
getInfoFromDwarfSection has a very similar cache here: https://github.com/dotnet/runtime/blob/main/src/native/external/llvm-libunwind/src/UnwindCursor.hpp#L1682

Is this cache not effective in our scenarios? Should we short-circuit it if it is not helping?

Member

(Or substitute it with your cache.)

Member Author
@VSadov VSadov May 25, 2023

That is a lower-level cache that helps with finding the FDE location, but there is still some parsing and fetching of more info after that. Also, the FDE cache considers the possibility of unloading and thus uses locks.

Our cache only deals with managed frames and does not need to be concerned with unloading. Even if we support unloading one day, we'd probably just suspend the runtime and flush the cache when something unloads, because we can.

I think we could short-circuit this cache for our scenario, in theory, but it may be better to keep it and just treat it as an L2 cache that helps with misses in our faster, but smaller, cache (a lookup-through sketch is below).

Edit: actually, I do not think we use this cache at the moment. Not on Linux.
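A sketch of the lookup-through arrangement described above, reusing the hypothetical TryGetCached/AddToCache helpers from the earlier sketch; LookupViaLibunwind is an assumed stand-in for the slow path (libunwind's own FDE cache plus the DWARF parse), not a real API.

```cpp
// Hypothetical wrapper over libunwind's lookup (its FDE cache + DWARF parsing).
bool LookupViaLibunwind(uintptr_t pc, UnwindInfo* result);

bool GetUnwindInfo(uintptr_t pc, UnwindInfo* result)
{
    if (TryGetCached(pc, result))          // L1: our small, fast, lossy cache
        return true;

    if (!LookupViaLibunwind(pc, result))   // L2 + slow path in libunwind
        return false;

    AddToCache(pc, *result);               // populate L1 for subsequent walks
    return true;
}
```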

Member

If I am reading the code correctly, the libunwind cache has no upper limit on growth. In the fullness of time, it can have an entry for every method in the program. I do not think we want that behavior.

Member

it can have an entry for every method of the program

And it seems to be doing a linear lookup in an array of these entries: https://github.com/dotnet/runtime/blob/main/src/native/external/llvm-libunwind/src/UnwindCursor.hpp#L154-L159 . The linear lookup is going to be slower than binary searching for the right entry once the array grows enough.
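To illustrate the point (this is not libunwind's code, and the entry layout here is assumed): with the entries kept sorted by starting pc, each lookup is O(log n) via binary search, where the linear scan linked above is O(n).

```cpp
// Illustration only - hypothetical types, not libunwind's implementation.
#include <cstdint>
#include <vector>
#include <algorithm>

struct FdeEntry
{
    uintptr_t start;   // first pc covered by this entry
    uintptr_t end;     // one past the last pc covered
    uintptr_t fde;     // location of the frame description entry
};

// `entries` must be sorted by `start` and non-overlapping.
const FdeEntry* FindEntry(const std::vector<FdeEntry>& entries, uintptr_t pc)
{
    auto it = std::upper_bound(entries.begin(), entries.end(), pc,
        [](uintptr_t value, const FdeEntry& e) { return value < e.start; });
    if (it == entries.begin())
        return nullptr;                    // pc is below the first entry
    --it;                                  // last entry with start <= pc
    return (pc < it->end) ? &*it : nullptr;
}
```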

Member Author
@VSadov VSadov May 30, 2023

I have done some experiments to see:

  • is there a benefit from caching with server GC, or is there a penalty?
  • is there a performance cliff if we get into thrashing due to a vastly bigger number of active frames or bad hashing?

I used the Stage2 ASP.NET app as something that can be considered real-world - it does some allocations, but not gratuitously.

Member Author
@VSadov VSadov May 31, 2023

For server GC there is not much of a benefit or a penalty from the cache when the scenario runs as-is. I think the fact that stack walking is distributed onto more threads helps, but the main factor is that server GC allows a much bigger heap and thus needs to collect much less frequently. Some costs may grow with the heap size, but stack walking stays the same, and as it is performed less frequently there is less impact.

If I force server GC into collecting more frequently by reducing the size of gen0 via DOTNET_GCgen0size, I see that the effect from caching is positive.

Server GC scenario, no caching:

| Requests/sec           | 229,453   |
| Requests               | 3,464,648 |
| Mean latency (ms)      | 1.18      |
| Max latency (ms)       | 35.29     |

== gen0: 0x400000

| Requests/sec           | 219,995   |
| Requests               | 3,321,876 |
| Mean latency (ms)      | 1.26      |
| Max latency (ms)       | 24.87     |

== gen0: 0x100000

| Requests/sec           | 179,919   |
| Requests               | 2,716,712 |
| Mean latency (ms)      | 4.38      |
| Max latency (ms)       | 268.62    |

Same with caching:

| Requests/sec           | 231,501   |
| Requests               | 3,495,577 |
| Mean latency (ms)      | 1.16      |
| Max latency (ms)       | 28.64     |

== gen0: 0x400000

| Requests/sec           | 221,480   |
| Requests               | 3,344,248 |
| Mean latency (ms)      | 1.25      |
| Max latency (ms)       | 32.28     |

== gen0: 0x100000
| Requests/sec           | 184,528   |
| Requests               | 2,786,302 |
| Mean latency (ms)      | 4.02      |
| Max latency (ms)       | 260.77    |

Member Author

To investigate what happens if the cache is too small or experiences a lot of collisions for other reasons, I built a variant that hashes everything into element #1 - effectively a cache of size 1.

What I see is that the worst-case scenario performs roughly the same as having no cache, or slightly worse when the GC is churning. (A 1MB gen0 is very small.)

With 1-element cache:

| Requests/sec           | 230,149   |
| Requests               | 3,475,149 |
| Mean latency (ms)      | 1.18      |
| Max latency (ms)       | 34.82     |

== gen0: 0x400000
| Requests/sec           | 220,356   |
| Requests               | 3,327,257 |
| Mean latency (ms)      | 1.27      |
| Max latency (ms)       | 34.98     |

== gen0: 0x100000
| Requests/sec           | 176,887   |
| Requests               | 2,670,928 |
| Mean latency (ms)      | 4.24      |
| Max latency (ms)       | 272.22    |

Member Author
@VSadov VSadov May 31, 2023

What I see happening in the worst case is that we have a lot of misses, and the cost of a miss is an interlocked exchange (+ a useless fetch of a cache line, a pointless transfer of ownership, etc.), but overall it seems not too bad next to the cost of searching/parsing/constructing the unw_proc_info_t.

We may also get some cache hits even with a 1-element cache. Two threads may go through the same sequence of frames, or even the same thread may ask for the same method info - we may need the same info to initiate the stack walk, to do the step, and to do things like IsSafePoint and IsFunclet. Such hits may help a bit even in the 1-element case.
I even thought about adding a thread-local 1-element cache, but I think it is unnecessary complexity and unlikely to add much on top of an N-element cache.
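A hedged sketch of the kind of slot protocol being described, with hypothetical names (not the PR's exact code): a reader takes exclusive ownership of a slot via an atomic exchange, inspects the entry, and publishes it back; a miss still pays for the exchange and the cache-line transfer mentioned above.

```cpp
#include <atomic>
#include <cstdint>

struct Entry { uintptr_t pc; UnwindInfo info; };  // UnwindInfo as sketched earlier

static std::atomic<Entry*> s_slots[128];

bool TryGetCachedConcurrent(uintptr_t pc, UnwindInfo* result)
{
    std::atomic<Entry*>& slot = s_slots[(pc >> 2) % 128];

    // Take exclusive ownership of whatever entry the slot holds. If another
    // thread owns it right now, or the slot is empty, treat that as a miss.
    Entry* e = slot.exchange(nullptr, std::memory_order_acquire);
    if (e == nullptr)
        return false;

    bool hit = (e->pc == pc);
    if (hit)
        *result = e->info;

    // Publish the entry back so other threads and later walks can reuse it.
    slot.store(e, std::memory_order_release);
    return hit;
}
```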

Member

I have run a number of experiments to understand the cache behavior - workstation/server GC, varying number of cores, and varying number of stack frames. https://gist.github.com/jkotas/6ddad090964d9cc35c6ccf4d71272e67 is the base of my micro-benchmarks. I agree that the cache helps a lot in the simple case, but it can hurt with a larger number of cores. There were visible regressions on my 20-core Intel desktop with server GC and 20 GC cores/GC heaps.

The unwind info lookup and decoding leave a lot of perf on the table. I think we should fix the unwind info lookup and decoding perf first - that is going to be an unconditional improvement - before evaluating the benefits of a cache like this one. I did a quick hack to copy the Windows implementation of binary search to Unix, and it gave back 40%-50% of the cache benefit in the single-threaded microbenchmarks: https://gist.github.com/jkotas/6ddad090964d9cc35c6ccf4d71272e67 . I will try to get the numbers for the Todo app too.

@VSadov
Member Author

VSadov commented May 25, 2023

I've separated the llvm change into a separate commit and added a reference to the commit tracking file.
CC: @am11

@VSadov
Member Author

VSadov commented May 31, 2023

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Member Author

VSadov commented Jul 31, 2023

My sense here is that there is currently not a lot of interest in making stack walks faster via caching/reusing what we learned in previous stack walks. There is no need to keep this PR open. We can revisit later if needed.

@VSadov VSadov closed this Jul 31, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Aug 30, 2023