Conversation

VSadov
Member

@VSadov VSadov commented May 24, 2023

Stack walks for GC purposes require that the EE is suspended and thus contribute directly to GC pauses, which is especially noticeable when workstation GC is used.

A few changes here to cache the looked-up unwind/method infos, as stack walks are fairly repetitive and often look up the same method infos.
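For illustration, here is a minimal sketch of the caching idea, assuming hypothetical names rather than the PR's actual code: a small fixed-size, direct-mapped cache keyed by instruction pointer, so a repetitive stack walk can reuse info it resolved on a previous walk. (A thread-safe variant of the slot handling is discussed further down in this thread.)

```cpp
// Hypothetical names only - a sketch of the approach, not the PR's code.
#include <cstdint>
#include <cstddef>

struct UnwindInfo { /* whatever the runtime resolves for a given pc */ };

struct CacheEntry
{
    uintptr_t  pc;     // key: instruction pointer the info was resolved for
    UnwindInfo info;
};

static const size_t c_cacheSize = 128;     // fixed footprint; no unbounded growth
static CacheEntry s_cache[c_cacheSize];    // zero-initialized; pc == 0 means empty

static size_t Slot(uintptr_t pc) { return (pc >> 2) % c_cacheSize; }

bool TryGetCached(uintptr_t pc, UnwindInfo* result)
{
    CacheEntry& e = s_cache[Slot(pc)];
    if (e.pc != pc)
        return false;                      // miss: caller does the slow lookup
    *result = e.info;
    return true;
}

void AddToCache(uintptr_t pc, const UnwindInfo& info)
{
    s_cache[Slot(pc)] = CacheEntry{ pc, info };  // lossy: a collision just evicts
}
```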

@VSadov VSadov requested a review from MichalStrehovsky as a code owner May 24, 2023 06:52
@ghost ghost assigned VSadov May 24, 2023
@ghost

ghost commented May 24, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Stack walks for GC purposes require that the EE is suspended and thus contribute directly to GC pauses, which is especially noticeable when workstation GC is used.

A few changes here to either

  • reduce redundant work, such as unwinding floating point registers, which never contain GC roots (see the sketch after these details), or
  • cache the looked-up unwind/method infos, as stack walks are fairly repetitive and often look up the same method infos.
Author: VSadov
Assignees: VSadov
Labels: area-NativeAOT-coreclr
Milestone: -
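A hedged illustration of the first bullet above, assuming hypothetical names rather than the actual runtime code: floating point registers can never hold GC references, so a GC-only stack walk can skip restoring them while unwinding a frame.

```cpp
// Hypothetical sketch - not the actual runtime code.
struct UnwindInfo;                       // opaque here; see the earlier sketch
struct RegisterSet { /* integer and floating point register values */ };

enum class UnwindPurpose { Full, GcRootsOnly };

static void RestoreIntegerRegisters(const UnwindInfo&, RegisterSet&) { /* stub */ }
static void RestoreFloatRegisters(const UnwindInfo&, RegisterSet&)   { /* stub */ }

void UnwindFrame(const UnwindInfo& info, RegisterSet& regs, UnwindPurpose purpose)
{
    // Integer registers may hold object references, so they are always restored.
    RestoreIntegerRegisters(info, regs);

    // Floating point registers never contain GC roots, so restoring them is
    // wasted work during a GC stack walk; do it only for full unwinds
    // (e.g. exception propagation).
    if (purpose == UnwindPurpose::Full)
        RestoreFloatRegisters(info, regs);
}
```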

}
else
{
    bool retVal = uc.getInfoFromDwarfSection(pc, uwInfoSections, dwarfOffsetHint);
Member
getInfoFromDwarfSection has a very similar cache here: https://github.com/dotnet/runtime/blob/main/src/native/external/llvm-libunwind/src/UnwindCursor.hpp#L1682

Is this cache not effective in our scenarios? Should we short-circuit it if it is not helping?

Member

(Or substitute it with your cache.)

Member Author
@VSadov VSadov May 25, 2023

That is a lower-level cache that helps with finding the FDE location, but there is still some parsing and fetching of more info after that. Also, the FDE cache considers the possibility of unloading and thus uses locks.

Our cache only deals with managed frames and does not need to be concerned with unloading. Even if we support unloading one day, we'd probably just suspend the runtime and flush the cache when something unloads, because we can.

I think we could short-circuit this cache for our scenario, in theory, but it may be better to keep it and just treat it as an L2 cache that helps with misses in our faster, but smaller, cache (a lookup-through sketch is below).

Edit: actually, I do not think we use this cache at the moment. Not on Linux.
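A sketch of the lookup-through arrangement described above, reusing the hypothetical TryGetCached/AddToCache helpers from the earlier sketch; LookupViaLibunwind is an assumed stand-in for the slow path (libunwind's own FDE cache plus the DWARF parse), not a real API.

```cpp
// Hypothetical wrapper over libunwind's lookup (its FDE cache + DWARF parsing).
bool LookupViaLibunwind(uintptr_t pc, UnwindInfo* result);

bool GetUnwindInfo(uintptr_t pc, UnwindInfo* result)
{
    if (TryGetCached(pc, result))          // L1: our small, fast, lossy cache
        return true;

    if (!LookupViaLibunwind(pc, result))   // L2 + slow path in libunwind
        return false;

    AddToCache(pc, *result);               // populate L1 for subsequent walks
    return true;
}
```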

Member

If I am reading the code correctly, the libunwind cache has no upper limit on growth. In the fullness of time, it can have an entry for every method in the program. I do not think we want that behavior.

Member

it can have an entry for every method of the program

And it seems to be doing a linear lookup in an array of these entries: https://github.com/dotnet/runtime/blob/main/src/native/external/llvm-libunwind/src/UnwindCursor.hpp#L154-L159 . The linear lookup is going to be slower than binary searching for the right entry once the array grows enough.
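To illustrate the point (this is not libunwind's code, and the entry layout here is assumed): with the entries kept sorted by starting pc, each lookup is O(log n) via binary search, where the linear scan linked above is O(n).

```cpp
// Illustration only - hypothetical types, not libunwind's implementation.
#include <cstdint>
#include <vector>
#include <algorithm>

struct FdeEntry
{
    uintptr_t start;   // first pc covered by this entry
    uintptr_t end;     // one past the last pc covered
    uintptr_t fde;     // location of the frame description entry
};

// `entries` must be sorted by `start` and non-overlapping.
const FdeEntry* FindEntry(const std::vector<FdeEntry>& entries, uintptr_t pc)
{
    auto it = std::upper_bound(entries.begin(), entries.end(), pc,
        [](uintptr_t value, const FdeEntry& e) { return value < e.start; });
    if (it == entries.begin())
        return nullptr;                    // pc is below the first entry
    --it;                                  // last entry with start <= pc
    return (pc < it->end) ? &*it : nullptr;
}
```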

Member Author
@VSadov VSadov May 30, 2023

I have done some experiments to see:

  • is there a benefit from caching with server GC, or is there a penalty?
  • is there a performance cliff if we get into thrashing due to a vastly bigger number of active frames or bad hashing?

I used the Stage2 ASP.NET app as something that can be considered real-world - it does some allocations, but not gratuitously.

Member Author
@VSadov VSadov May 31, 2023

For server GC there is not much of a benefit or a penalty from the cache when the scenario runs as-is. I think the fact that stack walking is distributed onto more threads helps, but the main factor is that server GC allows a much bigger heap and thus needs to collect much less frequently. Some costs may grow with the heap size, but stack walking stays the same, and as it is performed less frequently there is less impact.

If I force server GC into collecting more frequently by reducing the size of gen0 via DOTNET_GCgen0size, I see that the effect from caching is positive.

Server GC scenario, no caching:

| Requests/sec           | 229,453   |
| Requests               | 3,464,648 |
| Mean latency (ms)      | 1.18      |
| Max latency (ms)       | 35.29     |

== gen0: 0x400000

| Requests/sec           | 219,995   |
| Requests               | 3,321,876 |
| Mean latency (ms)      | 1.26      |
| Max latency (ms)       | 24.87     |

== gen0: 0x100000

| Requests/sec           | 179,919   |
| Requests               | 2,716,712 |
| Mean latency (ms)      | 4.38      |
| Max latency (ms)       | 268.62    |

Same with caching:

| Requests/sec           | 231,501   |
| Requests               | 3,495,577 |
| Mean latency (ms)      | 1.16      |
| Max latency (ms)       | 28.64     |

== gen0: 0x400000

| Requests/sec           | 221,480   |
| Requests               | 3,344,248 |
| Mean latency (ms)      | 1.25      |
| Max latency (ms)       | 32.28     |

== gen0: 0x100000
| Requests/sec           | 184,528   |
| Requests               | 2,786,302 |
| Mean latency (ms)      | 4.02      |
| Max latency (ms)       | 260.77    |

Member Author

To investigate what happens if the cache is too small or experiences a lot of collisions for other reasons, I built a variant that hashes everything into element #1 - effectively a cache of size 1.

What I see is that the worst-case scenario performs roughly the same as having no cache, or slightly worse when the GC is churning. (A 1MB gen0 is very small.)

With 1-element cache:

| Requests/sec           | 230,149   |
| Requests               | 3,475,149 |
| Mean latency (ms)      | 1.18      |
| Max latency (ms)       | 34.82     |

== gen0: 0x400000
| Requests/sec           | 220,356   |
| Requests               | 3,327,257 |
| Mean latency (ms)      | 1.27      |
| Max latency (ms)       | 34.98     |

== gen0: 0x100000
| Requests/sec           | 176,887   |
| Requests               | 2,670,928 |
| Mean latency (ms)      | 4.24      |
| Max latency (ms)       | 272.22    |

Member Author
@VSadov VSadov May 31, 2023

What I see happening in the worst case is that we have a lot of misses, and the cost of a miss is an interlocked exchange (+ a useless fetch of a cache line, a pointless transfer of ownership, etc.), but overall it seems not too bad next to the cost of searching/parsing/constructing the unw_proc_info_t.

We may also get some cache hits even with a 1-element cache. Two threads may go through the same sequence of frames, or even the same thread may ask for the same method info - we may need the same info to initiate the stack walk, to do the step, and to do things like IsSafePoint and IsFunclet. Such hits may help a bit even in the 1-element case.
I even thought about adding a thread-local 1-element cache, but I think it is unnecessary complexity and unlikely to add much on top of an N-element cache.
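A hedged sketch of the kind of slot protocol being described, with hypothetical names (not the PR's exact code): a reader takes exclusive ownership of a slot via an atomic exchange, inspects the entry, and publishes it back; a miss still pays for the exchange and the cache-line transfer mentioned above.

```cpp
#include <atomic>
#include <cstdint>

struct Entry { uintptr_t pc; UnwindInfo info; };  // UnwindInfo as sketched earlier

static std::atomic<Entry*> s_slots[128];

bool TryGetCachedConcurrent(uintptr_t pc, UnwindInfo* result)
{
    std::atomic<Entry*>& slot = s_slots[(pc >> 2) % 128];

    // Take exclusive ownership of whatever entry the slot holds. If another
    // thread owns it right now, or the slot is empty, treat that as a miss.
    Entry* e = slot.exchange(nullptr, std::memory_order_acquire);
    if (e == nullptr)
        return false;

    bool hit = (e->pc == pc);
    if (hit)
        *result = e->info;

    // Publish the entry back so other threads and later walks can reuse it.
    slot.store(e, std::memory_order_release);
    return hit;
}
```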

Member

I have run a number of experiments to understand the cache behavior - workstation/server GC, varying number of cores, and varying number of stack frames. https://gist.github.com/jkotas/6ddad090964d9cc35c6ccf4d71272e67 is the base of my micro-benchmarks. I agree that the cache helps a lot in the simple case, but it can hurt with a larger number of cores. There were visible regressions on my 20-core Intel desktop with server GC and 20 GC cores/GC heaps.

The unwind info lookup and decoding leave a lot of perf on the table. I think we should fix the unwind info lookup and decoding perf first - that is going to be an unconditional improvement - before evaluating the benefits of a cache like this one. I did a quick hack to copy the Windows implementation of binary search to Unix, and it gave back 40%-50% of the cache benefit in the single-threaded microbenchmarks: https://gist.github.com/jkotas/6ddad090964d9cc35c6ccf4d71272e67 . I will try to get the numbers for the Todo app too.

@VSadov
Member Author

VSadov commented May 25, 2023

I've separated the llvm change into a separate commit and added a reference to the commit tracking file.
CC: @am11

@VSadov
Member Author

VSadov commented May 31, 2023

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Member Author

VSadov commented Jul 31, 2023

My sense here is that there is currently not a lot of interest in making stack walks faster via caching/reusing what we learned in previous stack walks. There is no need to keep this PR open. We can revisit later if needed.

@VSadov VSadov closed this Jul 31, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Aug 30, 2023