
Conversation

@lambdageek
Member

Related to #21009.

There are two scenarios:

  1. When someone force quits Mono (or just runs `kill -TERM <pid>`), the process can receive the signal on any thread.
  2. If a thread in the process crashes but is not attached to the runtime, Mono's signal handlers still run.

The crash reporter assumes that the crashing thread is either attached to the runtime, or at least mono_thread_info_current or the JIT TLS data are set for the thread. If the thread is truly foreign and it never interacted with Mono, and it crashes, both of those assumptions are false, but Mono's crash reporter signal handlers still run.

The solution in this PR: if crash reporting is enabled, start a dedicated thread at process startup that acts as a "crash report leader". When a crash happens, the crashing thread (the crash originator) wakes the leader, and the leader collects the crash report. The crash originator does no work that requires being attached to the runtime or to the JIT, such as iterating over thread IDs or stack walking.

lambdageek added 14 commits June 7, 2021 23:53
At process startup, start a separate thread that is attached to the runtime and
can collect crash reports. Crashing threads will wake it and wait for it to
collect the crash report.
We need to coordinate the originator and the leader in a few places.

The leader needs to pause after collecting the thread IDs, before suspending
the non-originator threads, and again while the originator is dumping its own
stack.

The originator needs to wait for the leader to collect the thread IDs and to
tell it its assigned slot. Then it tells the leader to suspend the others,
dumps its own memory, then tells the leader to dump the whole crash report and
waits for it to reply when it's done.
either because the crash leader crashed, or because the process got a SIGTERM
and it arrived on the crash leader thread
@lambdageek lambdageek force-pushed the merp-dedicated-thread branch from e3555c5 to 21e65ec Compare June 8, 2021 04:05
@marek-safar marek-safar requested a review from vargaz June 8, 2021 05:22
@lambdageek
Member Author

The "let's have a crash in dlopen" test is failing because we call `dladdr` to get the symbols for the stack traces, and it's deadlocking with the crashed thread, which is holding the lock internal to dyld:

We're quite stuck here:

  1. To get the managed and unmanaged stacks we need to call dladdr in order to get the module and symbol addresses for each frame;
  2. If the crash originator crashed in dyld, it may hold the underlying lock.

So the question is: do we think it is likely that important crashes are happening because corrupted arguments are passed to `dlopen`? Or can we disable this test (and give up on this potential class of issues for now)?

Sample stack trace of a deadlock
* thread #1, name = 'tid_103', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff20408cce libsystem_kernel.dylib`read + 10
    frame #1: 0x000000010f2fbf27 mono`mono_threads_summarize_execute_internal at threads.c:6694:13 [opt]
    frame #2: 0x000000010f2fbf0a mono`mono_threads_summarize_execute_internal(ctx=, out=0x00007ffee0b51a90, hashes=0x00007ffee0b51aa0, silent=, working_mem="", provided_size=10000000, this_thread_controls=1) at threads.c:7329 [opt]
    frame #3: 0x000000010f2fc1e9 mono`mono_threads_summarize(ctx=0x00007fb783824cb0, out=, hashes=, silent=0, signal_handler_controller=, mem=, provided_size=10000000) at threads.c:7410:15 [opt]
    frame #4: 0x000000010f1c2504 mono`mono_dump_native_crash_info at mini-posix.c:987:13 [opt]
    frame #5: 0x000000010f1c22f2 mono`mono_dump_native_crash_info(signal="segv", mctx=0x00007fb783824cb0, info=) at mini-posix.c:1105 [opt]
    frame #6: 0x000000010f15e48e mono`mono_handle_native_crash(signal="segv", mctx=0x00007fb783824cb0, info=0x0000000000000000) at mini-exceptions.c:3468:2 [opt]
    frame #7: 0x000000010f3c92a7 mono`altstack_handle_and_restore.cold.1 at exceptions-amd64.c:880:4 [opt]
    frame #8: 0x000000010f1bcfbe mono`altstack_handle_and_restore(ctx=, obj=, flags=) at exceptions-amd64.c:879:7 [opt]
    frame #9: 0x000000011ee53b7c dyld`_platform_strchr$VARIANT$Base + 28
    frame #10: 0x000000011ee0d954 dyld`dlopen_internal + 206
    frame #11: 0x00007fff2045acb4 libdyld.dylib`dlopen_internal(char const*, int, void*) + 185
    frame #12: 0x00007fff2044909e libdyld.dylib`dlopen + 28
    frame #13: 0x000000010f700420
    frame #14: 0x000000010f6f4521
    frame #15: 0x000000010f0bdce2 mono`mono_jit_runtime_invoke(method=0x00007fb782c17d78, obj=, params=0x00007ffee0b52d08, exc=, error=0x00007ffee0b52d40) at mini-runtime.c:3439:12 [opt]
    frame #16: 0x000000010f2d0968 mono`mono_runtime_invoke_checked [inlined] do_runtime_invoke(method=0x00007fb782c17d78, obj=0x0000000000000000, params=0x00007ffee0b52d08, exc=0x0000000000000000, error=0x00007ffee0b52d40) at object.c:3056:11 [opt]
    frame #17: 0x000000010f2d0932 mono`mono_runtime_invoke_checked(method=0x00007fb782c17d78, obj=0x0000000000000000, params=0x00007ffee0b52d08, error=0x00007ffee0b52d40) at object.c:3224 [opt]
    frame #18: 0x000000010f2d7e75 mono`mono_runtime_exec_main_checked [inlined] do_exec_main_checked(method=0x00007fb782c17d78, args=, error=0x00007ffee0b52d40) at object.c:0 [opt]
    frame #19: 0x000000010f2d7e39 mono`mono_runtime_exec_main_checked(method=0x00007fb782c17d78, args=, error=0x00007ffee0b52d40) at object.c:5395 [opt]
    frame #20: 0x000000010f2d7f05 mono`mono_runtime_run_main_checked(method=, argc=, argv=, error=) at object.c:4731:9 [opt] [artificial]
    frame #21: 0x000000010f11bdcc mono`mono_jit_exec at driver.c:1398:13 [opt]
    frame #22: 0x000000010f11bdbe mono`mono_jit_exec(domain=, assembly=, argc=2, argv=0x00007ffee0b530d0) at driver.c:1343 [opt]
    frame #23: 0x000000010f11ef7f mono`mono_main [inlined] main_thread_handler(user_data=) at driver.c:1480:3 [opt]
    frame #24: 0x000000010f11eefa mono`mono_main(argc=, argv=) at driver.c:2778 [opt]
    frame #25: 0x000000010f0ae65b mono`main [inlined] mono_main_with_options(argc=, argv=) at main.c:54:9 [opt]
    frame #26: 0x000000010f0ae644 mono`main(argc=3, argv=) at main.c:402 [opt]
    frame #27: 0x00007fff20458f5d libdyld.dylib`start + 1
    frame #28: 0x00007fff20458f5d libdyld.dylib`start + 1
  thread #2, name = 'SGen worker'
    frame #0: 0x00007fff2040acde libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff2043de49 libsystem_pthread.dylib`_pthread_cond_wait + 1298
    frame #2: 0x000000010f393c13 mono`thread_func [inlined] mono_os_cond_wait(cond=, mutex=) at mono-os-mutex.h:219:8 [opt]
    frame #3: 0x000000010f393bf4 mono`thread_func at sgen-thread-pool.c:167 [opt]
    frame #4: 0x000000010f393bf4 mono`thread_func(data=0x0000000000000000) at sgen-thread-pool.c:198 [opt]
    frame #5: 0x00007fff2043d8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #6: 0x00007fff20439443 libsystem_pthread.dylib`thread_start + 15
  thread #3, name = 'Crash Report Leader'
    frame #0: 0x00007fff2040a4ca libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff2043b2ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20439192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x00007fff20457a36 libdyld.dylib`LockHelper::LockHelper() + 16
    frame #4: 0x00007fff20456723 libdyld.dylib`dladdr + 130
    frame #5: 0x000000010f3c7f39 mono`monoeg_g_module_address(addr=0x000000011ee53b7c, file_name="\x88\x88d\U00000002", file_name_len=256, file_base=0x0000700002648248, sym_name="", sym_name_len=256, sym_addr=0x0000700002648240) at gmodule-unix.c:90:12 [opt]
    frame #6: 0x000000010f15fc77 mono`summarize_frame [inlined] mono_get_portable_ip(in_ip=4813306748, out_ip=, out_offset=, out_module=, out_name=0x0000000000000000) at mini-exceptions.c:1552:21 [opt]
    frame #7: 0x000000010f15fc2e mono`summarize_frame(frame=0x0000700002648810, ctx=, data=0x00007000026489e8) at mini-exceptions.c:1683 [opt]
    frame #8: 0x000000010f15b4fd mono`mono_walk_stack_full(func=(mono`summarize_frame at mini-exceptions.c:1673), start_ctx=, domain=0x00007fb782c16050, jit_tls=0x00007fb783824600, lmf=0x00007fb784809560, unwind_options=MONO_UNWIND_LOOKUP_IL_OFFSET, user_data=0x00007000026489e8, crash_context=1) at mini-exceptions.c:1382:7 [opt]
    frame #9: 0x000000010f1595c4 mono`mono_summarize_managed_stack(out=0x000000011269a000) at mini-exceptions.c:1749:2 [opt]
    frame #10: 0x000000010f2fd72b mono`summarizer_state_wait_and_term at threads.c:7153:3 [opt]
    frame #11: 0x000000010f2fd6c0 mono`summarizer_state_wait_and_term(caller_tid=, state=0x000000010f4a0938, out=0x00007ffee0b51a90, working_mem="", provided_size=10000000, originator_summary=0x000000011269a000) at threads.c:7211 [opt]
    frame #12: 0x000000010f2fb610 mono`summarizer_leader at threads.c:6839:4 [opt]
    frame #13: 0x000000010f2fc6e3 mono`start_wrapper_internal(start_info=0x0000000000000000, stack_ptr=) at threads.c:1251:3 [opt]
    frame #14: 0x000000010f2fc52e mono`start_wrapper(data=0x00007fb784809630) at threads.c:1326:8 [opt]
    frame #15: 0x00007fff2043d8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #16: 0x00007fff20439443 libsystem_pthread.dylib`thread_start + 15
  thread #4, name = 'Finalizer'
    frame #0: 0x00007fff2040a4ca libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff2043b2ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20439192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x00007fff20457a36 libdyld.dylib`LockHelper::LockHelper() + 16
    frame #4: 0x00007fff20456723 libdyld.dylib`dladdr + 130
    frame #5: 0x000000010f3c7f39 mono`monoeg_g_module_address(addr=0x000000010f159658, file_name="", file_name_len=256, file_base=0x000070000284b1d0, sym_name="", sym_name_len=256, sym_addr=0x000070000284b1c8) at gmodule-unix.c:90:12 [opt]
    frame #6: 0x000000010f159706 mono`mono_summarize_unmanaged_stack [inlined] mono_get_portable_ip(in_ip=4548040280, out_ip=, out_offset=, out_module=, out_name=) at mini-exceptions.c:1552:21 [opt]
    frame #7: 0x000000010f1596b8 mono`mono_summarize_unmanaged_stack(out=) at mini-exceptions.c:1778 [opt]
    frame #8: 0x000000010f2fbc8d mono`mono_threads_summarize_execute_internal [inlined] mono_threads_summarize_native_self(out=0x0000000112710000, ctx=0x000070000284b788) at threads.c:6452:2 [opt]
    frame #9: 0x000000010f2fbc48 mono`mono_threads_summarize_execute_internal(ctx=, out=0x000070000284b920, hashes=0x000070000284b910, silent=, working_mem=0x0000000000000000, provided_size=0, this_thread_controls=0) at threads.c:7307 [opt]
    frame #10: 0x000000010f2fb907 mono`mono_threads_summarize_execute(ctx=0x000070000284b788, out=0x000070000284b920, hashes=0x000070000284b910, silent=0, working_mem=0x0000000000000000, provided_size=0) at threads.c:7361:11 [opt]
    frame #11: 0x000000010f1c1720 mono`sigterm_signal_handler(_dummy=15, _info=0x000070000284bd78, context=0x000070000284bde0) at mini-posix.c:259:8 [opt]
    frame #12: 0x00007fff20482d7d libsystem_platform.dylib`_sigtramp + 29
    frame #13: 0x00007fff204082f7 libsystem_kernel.dylib`semaphore_wait_trap + 11
    frame #14: 0x000000010f340b2b mono`finalizer_thread [inlined] mono_os_sem_wait(sem=, flags=MONO_SEM_FLAGS_ALERTABLE) at mono-os-semaphore.h:85:8 [opt]
    frame #15: 0x000000010f340b20 mono`finalizer_thread at mono-coop-semaphore.h:41 [opt]
    frame #16: 0x000000010f340b06 mono`finalizer_thread(unused=0x0000000000000000) at gc.c:970 [opt]
    frame #17: 0x000000010f2fc6e3 mono`start_wrapper_internal(start_info=0x0000000000000000, stack_ptr=) at threads.c:1251:3 [opt]
    frame #18: 0x000000010f2fc52e mono`start_wrapper(data=0x00007fb78480ee20) at threads.c:1326:8 [opt]
    frame #19: 0x00007fff2043d8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #20: 0x00007fff20439443 libsystem_pthread.dylib`thread_start + 15
  thread #5
    frame #0: 0x00007fff2040995e libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff2043a4c1 libsystem_pthread.dylib`_pthread_wqthread + 414
    frame #2: 0x00007fff2043942f libsystem_pthread.dylib`start_wqthread + 15

@lambdageek lambdageek marked this pull request as ready for review June 9, 2021 19:57
@lambdageek lambdageek requested a review from thaystg as a code owner June 9, 2021 19:57
@imhameed
Contributor

> The "let's have a crash in dlopen" test is failing because we call `dladdr` to get the symbols for the stack traces, and it's deadlocking with the crashed thread, which is holding the lock internal to dyld:
>
> We're quite stuck here:
>
> 1. To get the managed and unmanaged stacks we need to call `dladdr` in order to get the module and symbol addresses for each frame;
>
> 2. If the crash originator crashed in dyld, it may hold the underlying lock.

dyld2 uses a recursive mutex for this; maybe you could do message passing between the crash handler thread and the thread with an active signal handler running on it?

We could also maintain a local table of pointers to Mach-O image headers using `_dyld_register_func_for_add_image` and `_dyld_register_func_for_remove_image`, although then we'd have to implement Mach-O header parsing and symbol table parsing ourselves.

It's straight-line code with two early exits; a state machine is overkill.
@lambdageek
Member Author

> dyld2 uses a recursive mutex for this; maybe you could do message passing between the crash handler thread and the thread with an active signal handler running on it?

Yeah, that's an idea. It will help if we think crashes often happen while we're calling the dyld functions.

It won't help if there's a dlopen, say, in progress on another thread when the crash happens; in that case the dyld lock is held by a third thread that is neither the crash originator nor the crash report leader.

> We could also maintain a local table of pointers to Mach-O image headers using `_dyld_register_func_for_add_image` and `_dyld_register_func_for_remove_image`, although then we'd have to implement Mach-O header parsing and symbol table parsing ourselves.

At that point we're better off with an external crash dump tool, like coreclr uses.

@lambdageek
Member Author

Removed the state machine. Could I get another review?

@jaykrell
Contributor

Do you really need dladdr?
Is it only against mono?
Can you look at your own image yourself?
Or build the data mostly at build time?
I mean, I don't think you need to register with dyld or parse Mach-O, though either sounds easy enough.
I think as mono loads images it can record their paths and ranges.
Given the locking and querying aspect, one idea: have a fixed-size array, 256 entries, and when it fills up, just leave it alone.
But yeah, doing this stuff in-proc has always been a bad idea.

Contributor

@CoffeeFlux CoffeeFlux left a comment


While I'm a bit scared to sign off on this one, generally LGTM.

LEADER_RESPONSE_STACKS_WALKED = 3,
};

#undef LEADER_DEBUG
Contributor


Is this actually defined elsewhere? Wouldn't you want to put the opposite of this (and then comment it out) to aid with debugging?

Member Author


It's just a placeholder for if someone wants to build a runtime to debug this issue.

Member Author


Changed to a commented-out define.


mono_atomic_store_i32 (&summarizer_leader_data.leader_running, 1);
while (TRUE) {
/*case LEADER_STATE_READY:*/ {
Contributor


I'd consider just putting a normal comment instead of leaving an artifact of the previous state machine design.

Member Author


fixed

@lambdageek
Member Author

@monojenkins build failed

@lambdageek
Member Author

@monojenkins backport to 2020-02

@lambdageek
Member Author

@monojenkins build Linux i386

@lambdageek
Member Author

@monojenkins build Linux i386

@lambdageek
Member Author

Manually backported to 2020-02: #21126

@lambdageek lambdageek merged commit df31545 into mono:main Jun 23, 2021
@lambdageek lambdageek deleted the merp-dedicated-thread branch June 23, 2021 18:45