Start a dedicated thread for MERP crash reporting #21096
At process startup, start a separate thread that is attached to the runtime and can collect crash reports. Crashing threads will wake it and wait for it to collect the crash reports.
We need to coordinate the originator and the leader in a few places. The leader needs to pause after collecting the thread IDs, before suspending the non-originator threads, and again while the originator is dumping its own stack. The originator needs to wait for the leader to collect the thread IDs and to tell it its assigned slot. Then it tells the leader to suspend the others, dumps its own memory, tells the leader to dump the whole crash report, and waits for the leader to reply when it's done.
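The handshake described above can be sketched with a pair of pipes, since `read` and `write` are async-signal-safe. This is a minimal illustration, not the code in threads.c: the message codes, the `Handshake` struct, and every function name below are hypothetical, and the actual work (collecting thread IDs, suspending threads, walking stacks) is reduced to comments.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical message codes; the real values in threads.c may differ. */
enum { REQ_COLLECT_IDS = 1, REQ_SUSPEND_OTHERS = 2, REQ_DUMP_REPORT = 3 };
enum { RESP_SLOT_ASSIGNED = 1, RESP_REPORT_DONE = 2 };

typedef struct {
	int to_leader[2];     /* originator writes [1], leader reads [0] */
	int to_originator[2]; /* leader writes [1], originator reads [0] */
	int leader_steps;     /* demo only: counts completed leader phases */
} Handshake;

static int send_msg (int fd, uint8_t msg)
{
	ssize_t n;
	do { n = write (fd, &msg, 1); } while (n < 0 && errno == EINTR);
	return n == 1 ? 0 : -1;
}

static int recv_msg (int fd)
{
	uint8_t msg;
	ssize_t n;
	do { n = read (fd, &msg, 1); } while (n < 0 && errno == EINTR);
	return n == 1 ? msg : -1;
}

static void *leader_main (void *arg)
{
	Handshake *hs = arg;
	if (recv_msg (hs->to_leader[0]) != REQ_COLLECT_IDS)
		return NULL;
	/* Phase 1: collect thread IDs, pick the originator's slot. */
	hs->leader_steps++;
	send_msg (hs->to_originator[1], RESP_SLOT_ASSIGNED);
	/* Pause until the originator asks for the other threads to be suspended. */
	if (recv_msg (hs->to_leader[0]) != REQ_SUSPEND_OTHERS)
		return NULL;
	/* Phase 2: suspend the non-originator threads. */
	hs->leader_steps++;
	/* Pause again while the originator dumps its own stack. */
	if (recv_msg (hs->to_leader[0]) != REQ_DUMP_REPORT)
		return NULL;
	/* Phase 3: write out the whole crash report. */
	hs->leader_steps++;
	send_msg (hs->to_originator[1], RESP_REPORT_DONE);
	return NULL;
}

/* Runs on the crashing thread; uses only read/write. */
static int originator_report_crash (Handshake *hs)
{
	if (send_msg (hs->to_leader[1], REQ_COLLECT_IDS) != 0)
		return -1;
	if (recv_msg (hs->to_originator[0]) != RESP_SLOT_ASSIGNED)
		return -1;
	if (send_msg (hs->to_leader[1], REQ_SUSPEND_OTHERS) != 0)
		return -1;
	/* The originator would dump its own stack into its slot here. */
	if (send_msg (hs->to_leader[1], REQ_DUMP_REPORT) != 0)
		return -1;
	return recv_msg (hs->to_originator[0]) == RESP_REPORT_DONE ? 0 : -1;
}

/* Demo driver: returns 0 when all three leader phases ran in order. */
int run_handshake_demo (void)
{
	Handshake hs = { .leader_steps = 0 };
	pthread_t leader;
	if (pipe (hs.to_leader) != 0 || pipe (hs.to_originator) != 0)
		return -1;
	if (pthread_create (&leader, NULL, leader_main, &hs) != 0)
		return -1;
	int rc = originator_report_crash (&hs);
	close (hs.to_leader[1]); /* unblocks the leader if the handshake failed */
	pthread_join (leader, NULL);
	return (rc == 0 && hs.leader_steps == 3) ? 0 : -1;
}
```

Because a pipe preserves message order, the originator does not need an acknowledgement for the suspend request before it starts dumping its own stack. The leader simply blocks on the next `read` until the dump request arrives.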
either because the crash leader crashed, or because the process got a SIGTERM and it arrived on the crash leader thread
e3555c5 to
21e65ec
Compare
The "let's have a crash in We're quite stuck here:
So the question is: Do we think it is likely that important crashes are happening because corrupted arguments are passed to
Sample stack trace of a deadlock
* thread #1, name = 'tid_103', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff20408cce libsystem_kernel.dylib`read + 10
frame #1: 0x000000010f2fbf27 mono`mono_threads_summarize_execute_internal at threads.c:6694:13 [opt]
frame #2: 0x000000010f2fbf0a mono`mono_threads_summarize_execute_internal(ctx=, out=0x00007ffee0b51a90, hashes=0x00007ffee0b51aa0, silent=, working_mem="", provided_size=10000000, this_thread_controls=1) at threads.c:7329 [opt]
frame #3: 0x000000010f2fc1e9 mono`mono_threads_summarize(ctx=0x00007fb783824cb0, out=, hashes=, silent=0, signal_handler_controller=, mem=, provided_size=10000000) at threads.c:7410:15 [opt]
frame #4: 0x000000010f1c2504 mono`mono_dump_native_crash_info at mini-posix.c:987:13 [opt]
frame #5: 0x000000010f1c22f2 mono`mono_dump_native_crash_info(signal="segv", mctx=0x00007fb783824cb0, info=) at mini-posix.c:1105 [opt]
frame #6: 0x000000010f15e48e mono`mono_handle_native_crash(signal="segv", mctx=0x00007fb783824cb0, info=0x0000000000000000) at mini-exceptions.c:3468:2 [opt]
frame #7: 0x000000010f3c92a7 mono`altstack_handle_and_restore.cold.1 at exceptions-amd64.c:880:4 [opt]
frame #8: 0x000000010f1bcfbe mono`altstack_handle_and_restore(ctx=, obj=, flags=) at exceptions-amd64.c:879:7 [opt]
frame #9: 0x000000011ee53b7c dyld`_platform_strchr$VARIANT$Base + 28
frame #10: 0x000000011ee0d954 dyld`dlopen_internal + 206
frame #11: 0x00007fff2045acb4 libdyld.dylib`dlopen_internal(char const*, int, void*) + 185
frame #12: 0x00007fff2044909e libdyld.dylib`dlopen + 28
frame #13: 0x000000010f700420
frame #14: 0x000000010f6f4521
frame #15: 0x000000010f0bdce2 mono`mono_jit_runtime_invoke(method=0x00007fb782c17d78, obj=, params=0x00007ffee0b52d08, exc=, error=0x00007ffee0b52d40) at mini-runtime.c:3439:12 [opt]
frame #16: 0x000000010f2d0968 mono`mono_runtime_invoke_checked [inlined] do_runtime_invoke(method=0x00007fb782c17d78, obj=0x0000000000000000, params=0x00007ffee0b52d08, exc=0x0000000000000000, error=0x00007ffee0b52d40) at object.c:3056:11 [opt]
frame #17: 0x000000010f2d0932 mono`mono_runtime_invoke_checked(method=0x00007fb782c17d78, obj=0x0000000000000000, params=0x00007ffee0b52d08, error=0x00007ffee0b52d40) at object.c:3224 [opt]
frame #18: 0x000000010f2d7e75 mono`mono_runtime_exec_main_checked [inlined] do_exec_main_checked(method=0x00007fb782c17d78, args=, error=0x00007ffee0b52d40) at object.c:0 [opt]
frame #19: 0x000000010f2d7e39 mono`mono_runtime_exec_main_checked(method=0x00007fb782c17d78, args=, error=0x00007ffee0b52d40) at object.c:5395 [opt]
frame #20: 0x000000010f2d7f05 mono`mono_runtime_run_main_checked(method=, argc=, argv=, error=) at object.c:4731:9 [opt] [artificial]
frame #21: 0x000000010f11bdcc mono`mono_jit_exec at driver.c:1398:13 [opt]
frame #22: 0x000000010f11bdbe mono`mono_jit_exec(domain=, assembly=, argc=2, argv=0x00007ffee0b530d0) at driver.c:1343 [opt]
frame #23: 0x000000010f11ef7f mono`mono_main [inlined] main_thread_handler(user_data=) at driver.c:1480:3 [opt]
frame #24: 0x000000010f11eefa mono`mono_main(argc=, argv=) at driver.c:2778 [opt]
frame #25: 0x000000010f0ae65b mono`main [inlined] mono_main_with_options(argc=, argv=) at main.c:54:9 [opt]
frame #26: 0x000000010f0ae644 mono`main(argc=3, argv=) at main.c:402 [opt]
frame #27: 0x00007fff20458f5d libdyld.dylib`start + 1
frame #28: 0x00007fff20458f5d libdyld.dylib`start + 1
thread #2, name = 'SGen worker'
frame #0: 0x00007fff2040acde libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff2043de49 libsystem_pthread.dylib`_pthread_cond_wait + 1298
frame #2: 0x000000010f393c13 mono`thread_func [inlined] mono_os_cond_wait(cond=, mutex=) at mono-os-mutex.h:219:8 [opt]
frame #3: 0x000000010f393bf4 mono`thread_func at sgen-thread-pool.c:167 [opt]
frame #4: 0x000000010f393bf4 mono`thread_func(data=0x0000000000000000) at sgen-thread-pool.c:198 [opt]
frame #5: 0x00007fff2043d8fc libsystem_pthread.dylib`_pthread_start + 224
frame #6: 0x00007fff20439443 libsystem_pthread.dylib`thread_start + 15
thread #3, name = 'Crash Report Leader'
frame #0: 0x00007fff2040a4ca libsystem_kernel.dylib`__psynch_mutexwait + 10
frame #1: 0x00007fff2043b2ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
frame #2: 0x00007fff20439192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
frame #3: 0x00007fff20457a36 libdyld.dylib`LockHelper::LockHelper() + 16
frame #4: 0x00007fff20456723 libdyld.dylib`dladdr + 130
frame #5: 0x000000010f3c7f39 mono`monoeg_g_module_address(addr=0x000000011ee53b7c, file_name="\x88\x88d\U00000002", file_name_len=256, file_base=0x0000700002648248, sym_name="", sym_name_len=256, sym_addr=0x0000700002648240) at gmodule-unix.c:90:12 [opt]
frame #6: 0x000000010f15fc77 mono`summarize_frame [inlined] mono_get_portable_ip(in_ip=4813306748, out_ip=, out_offset=, out_module=, out_name=0x0000000000000000) at mini-exceptions.c:1552:21 [opt]
frame #7: 0x000000010f15fc2e mono`summarize_frame(frame=0x0000700002648810, ctx=, data=0x00007000026489e8) at mini-exceptions.c:1683 [opt]
frame #8: 0x000000010f15b4fd mono`mono_walk_stack_full(func=(mono`summarize_frame at mini-exceptions.c:1673), start_ctx=, domain=0x00007fb782c16050, jit_tls=0x00007fb783824600, lmf=0x00007fb784809560, unwind_options=MONO_UNWIND_LOOKUP_IL_OFFSET, user_data=0x00007000026489e8, crash_context=1) at mini-exceptions.c:1382:7 [opt]
frame #9: 0x000000010f1595c4 mono`mono_summarize_managed_stack(out=0x000000011269a000) at mini-exceptions.c:1749:2 [opt]
frame #10: 0x000000010f2fd72b mono`summarizer_state_wait_and_term at threads.c:7153:3 [opt]
frame #11: 0x000000010f2fd6c0 mono`summarizer_state_wait_and_term(caller_tid=, state=0x000000010f4a0938, out=0x00007ffee0b51a90, working_mem="", provided_size=10000000, originator_summary=0x000000011269a000) at threads.c:7211 [opt]
frame #12: 0x000000010f2fb610 mono`summarizer_leader at threads.c:6839:4 [opt]
frame #13: 0x000000010f2fc6e3 mono`start_wrapper_internal(start_info=0x0000000000000000, stack_ptr=) at threads.c:1251:3 [opt]
frame #14: 0x000000010f2fc52e mono`start_wrapper(data=0x00007fb784809630) at threads.c:1326:8 [opt]
frame #15: 0x00007fff2043d8fc libsystem_pthread.dylib`_pthread_start + 224
frame #16: 0x00007fff20439443 libsystem_pthread.dylib`thread_start + 15
thread #4, name = 'Finalizer'
frame #0: 0x00007fff2040a4ca libsystem_kernel.dylib`__psynch_mutexwait + 10
frame #1: 0x00007fff2043b2ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
frame #2: 0x00007fff20439192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
frame #3: 0x00007fff20457a36 libdyld.dylib`LockHelper::LockHelper() + 16
frame #4: 0x00007fff20456723 libdyld.dylib`dladdr + 130
frame #5: 0x000000010f3c7f39 mono`monoeg_g_module_address(addr=0x000000010f159658, file_name="", file_name_len=256, file_base=0x000070000284b1d0, sym_name="", sym_name_len=256, sym_addr=0x000070000284b1c8) at gmodule-unix.c:90:12 [opt]
frame #6: 0x000000010f159706 mono`mono_summarize_unmanaged_stack [inlined] mono_get_portable_ip(in_ip=4548040280, out_ip=, out_offset=, out_module=, out_name=) at mini-exceptions.c:1552:21 [opt]
frame #7: 0x000000010f1596b8 mono`mono_summarize_unmanaged_stack(out=) at mini-exceptions.c:1778 [opt]
frame #8: 0x000000010f2fbc8d mono`mono_threads_summarize_execute_internal [inlined] mono_threads_summarize_native_self(out=0x0000000112710000, ctx=0x000070000284b788) at threads.c:6452:2 [opt]
frame #9: 0x000000010f2fbc48 mono`mono_threads_summarize_execute_internal(ctx=, out=0x000070000284b920, hashes=0x000070000284b910, silent=, working_mem=0x0000000000000000, provided_size=0, this_thread_controls=0) at threads.c:7307 [opt]
frame #10: 0x000000010f2fb907 mono`mono_threads_summarize_execute(ctx=0x000070000284b788, out=0x000070000284b920, hashes=0x000070000284b910, silent=0, working_mem=0x0000000000000000, provided_size=0) at threads.c:7361:11 [opt]
frame #11: 0x000000010f1c1720 mono`sigterm_signal_handler(_dummy=15, _info=0x000070000284bd78, context=0x000070000284bde0) at mini-posix.c:259:8 [opt]
frame #12: 0x00007fff20482d7d libsystem_platform.dylib`_sigtramp + 29
frame #13: 0x00007fff204082f7 libsystem_kernel.dylib`semaphore_wait_trap + 11
frame #14: 0x000000010f340b2b mono`finalizer_thread [inlined] mono_os_sem_wait(sem=, flags=MONO_SEM_FLAGS_ALERTABLE) at mono-os-semaphore.h:85:8 [opt]
frame #15: 0x000000010f340b20 mono`finalizer_thread at mono-coop-semaphore.h:41 [opt]
frame #16: 0x000000010f340b06 mono`finalizer_thread(unused=0x0000000000000000) at gc.c:970 [opt]
frame #17: 0x000000010f2fc6e3 mono`start_wrapper_internal(start_info=0x0000000000000000, stack_ptr=) at threads.c:1251:3 [opt]
frame #18: 0x000000010f2fc52e mono`start_wrapper(data=0x00007fb78480ee20) at threads.c:1326:8 [opt]
frame #19: 0x00007fff2043d8fc libsystem_pthread.dylib`_pthread_start + 224
frame #20: 0x00007fff20439443 libsystem_pthread.dylib`thread_start + 15
thread #5
frame #0: 0x00007fff2040995e libsystem_kernel.dylib`__workq_kernreturn + 10
frame #1: 0x00007fff2043a4c1 libsystem_pthread.dylib`_pthread_wqthread + 414
frame #2: 0x00007fff2043942f libsystem_pthread.dylib`start_wqthread + 15
dyld2 uses a recursive mutex for this; maybe you could do message passing between the crash handler thread and the thread with an active signal handler running on it? We could also maintain a local table of pointers to Mach-O image headers using
It's straightline code with two early exits. State machine is overkill
Yea that's an idea. That will help if we think we often have crashes when we call the dyld functions. It won't help if there's a
At that point we're better off with an external crash dump tool like coreclr.
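For illustration, the local-table idea from this exchange could look roughly like the sketch below: images are recorded from normal (non-signal) context, and the crash path resolves an address to its image with a lock-free scan instead of calling `dladdr`. Everything here is an assumption: `ImageEntry`, `image_table_add`, and `image_table_lookup` are hypothetical names, the table has a single writer, and the dyld hook that would feed it is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IMAGES 64

/* Hypothetical record for one loaded image. */
typedef struct {
	uintptr_t base;   /* load address of the image header */
	size_t size;      /* mapped size in bytes */
	const char *name; /* must outlive the table entry */
} ImageEntry;

static ImageEntry image_table[MAX_IMAGES];
static atomic_size_t image_count;

/* Called outside signal context whenever an image is loaded.
 * Assumes a single writer; entries are only ever appended. */
int image_table_add (uintptr_t base, size_t size, const char *name)
{
	size_t n = atomic_load_explicit (&image_count, memory_order_relaxed);
	if (n >= MAX_IMAGES)
		return -1;
	image_table[n] = (ImageEntry){ base, size, name };
	/* Publish the entry only after it is fully written. */
	atomic_store_explicit (&image_count, n + 1, memory_order_release);
	return 0;
}

/* Safe to call from a crash handler: takes no locks, allocates nothing. */
const ImageEntry *image_table_lookup (uintptr_t addr)
{
	size_t n = atomic_load_explicit (&image_count, memory_order_acquire);
	for (size_t i = 0; i < n; i++) {
		const ImageEntry *e = &image_table[i];
		if (addr >= e->base && addr - e->base < e->size)
			return e;
	}
	return NULL;
}
```

The lookup never takes the loader lock, so it cannot deadlock against a thread that crashed inside `dlopen`, which is the situation in the stack trace above. It only yields the containing image, not a symbol name.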
Removed the state machine. Could I get another review?
Do you really need dladdr?
CoffeeFlux
left a comment
While I'm a bit scared to sign off on this one, generally LGTM.
mono/metadata/threads.c
Outdated
	LEADER_RESPONSE_STACKS_WALKED = 3,
};

#undef LEADER_DEBUG
Is this actually defined elsewhere? Wouldn't you want to put the opposite of this (and then comment it) for aiding with debugging?
It's just a placeholder for if someone wants to build a runtime to debug this issue.
changed to a commented out define
mono/metadata/threads.c
Outdated
mono_atomic_store_i32 (&summarizer_leader_data.leader_running, 1);
while (TRUE) {
	/*case LEADER_STATE_READY:*/ {
I'd consider just putting a normal comment instead of leaving an artifact of the previous state machine design.
fixed
@monojenkins build failed
@monojenkins backport to 2020-02
@monojenkins build Linux i386
Manually backported to 2020-02: #21126
Related to #21009.
There are two scenarios:
kill -TERM <pid>), the process can receive the signal on any thread,
The crash reporter assumes that the crashing thread is either attached to the runtime, or at least mono_thread_info_current or the JIT TLS data are set for the thread. If the thread is truly foreign and it never interacted with Mono, and it crashes, both of those assumptions are false, but Mono's crash reporter signal handlers still run.
The solution from this PR is: if crash reporting is enabled, start a dedicated thread at process startup that acts as a "crash report leader". When a crash happens, the crashing thread (the crash originator) wakes the leader, and the leader collects the crash report. The crash originator does not do any work that requires being attached to the runtime or to the JIT, such as iterating over thread IDs or stack walking.