[2018-10] Native Crash Stability Fix Batch #12571
We add a checked build mode that asserts whenever Mono calls malloc inside the crash reporter, turning risky allocations into assertion failures. This is useful for automated testing because the double abort often manifests as an indefinite hang: if it happens before the thread-dumping supervisor process starts, or after it exits, the crash reporter hangs.
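A minimal sketch of the checked-mode guard described above. The flag and function names (`checked_build_enabled`, `crash_reporter_active`, `checked_malloc`) are illustrative, not Mono's actual identifiers: the idea is that in a checked build, any allocation made while the crash reporter is active fails an assertion immediately instead of risking a hang later.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only. */
static int checked_build_enabled = 1;   /* enabled in checked builds only */
static int crash_reporter_active = 0;   /* set on entry to the crash reporter */

void *
checked_malloc (size_t size)
{
	/* In a checked build, allocating inside the crash reporter is a bug:
	 * fail loudly here rather than hanging indefinitely later. */
	assert (!(checked_build_enabled && crash_reporter_active));
	return malloc (size);
}
```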
Threads without domains that take a SIGSEGV end up in this handler, and it's not safe to call this function with a NULL domain. See the crash below:

```
* thread mono#1, name = 'tid_307', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x10eff40f8)
  * frame #0: 0x000000010e1510d9 mono-sgen`mono_threads_summarize_execute(ctx=0x0000000000000000, out=0x0000001000000000, hashes=0x0000100000100000, silent=4096, mem="", provided_size=2199023296512) at threads.c:6414
    frame mono#1: 0x000000010e152092 mono-sgen`mono_threads_summarize(ctx=0x000000010effda00, out=0x000000010effdba0, hashes=0x000000010effdb90, silent=0, signal_handler_controller=1, mem=0x0000000000000000, provided_size=0) at threads.c:6508
    frame mono#2: 0x000000010df7c69f mono-sgen`dump_native_stacktrace(signal="SIGSEGV", ctx=0x000000010effef48) at mini-posix.c:1026
    frame mono#3: 0x000000010df7c37f mono-sgen`mono_dump_native_crash_info(signal="SIGSEGV", ctx=0x000000010effef48, info=0x000000010effeee0) at mini-posix.c:1147
    frame mono#4: 0x000000010de720a9 mono-sgen`mono_handle_native_crash(signal="SIGSEGV", ctx=0x000000010effef48, info=0x000000010effeee0) at mini-exceptions.c:3227
    frame mono#5: 0x000000010dd6ac0d mono-sgen`mono_sigsegv_signal_handler_debug(_dummy=11, _info=0x000000010effeee0, context=0x000000010effef48, debug_fault_addr=0xffffffffffffffff) at mini-runtime.c:3574
    frame mono#6: 0x000000010dd6a8d3 mono-sgen`mono_sigsegv_signal_handler(_dummy=11, _info=0x000000010effeee0, context=0x000000010effef48) at mini-runtime.c:3612
    frame mono#7: 0x00007fff73dbdf5a libsystem_platform.dylib`_sigtramp + 26
    frame mono#8: 0x0000000110bb81c1
    frame mono#9: 0x000000011085ffe1
    frame mono#10: 0x000000010dd6d4f3 mono-sgen`mono_jit_runtime_invoke(method=0x00007faae4f01fe8, obj=0x0000000000000000, params=0x00007ffee1eaa180, exc=0x00007ffee1ea9f08, error=0x00007ffee1eaa250) at mini-runtime.c:3215
    frame mono#11: 0x000000010e11509d mono-sgen`do_runtime_invoke(method=0x00007faae4f01fe8, obj=0x0000000000000000, params=0x00007ffee1eaa180, exc=0x0000000000000000, error=0x00007ffee1eaa250) at object.c:2977
    frame mono#12: 0x000000010e10d961 mono-sgen`mono_runtime_invoke_checked(method=0x00007faae4f01fe8, obj=0x0000000000000000, params=0x00007ffee1eaa180, error=0x00007ffee1eaa250) at object.c:3145
    frame mono#13: 0x000000010e11aa58 mono-sgen`do_exec_main_checked(method=0x00007faae4f01fe8, args=0x000000010f0003e8, error=0x00007ffee1eaa250) at object.c:5042
    frame mono#14: 0x000000010e118803 mono-sgen`mono_runtime_exec_main_checked(method=0x00007faae4f01fe8, args=0x000000010f0003e8, error=0x00007ffee1eaa250) at object.c:5138
    frame mono#15: 0x000000010e118856 mono-sgen`mono_runtime_run_main_checked(method=0x00007faae4f01fe8, argc=2, argv=0x00007ffee1eaa760, error=0x00007ffee1eaa250) at object.c:4599
    frame mono#16: 0x000000010de1db2f mono-sgen`mono_jit_exec_internal(domain=0x00007faae4f00860, assembly=0x00007faae4c02ab0, argc=2, argv=0x00007ffee1eaa760) at driver.c:1298
    frame mono#17: 0x000000010de1d95d mono-sgen`mono_jit_exec(domain=0x00007faae4f00860, assembly=0x00007faae4c02ab0, argc=2, argv=0x00007ffee1eaa760) at driver.c:1257
    frame mono#18: 0x000000010de2257f mono-sgen`main_thread_handler(user_data=0x00007ffee1eaa6a0) at driver.c:1375
    frame mono#19: 0x000000010de20852 mono-sgen`mono_main(argc=3, argv=0x00007ffee1eaa758) at driver.c:2551
    frame mono#20: 0x000000010dd56d7e mono-sgen`mono_main_with_options(argc=3, argv=0x00007ffee1eaa758) at main.c:50
    frame mono#21: 0x000000010dd5638d mono-sgen`main(argc=3, argv=0x00007ffee1eaa758) at main.c:406
    frame mono#22: 0x00007fff73aaf015 libdyld.dylib`start + 1
    frame mono#23: 0x00007fff73aaf015 libdyld.dylib`start + 1
  thread mono#2, name = 'SGen worker'
    frame #0: 0x000000010e2afd77 mono-sgen`mono_get_hazardous_pointer(pp=0x0000000000000178, hp=0x000000010ef87618, hazard_index=0) at hazard-pointer.c:208
    frame mono#1: 0x000000010e0b28e1 mono-sgen`mono_jit_info_table_find_internal(domain=0x0000000000000000, addr=0x00007fff73bffa16, try_aot=1, allow_trampolines=1) at jit-info.c:304
    frame mono#2: 0x000000010dd6aa5f mono-sgen`mono_sigsegv_signal_handler_debug(_dummy=11, _info=0x000070000fb81c58, context=0x000070000fb81cc0, debug_fault_addr=0x000000010e28fb20) at mini-runtime.c:3540
    frame mono#3: 0x000000010dd6a8d3 mono-sgen`mono_sigsegv_signal_handler(_dummy=11, _info=0x000070000fb81c58, context=0x000070000fb81cc0) at mini-runtime.c:3612
    frame mono#4: 0x00007fff73dbdf5a libsystem_platform.dylib`_sigtramp + 26
    frame mono#5: 0x00007fff73bffa17 libsystem_kernel.dylib`__psynch_cvwait + 11
    frame mono#6: 0x00007fff73dc8589 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame mono#7: 0x000000010e28d76d mono-sgen`mono_os_cond_wait(cond=0x000000010e44c9d8, mutex=0x000000010e44c998) at mono-os-mutex.h:168
    frame mono#8: 0x000000010e28df4f mono-sgen`get_work(worker_index=0, work_context=0x000070000fb81ee0, do_idle=0x000070000fb81ed4, job=0x000070000fb81ec8) at sgen-thread-pool.c:165
    frame mono#9: 0x000000010e28d2cb mono-sgen`thread_func(data=0x0000000000000000) at sgen-thread-pool.c:196
    frame mono#10: 0x00007fff73dc7661 libsystem_pthread.dylib`_pthread_body + 340
    frame mono#11: 0x00007fff73dc750d libsystem_pthread.dylib`_pthread_start + 377
    frame mono#12: 0x00007fff73dc6bf9 libsystem_pthread.dylib`thread_start + 13
  thread mono#3, name = 'Finalizer'
    frame #0: 0x00007fff73bf6246 libsystem_kernel.dylib`semaphore_wait_trap + 10
    frame mono#1: 0x000000010e1d9c0a mono-sgen`mono_os_sem_wait(sem=0x000000010e43e400, flags=MONO_SEM_FLAGS_ALERTABLE) at mono-os-semaphore.h:84
    frame mono#2: 0x000000010e1d832d mono-sgen`mono_coop_sem_wait(sem=0x000000010e43e400, flags=MONO_SEM_FLAGS_ALERTABLE) at mono-coop-semaphore.h:41
    frame mono#3: 0x000000010e1da787 mono-sgen`finalizer_thread(unused=0x0000000000000000) at gc.c:920
    frame mono#4: 0x000000010e152919 mono-sgen`start_wrapper_internal(start_info=0x0000000000000000, stack_ptr=0x000070000fd85000) at threads.c:1178
    frame mono#5: 0x000000010e1525b6 mono-sgen`start_wrapper(data=0x00007faae4f31bd0) at threads.c:1238
    frame mono#6: 0x00007fff73dc7661 libsystem_pthread.dylib`_pthread_body + 340
    frame mono#7: 0x00007fff73dc750d libsystem_pthread.dylib`_pthread_start + 377
    frame mono#8: 0x00007fff73dc6bf9 libsystem_pthread.dylib`thread_start + 13
  thread mono#4
    frame #0: 0x00007fff73c0028a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame mono#1: 0x00007fff73dc7009 libsystem_pthread.dylib`_pthread_wqthread + 1035
    frame mono#2: 0x00007fff73dc6be9 libsystem_pthread.dylib`start_wqthread + 13
(lldb)
```
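The crash above faults inside the summarizer because a thread that segfaults before attaching has no domain. A minimal sketch of the defensive check implied here, with illustrative names and types (not Mono's real signatures): the signal-handler path should bail out rather than proceed with a NULL domain.

```c
#include <stddef.h>

/* Illustrative stand-in for Mono's domain type. */
typedef struct { int id; } MonoDomain;

/* Returns 1 if the thread was summarized, 0 if it had to be skipped.
 * A thread that faults before attaching to the runtime has no domain,
 * so calling deeper into the summarizer with NULL would itself crash. */
static int
summarize_thread (MonoDomain *domain)
{
	if (domain == NULL)
		return 0; /* skip: can't safely summarize a domainless thread */
	/* ... walk the thread's stack and record frames here ... */
	return 1;
}
```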
Each frame that prints grows the stack by the size of buff. In practice, clang often fails to coalesce some of these buffers, leading to stack frames around 30 KB. This was noticed via a series of hard-to-diagnose segfaults, during the crash reporting stress test, on stacks that otherwise looked fine. This change shrinks those stacks to roughly a tenth of their previous size. It doesn't seem to truncate the crash reporter messages anywhere (we may need to shrink other "max name length" fields), and the buffer size isn't mission-critical anywhere else.
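A sketch of the idea behind the fix, with assumed sizes and names (`FRAME_NAME_MAX`, `format_frame` are not Mono's identifiers): instead of a multi-kilobyte `char buff[]` local in every frame-printing call, which the compiler may fail to merge and which inflates each stack frame, format into a small fixed-size buffer and let `snprintf` truncate.

```c
#include <stdio.h>
#include <string.h>

/* Assumed "max name length"; the real value before the fix was far larger. */
#define FRAME_NAME_MAX 256

/* Format one frame description into a caller-provided buffer.
 * snprintf truncates rather than overflowing the smaller buffer,
 * so a long method name degrades gracefully instead of crashing. */
static void
format_frame (char *out, size_t out_size, const char *method_name)
{
	snprintf (out, out_size, "  at %s", method_name);
}
```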
@monojenkins build deb with monolite
@monojenkins build failed
I'm having difficulty reproducing the failure. The logs on CI make it seem like the gdb phase runs well after the rest of the pipeline closes. Adding some assertions locally, I can see some weird behavior around waitpid, but I can't force it to act anywhere near what the crash is showing.
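For context on the waitpid behavior being probed, here's a small sketch (not Mono's code; `wait_for_dumper` is an illustrative name) of the pattern a supervisor needs when reaping the gdb/lldb dumper child: retry on EINTR, since a stray signal otherwise makes the wait return early and the dump phase can appear to run after everything else has closed.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for the dumper child to exit, retrying if a signal interrupts us.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
static int
wait_for_dumper (pid_t pid)
{
	int status;
	pid_t res;
	do {
		res = waitpid (pid, &status, 0);
	} while (res == -1 && errno == EINTR);
	if (res == -1)
		return -1;
	return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```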
The logs here imply a crash happening after that point, but the lack of the "managed stacktrace" being printed means it is crashing before the dumper returns. We don't see any gdb or lldb output, but that may be because all lldb/gdb output gets moved to the end of the test output file on Jenkins; the same happened with the other jobs that seemed to finish successfully. Since we're not seeing the JSON file printed, it must not have been created; the XML file also seems not to have been made. The log shows the run completing, which contrasts with the fact that we can't find the relevant files. I'm going to log those file paths in the output and see if CI shows anything weird.
I've run this around 3,000 times locally on OSX x64 and not seen any failures.
This contains the commits from #12125, #12126, and #12518.