[crash] Test and fix stability of native crash pipeline(s) #12125
So I managed to get it reporting "no crash" most of the time (50 successful runs on OSX). I ran into a few issues with our sequence point code and our JIT lookup code (outside of the crash reporter path) crashing sometimes. These tests really showed just how unstable a lot of the other code along the native crash path is. It's interesting to me that, subjectively, it doesn't appear to be very unstable. When run in a loop like this, even with crash reporting compiled out of the runtime, you see a ton of failures (1 per 200-300 runs). Most of them seem non-reproducible and related to sequence point and JIT internal memory. It would be a good idea to run this "stress test" a lot and file bugs for the bug pool from it.
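A minimal sketch of what "run in a loop" can look like, assuming a POSIX system. The names `count_failed_runs`, `stress_body`, and `crash_every_tenth` are hypothetical and are not part of the Mono test suite; the real test drives the runtime itself rather than a C callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical harness: none of these names come from the Mono test suite. */
typedef void (*stress_body)(int iteration);

/* Runs `body` in a fresh child process `runs` times and returns how many
 * children did not exit cleanly (killed by a signal or non-zero status). */
static int count_failed_runs(stress_body body, int runs)
{
    int failures = 0;
    assert(body != NULL && runs >= 0);

    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid < 0) {          /* could not even start this run */
            failures++;
            continue;
        }
        if (pid == 0) {         /* child: run one iteration, then exit cleanly */
            body(i);
            _exit(0);
        }

        int status = 0;
        pid_t waited;
        do {                    /* retry if a signal interrupts the wait */
            waited = waitpid(pid, &status, 0);
        } while (waited < 0 && errno == EINTR);

        if (waited < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}

/* Example body: simulates an intermittent crash on every tenth iteration. */
static void crash_every_tenth(int iteration)
{
    if (iteration % 10 == 0)
        abort();
}
```

Isolating each run in its own process means one crash cannot poison the next iteration, which matters when the failures are rare and hard to reproduce.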
I'm going to talk to one of the MERP folks and get some kind of schema I can assert against for the output config files in this test. If we can cover the file contents, we should remove any bug surface for integration "surprises".
Really rare but still interesting crash:
Looks like condvar waiting is inherently dangerous because segfaults on OSX are sent to every thread waiting on a condvar, along with the thread that triggered the segfault. Doesn't seem to impact semaphores. I had noticed this a while back when it was suggested I use condvars for the dumper; it didn't really work. That's why we're using semaphores now.
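To illustrate the semaphore approach, here is a rough sketch of a dumper thread that is woken from a signal handler. It is not the Mono dumper: every name is hypothetical, `SIGUSR1` stands in for the crash signal, and it assumes unnamed POSIX semaphores (`sem_init`), which work on Linux but are not supported on macOS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical names; this only illustrates the wake-up mechanism. */
static sem_t dump_requested;
static atomic_int dumps_done;

/* The dumper blocks on a semaphore instead of a condition variable. */
static void *dumper_thread(void *unused)
{
    (void)unused;
    while (sem_wait(&dump_requested) != 0) {
        if (errno != EINTR)
            return NULL;        /* unexpected failure: give up */
    }
    atomic_fetch_add(&dumps_done, 1);   /* stand-in for writing the dump */
    return NULL;
}

/* sem_post is on the POSIX list of async-signal-safe functions. */
static void request_dump(int signo)
{
    (void)signo;
    sem_post(&dump_requested);
}

/* Starts a dumper, raises the stand-in signal once, and returns how many
 * dumps the dumper thread performed. */
static int run_dumper_once(void)
{
    pthread_t dumper;
    struct sigaction sa;
    int rc;

    atomic_store(&dumps_done, 0);
    rc = sem_init(&dump_requested, 0, 0);
    assert(rc == 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = request_dump;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    rc = pthread_create(&dumper, NULL, dumper_thread, NULL);
    assert(rc == 0);
    (void)rc;

    raise(SIGUSR1);             /* handler runs on this thread and posts */
    pthread_join(dumper, NULL);
    sem_destroy(&dump_requested);
    return atomic_load(&dumps_done);
}
```

A semaphore also remembers a post that arrives before the dumper starts waiting, whereas a condvar signal with no waiter is simply lost.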
Looks like forking isn't async-safe on OSX. No suggested workarounds.
Seeing this one more and more often, may be one of the last big remaining problems.
I've got a partial fix for the handling of the double-signal situation. It won't hang now. May remain uncaught though.
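One common way to avoid hanging on a second signal is a "first one wins" guard, sketched below. This is an assumption about the general technique, not the actual fix in this PR; the names are hypothetical, and `SIGUSR1` stands in for the fault signal so the handler can return safely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical guard: set by the first signal, never cleared afterwards. */
static atomic_flag report_in_progress = ATOMIC_FLAG_INIT;
static atomic_int reports_started;

static void crash_signal_handler(int signo)
{
    (void)signo;
    /* A second signal arriving during reporting bails out instead of
     * waiting on the first, so it cannot hang. It also goes unreported. */
    if (atomic_flag_test_and_set(&report_in_progress))
        return;
    atomic_fetch_add(&reports_started, 1);  /* the reporter would run here */
}

/* Installs the handler, delivers the stand-in signal twice, and returns
 * how many reports were started. */
static int simulate_double_signal(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_signal_handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void)rc;

    atomic_flag_clear(&report_in_progress);
    atomic_store(&reports_started, 0);
    raise(SIGUSR1);
    raise(SIGUSR1);
    return atomic_load(&reports_started);
}
```

The trade-off matches the comment above: the guard removes the hang, but whatever caused the second signal is dropped rather than reported.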
A bunch of fixes are coming. They'll be varied and semantically unrelated. Do not squash.
On Wasm: ^
Threads without domains that get segfaults will end up in this handler. It's not safe to call this function with a NULL domain. See crash below:

```
* thread #1, name = 'tid_307', queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x10eff40f8)
  * frame #0: 0x000000010e1510d9 mono-sgen`mono_threads_summarize_execute(ctx=0x0000000000000000, out=0x0000001000000000, hashes=0x0000100000100000, silent=4096, mem="", provided_size=2199023296512) at threads.c:6414
    frame #1: 0x000000010e152092 mono-sgen`mono_threads_summarize(ctx=0x000000010effda00, out=0x000000010effdba0, hashes=0x000000010effdb90, silent=0, signal_handler_controller=1, mem=0x0000000000000000, provided_size=0) at threads.c:6508
    frame #2: 0x000000010df7c69f mono-sgen`dump_native_stacktrace(signal="SIGSEGV", ctx=0x000000010effef48) at mini-posix.c:1026
    frame #3: 0x000000010df7c37f mono-sgen`mono_dump_native_crash_info(signal="SIGSEGV", ctx=0x000000010effef48, info=0x000000010effeee0) at mini-posix.c:1147
    frame #4: 0x000000010de720a9 mono-sgen`mono_handle_native_crash(signal="SIGSEGV", ctx=0x000000010effef48, info=0x000000010effeee0) at mini-exceptions.c:3227
    frame #5: 0x000000010dd6ac0d mono-sgen`mono_sigsegv_signal_handler_debug(_dummy=11, _info=0x000000010effeee0, context=0x000000010effef48, debug_fault_addr=0xffffffffffffffff) at mini-runtime.c:3574
    frame #6: 0x000000010dd6a8d3 mono-sgen`mono_sigsegv_signal_handler(_dummy=11, _info=0x000000010effeee0, context=0x000000010effef48) at mini-runtime.c:3612
    frame #7: 0x00007fff73dbdf5a libsystem_platform.dylib`_sigtramp + 26
    frame #8: 0x0000000110bb81c1
    frame #9: 0x000000011085ffe1
    frame #10: 0x000000010dd6d4f3 mono-sgen`mono_jit_runtime_invoke(method=0x00007faae4f01fe8, obj=0x0000000000000000, params=0x00007ffee1eaa180, exc=0x00007ffee1ea9f08, error=0x00007ffee1eaa250) at mini-runtime.c:3215
    frame #11: 0x000000010e11509d mono-sgen`do_runtime_invoke(method=0x00007faae4f01fe8, obj=0x0000000000000000, params=0x00007ffee1eaa180, exc=0x0000000000000000, error=0x00007ffee1eaa250) at object.c:2977
    frame #12: 0x000000010e10d961 mono-sgen`mono_runtime_invoke_checked(method=0x00007faae4f01fe8, obj=0x0000000000000000, params=0x00007ffee1eaa180, error=0x00007ffee1eaa250) at object.c:3145
    frame #13: 0x000000010e11aa58 mono-sgen`do_exec_main_checked(method=0x00007faae4f01fe8, args=0x000000010f0003e8, error=0x00007ffee1eaa250) at object.c:5042
    frame #14: 0x000000010e118803 mono-sgen`mono_runtime_exec_main_checked(method=0x00007faae4f01fe8, args=0x000000010f0003e8, error=0x00007ffee1eaa250) at object.c:5138
    frame #15: 0x000000010e118856 mono-sgen`mono_runtime_run_main_checked(method=0x00007faae4f01fe8, argc=2, argv=0x00007ffee1eaa760, error=0x00007ffee1eaa250) at object.c:4599
    frame #16: 0x000000010de1db2f mono-sgen`mono_jit_exec_internal(domain=0x00007faae4f00860, assembly=0x00007faae4c02ab0, argc=2, argv=0x00007ffee1eaa760) at driver.c:1298
    frame #17: 0x000000010de1d95d mono-sgen`mono_jit_exec(domain=0x00007faae4f00860, assembly=0x00007faae4c02ab0, argc=2, argv=0x00007ffee1eaa760) at driver.c:1257
    frame #18: 0x000000010de2257f mono-sgen`main_thread_handler(user_data=0x00007ffee1eaa6a0) at driver.c:1375
    frame #19: 0x000000010de20852 mono-sgen`mono_main(argc=3, argv=0x00007ffee1eaa758) at driver.c:2551
    frame #20: 0x000000010dd56d7e mono-sgen`mono_main_with_options(argc=3, argv=0x00007ffee1eaa758) at main.c:50
    frame #21: 0x000000010dd5638d mono-sgen`main(argc=3, argv=0x00007ffee1eaa758) at main.c:406
    frame #22: 0x00007fff73aaf015 libdyld.dylib`start + 1
    frame #23: 0x00007fff73aaf015 libdyld.dylib`start + 1
  thread #2, name = 'SGen worker'
    frame #0: 0x000000010e2afd77 mono-sgen`mono_get_hazardous_pointer(pp=0x0000000000000178, hp=0x000000010ef87618, hazard_index=0) at hazard-pointer.c:208
    frame #1: 0x000000010e0b28e1 mono-sgen`mono_jit_info_table_find_internal(domain=0x0000000000000000, addr=0x00007fff73bffa16, try_aot=1, allow_trampolines=1) at jit-info.c:304
    frame #2: 0x000000010dd6aa5f mono-sgen`mono_sigsegv_signal_handler_debug(_dummy=11, _info=0x000070000fb81c58, context=0x000070000fb81cc0, debug_fault_addr=0x000000010e28fb20) at mini-runtime.c:3540
    frame #3: 0x000000010dd6a8d3 mono-sgen`mono_sigsegv_signal_handler(_dummy=11, _info=0x000070000fb81c58, context=0x000070000fb81cc0) at mini-runtime.c:3612
    frame #4: 0x00007fff73dbdf5a libsystem_platform.dylib`_sigtramp + 26
    frame #5: 0x00007fff73bffa17 libsystem_kernel.dylib`__psynch_cvwait + 11
    frame #6: 0x00007fff73dc8589 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #7: 0x000000010e28d76d mono-sgen`mono_os_cond_wait(cond=0x000000010e44c9d8, mutex=0x000000010e44c998) at mono-os-mutex.h:168
    frame #8: 0x000000010e28df4f mono-sgen`get_work(worker_index=0, work_context=0x000070000fb81ee0, do_idle=0x000070000fb81ed4, job=0x000070000fb81ec8) at sgen-thread-pool.c:165
    frame #9: 0x000000010e28d2cb mono-sgen`thread_func(data=0x0000000000000000) at sgen-thread-pool.c:196
    frame #10: 0x00007fff73dc7661 libsystem_pthread.dylib`_pthread_body + 340
    frame #11: 0x00007fff73dc750d libsystem_pthread.dylib`_pthread_start + 377
    frame #12: 0x00007fff73dc6bf9 libsystem_pthread.dylib`thread_start + 13
  thread #3, name = 'Finalizer'
    frame #0: 0x00007fff73bf6246 libsystem_kernel.dylib`semaphore_wait_trap + 10
    frame #1: 0x000000010e1d9c0a mono-sgen`mono_os_sem_wait(sem=0x000000010e43e400, flags=MONO_SEM_FLAGS_ALERTABLE) at mono-os-semaphore.h:84
    frame #2: 0x000000010e1d832d mono-sgen`mono_coop_sem_wait(sem=0x000000010e43e400, flags=MONO_SEM_FLAGS_ALERTABLE) at mono-coop-semaphore.h:41
    frame #3: 0x000000010e1da787 mono-sgen`finalizer_thread(unused=0x0000000000000000) at gc.c:920
    frame #4: 0x000000010e152919 mono-sgen`start_wrapper_internal(start_info=0x0000000000000000, stack_ptr=0x000070000fd85000) at threads.c:1178
    frame #5: 0x000000010e1525b6 mono-sgen`start_wrapper(data=0x00007faae4f31bd0) at threads.c:1238
    frame #6: 0x00007fff73dc7661 libsystem_pthread.dylib`_pthread_body + 340
    frame #7: 0x00007fff73dc750d libsystem_pthread.dylib`_pthread_start + 377
    frame #8: 0x00007fff73dc6bf9 libsystem_pthread.dylib`thread_start + 13
  thread #4
    frame #0: 0x00007fff73c0028a libsystem_kernel.dylib`__workq_kernreturn + 10
    frame #1: 0x00007fff73dc7009 libsystem_pthread.dylib`_pthread_wqthread + 1035
    frame #2: 0x00007fff73dc6be9 libsystem_pthread.dylib`start_wqthread + 13
(lldb)
```
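The shape of such a guard can be sketched as below. This is an illustration only, assuming the lookup dereferences its domain argument as the `SGen worker` frames above suggest; `ExampleDomain`, `LookupResult`, and `lookup_code_address` are hypothetical stand-ins, not Mono types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a runtime domain; not MonoDomain. */
typedef struct {
    int jit_entries;            /* pretend size of the domain's JIT table */
} ExampleDomain;

typedef enum {
    LOOKUP_SKIPPED_NO_DOMAIN,   /* thread had no domain: lookup not attempted */
    LOOKUP_MISS,
    LOOKUP_HIT
} LookupResult;

/* Checks for a NULL domain before touching any per-domain state, so a
 * domain-less thread that faults cannot crash again inside the handler. */
static LookupResult lookup_code_address(const ExampleDomain *domain,
                                        const void *addr)
{
    assert(addr != NULL);
    if (domain == NULL)
        return LOOKUP_SKIPPED_NO_DOMAIN;
    return domain->jit_entries > 0 ? LOOKUP_HIT : LOOKUP_MISS;
}
```

Returning a distinct "skipped" result lets the caller treat the fault as native code without pretending a lookup happened.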
Each frame that prints grows by the size of buff. In practice, clang often fails to deduplicate some of these buffers, leading to 30k-big stack frames. We noticed this through a series of hard-to-diagnose segfaults, on stacks that otherwise looked fine, during the crash reporting stress test. This change fixes that, making stacks a tenth of the size. It doesn't seem to break the crash reporter messages anywhere (we may need to shrink other "max name length" fields), and it's not mission-critical anywhere else.
For Linux AArch64 Coop Suspend https://jenkins.mono-project.com/job/test-mono-pull-request-coop-arm64/7904/
@luhenry could you give this a look when you have the chance |
The added test caught a lot of failures. Before these changes it failed on most runs: 90% of executions crashed in the seq-point code, the JIT info tables, or the gdb dumper.
After these changes, I was able to run the suite 10k times on OSX and 10k times on Linux without seeing a failure.
This brings some crash format changes too:
https://gist.github.com/alexanderkyte/72c12f0450513f079189e3e6561db502
I think this is a good idea. If we've got so many crash-time state reporters, having visual demarcation between them makes it easier to say "we crashed in the middle of doing X".
Please don't squash these changes: they're rather varied and would not form a single semantic unit to rebase around or revert.