
Conversation

@benjamn
Contributor

benjamn commented Mar 7, 2018

Fairly self-explanatory, though there are a number of notable changes, since it's been a while since either of these projects released a new version:

@benjamn benjamn added this to the Release 1.6.2 milestone Mar 7, 2018
@benjamn benjamn self-assigned this Mar 7, 2018
@benjamn benjamn mentioned this pull request Mar 7, 2018
@benjamn
Contributor Author

benjamn commented Mar 7, 2018

Oof, those test failures are mid-test segmentation faults, so this is definitely going to require some additional investigation.

@benjamn benjamn force-pushed the update-node-to-8.10.0-and-npm-to-5.7.1 branch from b1f8621 to dee83b1 on March 7, 2018 17:19
@benjamn
Contributor Author

benjamn commented Mar 9, 2018

Some findings.

I was able to reproduce the segmentation fault locally by running meteor self-test. The crash occurs reliably after a few tests, usually during/after the autoupdate test, though it didn't fail there in the Circle CI tests, so I don't think autoupdate is specifically to blame.

Of course, the segfault error message isn't very useful by itself, so what I really needed was to run the tests with a debug build of Node 8.10.0, and hopefully examine the core dump.

In order to get Node to write core dump files, you have to run

ulimit -c unlimited

or else the default core size limit will silently prevent multi-gigabyte core files from being written (understandable!). After running that command, core dumps appear (on macOS) in the /cores/ directory.
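
A rough sketch of that sequence (macOS assumed; the self-test invocation is whatever reproduces the crash):

ulimit -c            # check the current core file size limit (often 0)
ulimit -c unlimited  # allow full-size core files for this shell session
./meteor self-test   # reproduce the crash from the same shell
ls -lh /cores/       # on macOS, core.<pid> files land here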

What about running a debug build of Node? When you run meteor self-test, the meteor shell script ends up executing the following command:

/path/to/dev_bundle/bin/node --expose-gc /Users/ben/meteor/tools/index.js self-test

so (after building Node with --debug) I simply ran

/path/to/src/node/node_g --expose-gc /Users/ben/meteor/tools/index.js self-test

Once the tests failed, I used lldb to load the core dump:

~/meteor% lldb /Users/ben/src/node/node_g -c /cores/core.7819
(lldb) target create "/Users/ben/src/node/node_g" --core "/cores/core.7819"
Core file '/cores/core.7819' (x86_64) was loaded.
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x0000131b7ff5dd9c
    frame #1: 0x0000131b7fd04259
    frame #2: 0x0000131b7fd04101
    frame #3: 0x0000000100e05d4b node_g`v8::internal::(anonymous namespace)::Invoke(isolate=0x0000000106000000, is_construct=false, target=Handle<v8::internal::Object> @ 0x00007ffeefbf4fd0, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbf4fc8, argc=0, args=0x0000000000000000, new_target=Handle<v8::internal::Object> @ 0x00007ffeefbf4fc0, message_handling=kReport) at execution.cc:145
    frame #4: 0x0000000100e05694 node_g`v8::internal::(anonymous namespace)::CallInternal(isolate=0x0000000106000000, callable=Handle<v8::internal::Object> @ 0x00007ffeefbf5080, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbf5078, argc=0, argv=0x0000000000000000, message_handling=kReport) at execution.cc:181
    frame #5: 0x0000000100e05556 node_g`v8::internal::Execution::Call(isolate=0x0000000106000000, callable=Handle<v8::internal::Object> @ 0x00007ffeefbf50d0, receiver=Handle<v8::internal::Object> @ 0x00007ffeefbf50c8, argc=0, argv=0x0000000000000000) at execution.cc:191
    frame #6: 0x00000001004a8e8e node_g`v8::Function::Call(this=0x000000010580c8a0, context=(val_ = 0x0000000106045d08), recv=(val_ = 0x000000010580c380), argc=0, argv=0x0000000000000000) at api.cc:5330
    frame #7: 0x00000001004a8ff1 node_g`v8::Function::Call(this=0x000000010580c8a0, recv=(val_ = 0x000000010580c380), argc=0, argv=0x0000000000000000) at api.cc:5339
    frame #8: 0x000000010173098f node_g`node::InternalCallbackScope::Close(this=0x00007ffeefbf53f0) at node.cc:1469
    frame #9: 0x0000000101730e46 node_g`node::InternalMakeCallback(env=0x00007ffeefbfdbb0, recv=(val_ = 0x00000001068335c0), callback=(val_ = 0x0000000106045cf8), argc=2, argv=0x00007ffeefbf55f0, asyncContext=(async_id = 36389, trigger_async_id = 0)) at node.cc:1499
    frame #10: 0x00000001016f535d node_g`node::AsyncWrap::MakeCallback(this=0x0000000104b1e2c0, cb=(val_ = 0x0000000106045cf8), argc=2, argv=0x00007ffeefbf55f0) at async_wrap.cc:769
    frame #11: 0x00000001016fef57 node_g`node::AsyncWrap::MakeCallback(this=0x0000000104b1e2c0, symbol=(val_ = 0x00000001060479a8), argc=2, argv=0x00007ffeefbf55f0) at async_wrap-inl.h:54
    frame #12: 0x000000010182288b node_g`node::(anonymous namespace)::ProcessWrap::OnExit(handle=0x0000000104b1e310, exit_status=0, term_signal=15) at process_wrap.cc:304
    frame #13: 0x0000000101a51b94 node_g`uv__chld(handle=0x0000000102a08080, signum=20) at process.c:109
    frame #14: 0x0000000101a5302a node_g`uv__signal_event(loop=0x0000000102a07d60, w=0x0000000102a08040, events=1) at signal.c:459
    frame #15: 0x0000000101a646d9 node_g`uv__io_poll(loop=0x0000000102a07d60, timeout=0) at kqueue.c:349
    frame #16: 0x0000000101a44aef node_g`uv_run(loop=0x0000000102a07d60, mode=UV_RUN_DEFAULT) at core.c:368
    frame #17: 0x0000000101751b4a node_g`node::Start(isolate=0x0000000106000000, isolate_data=0x00007ffeefbfe290, argc=3, argv=0x0000000104d001d0, exec_argc=1, exec_argv=0x0000000104d002d0) at node.cc:4782
    frame #18: 0x000000010174122c node_g`node::Start(event_loop=0x0000000102a07d60, argc=3, argv=0x0000000104d001d0, exec_argc=1, exec_argv=0x0000000104d002d0) at node.cc:4849
    frame #19: 0x00000001017408e0 node_g`node::Start(argc=3, argv=0x0000000104d001d0) at node.cc:4906
    frame #20: 0x00000001017d898e node_g`main(argc=4, argv=0x00007ffeefbfeb88) at node_main.cc:106
    frame #21: 0x0000000100001634 node_g`start + 52

Unfortunately, as far as I can tell, this is just a stack trace for the process.onExit handler that fires as a result of the segmentation fault, rather than an indication of what caused the crash.

Fortunately, it's possible to run Node and Meteor from within lldb, and set a handler to catch the SIGSEGV signal. This is a bit tricky because you have to run

process handle SIGSEGV --notify true --pass true --stop true

after the process has started, as early as possible (before the crash happens). The trick is to set a breakpoint on the main function using br s -n main so that the process will pause as soon as it starts, then set the handler, then continue with the c command:

~/meteor% lldb /Users/ben/src/node/node_g 
(lldb) target create "/Users/ben/src/node/node_g"
Current executable set to '/Users/ben/src/node/node_g' (x86_64).
(lldb) br s -n main
Breakpoint 1: where = node_g`main + 38 at node_main.cc:104, address = 0x00000001017d8956
(lldb) process handle SIGSEGV --notify true --pass true --stop true
error: No current process; cannot handle signals until you have a valid process.
(lldb) run --expose-gc /Users/ben/meteor/tools/index.js self-test autoupdate
Process 14725 launched: '/Users/ben/src/node/node_g' (x86_64)
Process 14725 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001017d8956 node_g`main(argc=5, argv=0x00007ffeefbfeaa0) at node_main.cc:104
   101  #endif
   102    // Disable stdio buffering, it interacts poorly with printf()
   103    // calls elsewhere in the program (e.g., any logging from V8.)
-> 104    setvbuf(stdout, nullptr, _IONBF, 0);
   105    setvbuf(stderr, nullptr, _IONBF, 0);
   106    return node::Start(argc, argv);
   107  }
Target 0: (node_g) stopped.
(lldb) process handle SIGSEGV --notify true --pass true --stop true
NAME         PASS   STOP   NOTIFY
===========  =====  =====  ======
SIGSEGV      true   true   true 
(lldb) c
Process 14725 resuming
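
For what it's worth, that breakpoint-then-handle dance can presumably be scripted with lldb's -o (one-line command) flags, which run in order once the target is loaded; a sketch:

lldb /Users/ben/src/node/node_g \
  -o 'br s -n main' \
  -o 'run --expose-gc /Users/ben/meteor/tools/index.js self-test autoupdate' \
  -o 'process handle SIGSEGV --notify true --pass true --stop true' \
  -o 'continue'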

Unfortunately, I haven't been able to reproduce the crash while running under lldb. The autoupdate test (and many tests after it) pass, somehow.

Sorry that's a bit anti-climactic. We'll have to keep digging.

@vladholubiev
Contributor

vladholubiev commented Mar 12, 2018

Other people, including myself, are experiencing this issue with Node.js 8.10 as well.

See nodejs/node#19274 and RocketChat/Rocket.Chat#10060 for details; it looks like a V8 issue?

@abernix
Contributor

abernix commented Mar 14, 2018

Just as an update:

I currently have a git bisect start v8.10.0 v8.9.4 && git bisect run <shell script which builds each commit of Node in that range from source using ./scripts/generate-dev-bundle.sh, then spawns a ./meteor self-test and watches for a segmentation fault before moving on to the next commit> running. It has been going for some time now and should eventually reveal which particular Node.js commit caused this.

Each build from source takes quite some time, and there are a number of Node.js releases in that range that just will not build, but there are currently 19 revisions remaining. 😄 🤞
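
The shell script itself isn't included above, but the kind of wrapper git bisect run expects might look roughly like this (hypothetical; git bisect run treats exit 125 as "skip this commit", 0 as good, and any other status as bad):

#!/usr/bin/env bash
# Hypothetical bisect helper; detecting the crash via exit status is one
# option, the real script may instead watch the test output.
./scripts/generate-dev-bundle.sh || exit 125   # this Node commit won't build: skip it
./meteor self-test
status=$?
[ "$status" -eq 139 ] && exit 1   # 139 = 128 + SIGSEGV(11): segfault, mark commit bad
exit 0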

@lmachens

I experienced this issue too. My Meteor 1.6.1 app was mistakenly running on Now with Node v8.10.0 and crashed every few minutes without an error message. See jkrup/meteor-now#101.

@abernix
Contributor

abernix commented Mar 15, 2018

After working through a number of transitional non-buildable commits in the Node source bisection, my git bisect eventually returned:

The first bad commit could be any of:
5d80b0edd93bd9250a15fe126a2780153c17f8d8
16a980b4c4271902e50c99e66b5693f1d1556e38
ae8c83833928a24525595eb676857e8f3264affd
0b690a9ce3a10aa68aefbb50398758e52f50a331
711f344c2e44cbaad0486ed147e8622038049d27
0e30ca942e54833ce0fd02e6b1dc11aa90a20dfe
b71a33c2bf75314e9764908e9c1a3551f8ee58c3
be734c513c7c0bb480cb201b368576fcdcc6021a
ebee8edca2ca80c3f8ae0a3eeda00dbdaaf7c545
0ee645510d4f8e9e4c8a604b1ea1aff6c89fad01
a7fc12772d2644003005dabc390d6bab2022c344
0a064c4b6814a7865a09625d34ef9fea68ad9f7c
aa4f58a9a5ed63d4dbf01d3e1fef67f2912bcb5d
51ad36a901c2e8e88ef0197636becf774c0f6daf
805084b59dcb62cd20486a6a6259bf823f13d4bf
92a93c02c45da49fcadecb0f7c150fb5ed33c9b8
aae68d3ef09a86c3fe94bafe58d48bf14c53ca4e
2b84fa9514c929fabe7294de3b46bc5c72fbca1f
d3aa9eeb1d5048b4750211e8065f391cac23957e
758b730139c50f47de7f975af02e735cbf3f27cd
bede7a3cfa884c67cbb5d804a1a4f130bea4ac38
1e316826ffef162edb8cee8d2709ef9b19cf63f9

These 22 commits correspond almost exactly to the commits that made up the V8 upgrade to 6.2 in nodejs/node#16413. That PR's commit hashes landed in the v8.x branch as nodejs/node@072902a...2a2c881 (072902 is bede7a3's parent and bede7a3 is in the above list).

Unfortunately, I think this means the problem is related to the way that V8 6.2 landed in Node v8.10.0 (the Node release, not to be confused with V8 itself). It's yet to be determined whether that's because of a change in V8 that will cause us problems, or because of an incorrect application of the V8 updates to the Node source (V8 upgrades are intentionally grafted onto the Node.js source in a somewhat piecemeal manner, since Node maintains its own copy of V8).

This will require further investigation, but we are working on this.

@benjamn benjamn force-pushed the update-node-to-8.10.0-and-npm-to-5.7.1 branch from dee83b1 to 64dbd6f on March 17, 2018 21:13
@benjamn
Contributor Author

benjamn commented Mar 19, 2018

I’ve just finished manually bisecting tagged V8 revisions between 6.1.534 (the version used by Node 8.9.4) and 6.2.414 (the version used by Node 8.10.0), a range that spans 948 commits.

At each step I ran

tools/release/update_node.py /Users/ben/src/v8 /Users/ben/src/node

to copy the V8 source code into my Node repository, which I first reset to v8.9.4 (the last stable version that shipped in Meteor 1.6.1).

I then built the resulting Node code using Ninja, which is considerably faster than make:

./configure --ninja
ninja -C out/Release

Especially in the beginning, many of these builds would fail because the V8 revision I had copied was in some weird limbo between working commits, but I gradually built up a script that git cherry-pick'd the necessary commits to fix those problems. Most of these commits did not apply cleanly, but I was able to apply the non-conflicting hunks somewhat reliably by passing the following options to git cherry-pick:

git cherry-pick <revision> --strategy=recursive -X theirs
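
The fixup script isn't reproduced here, but the gist of it is a loop along these lines (the <fixup-sha> placeholders stand in for the actual build-fix commits):

for sha in <fixup-sha-1> <fixup-sha-2>; do
  # keep whatever hunks apply cleanly; picks that conflict entirely get skipped
  git cherry-pick "$sha" --strategy=recursive -X theirs || git cherry-pick --skip
done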

Provided the ninja build succeeded, I patched these custom versions of Node into the Meteor dev_bundle by replacing dev_bundle/bin/node with a symlink to /Users/ben/src/node/out/Release/node, and rebuilt dev_bundle/lib/node_modules to recompile binary packages:

cd /path/to/meteor/dev_bundle/lib
../../meteor npm rebuild --build-from-source

I then ran

meteor self-test --file '^[a-c]'

to see if I could reproduce the segmentation fault, which occurred reliably during the autoupdate test in most cases.
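
Putting those steps together, one iteration of the manual bisection looks roughly like the following sketch (<candidate-tag> stands in for whichever tagged V8 revision is under test; the cherry-pick fixups from above are omitted):

cd /Users/ben/src/v8 && git checkout <candidate-tag>
tools/release/update_node.py /Users/ben/src/v8 /Users/ben/src/node
cd /Users/ben/src/node && ./configure --ninja && ninja -C out/Release
# the symlink only needs to be created once
ln -sf /Users/ben/src/node/out/Release/node /path/to/meteor/dev_bundle/bin/node
(cd /path/to/meteor/dev_bundle/lib && ../../meteor npm rebuild --build-from-source)
meteor self-test --file '^[a-c]'   # segfault => bad tag, clean run => good tag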

This bisection had to be manual because I was only considering tagged V8 commits, in an attempt to reduce the chances that a commit chosen automatically by git bisect would fail to compile, and because there are considerably fewer tags than commits. Here are my notes from that process:
v8_6.2_manual_bisect.txt

As you can see from those notes, the problem appears to have been introduced between V8 versions 6.2.27 and 6.2.270. That's good news!

Bad news: there are 539 commits between those tags, which is more than half of the original 948 commits.

So where does that leave us? I will continue bisecting this range using the following command:

~/src/v8% git bisect start 6.2.270~1 6.2.27~1
Bisecting: 268 revisions left to test after this (roughly 8 steps)
[744b901d414be2b38a1053f498da3b711d97d2ca] [heap] Implement write barrier in code stub assembly

Should be fun!

@benjamn
Contributor Author

benjamn commented Mar 19, 2018

And here it is!

# bad: [4db608d72485bf9e6179c1a11c3e69f5ec284492] [wasm] Delete redundant enumeration definition
# good: [851e8057e652d134fb032e42caff8b60fd40d363] [cleanup] Remove unused InternalPackedArray.
git bisect start '6.2.270~1' '6.2.27~1'
# good: [744b901d414be2b38a1053f498da3b711d97d2ca] [heap] Implement write barrier in code stub assembly
git bisect good 744b901d414be2b38a1053f498da3b711d97d2ca
# bad: [053918b35efab0d5d888fd37c5346f1b34b4aa29] PPC/s390: [turbofan] Properly check new.target parameter in inlined Reflect.construct.
git bisect bad 053918b35efab0d5d888fd37c5346f1b34b4aa29
# bad: [4455db16722d3fd501a1b940d17cd325f065c5e2] Reland "[heap] Improve concurrent marking pausing protocol."
git bisect bad 4455db16722d3fd501a1b940d17cd325f065c5e2
# good: [56f392292cbf5e343f080e6a924ee16001002f75] [heap] Enable compaction for concurrent marking.
git bisect good 56f392292cbf5e343f080e6a924ee16001002f75
# bad: [575ec86335a839660394f55a34c5e615d9c1b4f3] [wasm] Implement atomic logical BinOps
git bisect bad 575ec86335a839660394f55a34c5e615d9c1b4f3
# bad: [ea0e1e21ecc13884302e0c77edad67659f2e68b4] Fixing failure on GC stress.
git bisect bad ea0e1e21ecc13884302e0c77edad67659f2e68b4
# good: [15ef03cbf3a439a99966a6e718e99f0617a9f604] Reland "[builtins] Port getting property from Proxy to CSA"
git bisect good 15ef03cbf3a439a99966a6e718e99f0617a9f604
# good: [943651b789f8757fd86f133b606eb7d2a86bcc1b] Revert "Reland "[turbofan] enable new implementation of escape analysis""
git bisect good 943651b789f8757fd86f133b606eb7d2a86bcc1b
# good: [fd87a3c4236ed5bef4252818e40a38f020cdf671] [wasm] Remove redundant parameter
git bisect good fd87a3c4236ed5bef4252818e40a38f020cdf671
# first bad commit: [ea0e1e21ecc13884302e0c77edad67659f2e68b4] Fixing failure on GC stress.

Now we need to develop a theory as to why v8/v8@ea0e1e2 introduces this problem, or at least find something upstream that further refines/fixes this logic.

My theory: if code->InvalidateEmbeddedObjects() runs from a Fiber, it might get confused about what objects are reachable. If it starts from the fiber's execution stack rather than Node's default execution stack, it might invalidate more embedded objects than it should, because it mistakenly concludes they're unreachable (from the fiber's stack). References to some of those objects might be retained elsewhere in the program, and might cause a crash if used after they are freed. One such object that shows up repeatedly in core dump backtraces is the process._tickCallback function.

I can think of several ways to mitigate this problem, but we need to confirm that's what's happening before we try to implement any solutions.

@abernix
Contributor

abernix commented Mar 19, 2018

The commit message on v8/v8@ea0e1e2 indicates that it was meant to fix a bug originally introduced in v8/v8@e15f554, but that commit was actually reverted by v8/v8@a193fde.

I may be misreading the commit history, but if that follow-up commit is still in place while the code that introduced the bug it was fixing has been reverted, then v8/v8@ea0e1e2 should have been reverted as well, and I don't see that ever happening, up to and including the current state of master.

What appears to be another attempt to re-land the changes originally committed in v8/v8@e15f554 showed up in v8/v8@a01ac7c; however, that too was reverted in v8/v8@3138850.

Honestly, I think the change in v8/v8@ea0e1e2 should have been reverted, at least for the V8 revision we're currently on, barring any later V8 changes that make it acceptable. Looking at the commit, the logic appears to have been moved outside an Isolate-aware section of the code, and that could be a contributing factor.

@benjamn
Contributor Author

benjamn commented Mar 19, 2018

I can confirm that reverting v8/v8@ea0e1e2 starting from the v8.10.0 branch of Node (with no other changes) fixes the segmentation fault in every case in which I'm aware of it previously failing.
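
For anyone wanting to reproduce that locally, one possible way to back the change out of Node's vendored copy of V8 is to reverse-apply the upstream commit into deps/v8 (a sketch; this assumes the patch still applies cleanly to the V8 sources that ship with Node v8.10.0):

cd ~/src/node && git checkout v8.10.0
curl -sL https://github.com/v8/v8/commit/ea0e1e21ecc13884302e0c77edad67659f2e68b4.patch \
  | git apply -R --directory=deps/v8     # reverse-apply into the vendored V8 tree
./configure --ninja && ninja -C out/Release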

@benjamn
Contributor Author

benjamn commented Mar 19, 2018

Oh geez, I just realized why there were so many commits between 6.2.27 and 6.2.270… the release tags were sorted lexicographically here, even though they should be interpreted numerically, like any reasonable versioning system. 🤦‍♂️

Update: the problem was actually introduced between 6.2.145 and 6.2.146, a range that only contains 42 commits. Updated nodejs/node#19274 (comment) to remove any confusion about this mistake.
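
Incidentally, git can produce the numeric ordering directly, which avoids this trap when picking tags for a bisect range; for example:

git tag -l '6.2.*' --sort=version:refname   # 6.2.27 < 6.2.145 < 6.2.270, as intended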

@benjamn
Contributor Author

benjamn commented Mar 29, 2018

We're going to jump to Node 8.11.1 instead of 8.10.0, because of the segmentation fault problem.

New PR for updating npm to version 5.8.0: #9778
