Fix crash when channel is removed but _notify_queue still has events #100
When fuse_session_remove_chan is called, the _notify_loop thread _must not_ be running or it may try to use the deleted channel and cause a segfault.
This is also intended to speed up shutdown of the notify queue when events are still pending, for example in tests that repeatedly create and destroy llfuse sessions.
@tetron Thanks for the PR! I never noticed an issue with this. How did you find it?
My best guess is that if you are not sending a lot of invalidation notifications back to the kernel, then the notify queue will be empty and you don't have a problem. My filesystem sits in front of a web service and does asynchronous cleanup to control memory usage (because storing all filesystem metadata would take far more than a reasonable amount of RAM, never mind the file data itself). That cleanup happens independently of the handlers and sends lots of invalidation notifications back to the kernel, so that it correctly forgets about filesystem entries that I want to purge from RAM. So the invalidation notification queue sees a lot of activity. Doing an unmount with pending events results in the crash.
This is also happening in a test suite that mounts and unmounts the filesystem repeatedly, so it has lots of opportunities to hit the right conditions to trigger the crash. I debugged it by looking at the core dump with "gdb", and even without a set of debugging symbols I had enough of a stack trace to read the code and figure out what was going on.
All that said, our test suite was passing without this fix, but started crashing when I made some other changes. I did have a bit of a "how did this ever work" moment, because the bug has always been there.
Interesting. Could there be an easy test in our testsuite that fails without your changes and succeeds after them? I assume you tested the changes in this PR in production, where you experienced the issues?
I can look at adding a test. I believe filling the queue with kernel invalidation notifications and then immediately trying to unmount the filesystem should produce the crash, but since it is a race condition, it may not happen every time. We saw these issues in our test suite. I tried very hard to work around it on our side, but in the end fixing llfuse was the only thing that worked. With this fix, our test suite no longer randomly segfaults.
(Our FUSE driver is here: https://github.com/arvados/arvados/tree/main/services/fuse)
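The reproduction idea above (fill the queue, then tear down immediately) can be modelled without a real mount. The following is only a sketch using the standard library: `FakeChannel`, `run_session`, and the `join_before_close` flag are hypothetical names invented for illustration, not part of llfuse, and setting `closed` merely stands in for `fuse_session_remove_chan()`. A use of the channel after it has been closed is counted instead of crashing.

```python
import queue
import threading
import time


class FakeChannel:
    """Hypothetical stand-in for the FUSE channel; not an llfuse type."""

    def __init__(self):
        self.closed = False
        self.use_after_close = 0

    def send(self, item):
        if self.closed:
            # In the real C code this is where the segfault would occur.
            self.use_after_close += 1
            return
        time.sleep(0.001)  # pretend the kernel round trip takes a moment


def run_session(pending, join_before_close):
    """Queue `pending` events, tear down, and return the use-after-close count."""
    chan = FakeChannel()
    q = queue.Queue()

    def notify_loop():
        while True:
            req = q.get()
            if req is None:
                break
            chan.send(req)

    t = threading.Thread(target=notify_loop, daemon=True)
    t.start()
    for i in range(pending):
        q.put(i)
    q.put(None)           # sentinel that ends the worker loop
    if join_before_close:
        t.join()          # wait for the worker before removing the channel
    chan.closed = True    # models removal of the channel
    t.join()              # let the worker finish so the count is final
    return chan.use_after_close
```

With `join_before_close=True` the count is always zero. With `join_before_close=False` and a few hundred pending events the count is usually nonzero, but because this is a race the exact number varies from run to run, which matches the caveat that a real test may not fail every time.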
@Nikratio can you review this please?
@@ -267,7 +267,11 @@ def _notify_loop():
while True:
req = _notify_queue.get()
if req is None:
- return
+ break
I don't think this change has any semantic effect...?
if _notify_queue_shutdown.is_set():
# Just drain the queue
continue
Why not break/return?
I don't see any harm in this patch, but it also seems like it's doing much more than needed. Are you sure that it isn't sufficient to add the
t = threading.Thread(target=_notify_loop)
t.daemon = True
t.start()
on_exit.callback(_notify_queue_shutdown.set)
Why do we need this in addition to the sentinel object? And if the sentinel method doesn't work, shouldn't we replace it with the Event instead of doing both?
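One way to see the difference between the two mechanisms is to count how many queued requests the worker still handles before it reaches the sentinel. The sketch below is an assumption-laden model, not llfuse code: `drain`, `use_event`, and `handled` are hypothetical names, and appending to `handled` stands in for sending a notification through the channel. The shutdown is signalled before the worker starts so that the result is deterministic.

```python
import queue
import threading


def drain(pending, use_event):
    """Return how many queued requests the worker actually handled."""
    q = queue.Queue()
    shutdown = threading.Event()
    handled = []

    def notify_loop():
        while True:
            req = q.get()
            if req is None:
                break
            if use_event and shutdown.is_set():
                continue  # just drain the queue, do no real work
            handled.append(req)  # stands in for a notification to the kernel

    for i in range(pending):
        q.put(i)
    # Signal shutdown before the worker runs, so the outcome is deterministic.
    shutdown.set()
    q.put(None)
    t = threading.Thread(target=notify_loop)
    t.start()
    t.join()
    return len(handled)
```

In this model, the sentinel alone makes the worker handle every pending request before it exits, while the Event lets it discard them and reach the sentinel quickly. Whether breaking out of the loop as soon as the Event is set would be equivalent in llfuse itself is the open question in the review comments; the sketch does not settle it.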
@@ -313,6 +315,7 @@ def main(workers=None, handle_signals=True):
session_loop_single()
else:
session_loop_mt(workers)
t.join()
Would be a good idea to leave a comment here (in the code) on why this is needed, or someone may remove this again at some point as "not needed".
After llfuse.main() exits the handler loop but before it returns, wait for the _notify_loop thread to terminate.
If we don't do this, when llfuse.close() calls fuse_session_remove_chan() while _notify_loop is still running in a separate thread, the thread will segfault when it attempts to use the channel that has just been removed. This is especially a problem with unit test suites that set up and tear down the FUSE mount repeatedly in the same process.
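The intended ordering can be sketched as a small model that records the sequence of events. Everything here is hypothetical: `main_sketch` and `close_sketch` are illustrative stand-ins for `llfuse.main()` and `llfuse.close()`, the use of `contextlib.ExitStack` is an assumption based on the `on_exit.callback(...)` line in the diff, and the order in which the two callbacks are registered is a guess rather than something the patch shows.

```python
import contextlib
import queue
import threading


def main_sketch(log):
    """Model of main(): run the handler loop, then wait for the worker."""
    q = queue.Queue()
    shutdown = threading.Event()

    def notify_loop():
        while True:
            req = q.get()
            if req is None:
                break
            if shutdown.is_set():
                continue
            log.append(("notify", req))
        log.append("notify_loop exited")

    with contextlib.ExitStack() as on_exit:
        t = threading.Thread(target=notify_loop, daemon=True)
        t.start()
        # ExitStack runs callbacks in reverse registration order on exit.
        on_exit.callback(q.put, None)    # runs second: wakes the worker
        on_exit.callback(shutdown.set)   # runs first: stops real work
        log.append("handler loop finished")
    # Wait for the worker so that nothing can touch the channel afterwards.
    t.join()
    log.append("main returns")


def close_sketch(log):
    """Model of close(): the point where the channel would be removed."""
    log.append("channel removed")
```

Calling `main_sketch(log)` followed by `close_sketch(log)` on an empty list records the worker's exit before "main returns", and therefore before "channel removed". Dropping the `t.join()` line removes that guarantee, which is the situation the comment in the code is meant to warn about.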