Fix crash when channel is removed but _notify_queue still has events #100

Open · tetron wants to merge 2 commits into master

Conversation

@tetron commented Mar 15, 2024

After llfuse.main() exits the handler loop but before it returns, wait for the _notify_loop thread to terminate.

If we don't do this, llfuse.close() can call fuse_session_remove_chan() while _notify_loop is still running in a separate thread, and _notify_loop will segfault when it tries to use the channel that has just been removed. This is especially a problem for unit test suites that set up and tear down the FUSE mount repeatedly in the same process.
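
To make the ordering concrete, here is a minimal sketch of the shutdown sequence this change enforces (illustrative only; `notify_thread` and `main_sketch` are placeholder names, not the actual llfuse internals):

```python
import queue
import threading

_notify_queue = queue.Queue()

def _notify_loop():
    # Forward queued invalidation requests to the kernel until the
    # sentinel (None) arrives.
    while True:
        req = _notify_queue.get()
        if req is None:
            break
        # ... send the invalidation via the FUSE channel ...

def main_sketch():
    notify_thread = threading.Thread(target=_notify_loop, daemon=True)
    notify_thread.start()

    # ... run the FUSE request handler loop here ...

    _notify_queue.put(None)   # ask the notify loop to stop
    notify_thread.join()      # wait until it has actually stopped

# Only after join() returns is it safe to call fuse_session_remove_chan()
# (via llfuse.close()), because _notify_loop can no longer touch the channel.
```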

tetron added 2 commits March 12, 2024 20:07

1. When fuse_session_remove_chan is called, the _notify_loop thread _must not_ be running or it may try to use the deleted channel and cause a segfault.
2. Intended to speed up the notify queue shutdown if there are pending events when running tests that repeatedly create and destroy llfuse sessions.
@ThomasWaldmann (Collaborator)

@tetron Thanks for the PR!

I never noticed an issue with this - how did you find this?

@tetron (Author) commented Mar 15, 2024

My best guess is that if you are not sending a lot of invalidation notifications back to the kernel, then the notify queue will be empty and you don't have a problem.

My filesystem sits in front of a web service and does asynchronous cleanup to control memory usage (because storing all of the filesystem metadata takes >>> a reasonable amount of RAM, never mind the file data itself). That cleanup happens independently of the handlers and sends lots of invalidation notifications back to the kernel so it correctly forgets about filesystem entries that I want to purge from RAM. As a result, the invalidation notification queue sees a lot of activity, and doing an unmount with pending events results in the crash.

This is also happening in a test suite that mounts and unmounts the filesystem repeatedly, so it has lots of opportunities to hit the right conditions and trigger the crash.

I debugged it by looking at the core dump with gdb; even without debugging symbols I had enough of a stack trace to read the code and figure out what was going on.

All that said, our test suite was passing without this fix, but started crashing when I made some other changes. I did have a bit of a "how did this ever work" moment, because the bug has always been there.

@ThomasWaldmann (Collaborator)

Interesting. Could there be an easy test in our test suite that fails without your changes and succeeds with them?

I assume you tested the changes in this PR in production, where you experienced the issues?

@tetron (Author) commented Mar 15, 2024

I can look at adding a test. I believe filling the queue with kernel invalidation notifications and then immediately trying to unmount the filesystem should produce the crash, but since it is a race condition it may not happen every time.
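
Roughly what I have in mind, as an untested sketch (it assumes a minimal `Operations` instance and a temporary mountpoint are already set up, as in the existing test suite, and that `fusermount` is available):

```python
import subprocess
import threading

import llfuse

def test_unmount_with_pending_notifications(operations, mountpoint):
    llfuse.init(operations, mountpoint, ['fsname=llfuse_test'])
    worker = threading.Thread(target=llfuse.main)
    worker.start()

    # Flood the notify queue so invalidation events are still pending
    # when the unmount happens.
    for _ in range(1000):
        llfuse.invalidate_inode(llfuse.ROOT_INODE, attr_only=True)

    # Tear down immediately: this is the race window where, without the
    # fix, _notify_loop can still be using the channel that
    # fuse_session_remove_chan() removes.
    subprocess.check_call(['fusermount', '-u', '-z', mountpoint])
    worker.join()
    llfuse.close(unmount=False)
```

Since it is timing-dependent, the test may need to repeat this a few times to hit the crash reliably.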

We saw these issues in our test suite. I tried very hard to work around it on our side but in the end fixing llfuse was the only thing that worked. With this fix, our test suite no longer randomly segfaults.

@tetron (Author) commented Mar 15, 2024

(Our FUSE driver is here: https://github.com/arvados/arvados/tree/main/services/fuse)

@ThomasWaldmann (Collaborator)

@Nikratio can you review this please?

@@ -267,7 +267,11 @@ def _notify_loop():
     while True:
         req = _notify_queue.get()
         if req is None:
-            return
+            break
A Contributor commented:

I don't think this change has any semantic effect...?


+        if _notify_queue_shutdown.is_set():
+            # Just drain the queue
+            continue
A Contributor commented:

Why not break/return?

@Nikratio (Contributor)

I don't see any harm in this patch, but it also seems like it's doing much more than needed. Are you sure that it isn't sufficient to add the join call (with no other changes)?

     t = threading.Thread(target=_notify_loop)
     t.daemon = True
     t.start()
+    on_exit.callback(_notify_queue_shutdown.set)
A Contributor commented:

Why do we need this in addition to the sentinel object? And if the sentinel method doesn't work, shouldn't we replace it with the Event instead of doing both?
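
For reference, an Event-only variant along the lines this question suggests might look like the following (a sketch only, not code from this PR; it assumes `_notify_queue_shutdown` is a `threading.Event`):

```python
import queue
import threading

_notify_queue = queue.Queue()
_notify_queue_shutdown = threading.Event()

def _notify_loop():
    # Event-only shutdown: poll the queue with a timeout so the loop can
    # notice the shutdown event even if no further requests arrive.
    while not _notify_queue_shutdown.is_set():
        try:
            req = _notify_queue.get(timeout=1)
        except queue.Empty:
            continue
        # ... forward the invalidation request to the kernel ...
```

The trade-off is that without the sentinel the loop has to poll, whereas putting `None` on the queue wakes it up immediately.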

@@ -313,6 +315,7 @@ def main(workers=None, handle_signals=True):
         session_loop_single()
     else:
         session_loop_mt(workers)
+    t.join()
A Contributor commented:

Would be a good idea to leave a comment here (in the code) on why this is needed, or someone may remove this again at some point as "not needed".
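
For example, something along these lines right above the join (the wording is only a suggestion, not part of the PR):

```python
# _notify_loop must have terminated before llfuse.close() calls
# fuse_session_remove_chan(); otherwise the notify thread may still use the
# channel that was just removed and segfault. Do not remove this join.
t.join()
```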
