Fix crash when channel is removed but _notify_queue still has events #100

tetron · 2024-03-15T14:22:00Z

After llfuse.main() exits the handler loop but before it returns, wait for the _notify_loop thread to terminate.

If we don't do this, when llfuse.close() calls fuse_session_remove_chan() while _notify_loop is still running in a separate thread, it will segfault when it attempts to use the channel that has just been removed. This is especially a problem with unit test suites that set up and tear down the FUSE mount repeatedly in the same process.

When fuse_session_remove_chan is called, the _notify_loop thread _must not_ be running or it may try to use the deleted channel and cause a segfault.

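The second commit adds an event variable that tells the notify loop to drain the queue. It is intended to speed up the notify queue shutdown when there are pending events, which matters when running tests that repeatedly create and destroy llfuse sessions.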
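
The race can be illustrated without FUSE at all. The following is a minimal pure-Python sketch, not the llfuse code: `FakeChannel`, `notify_loop`, and `run_session` are hypothetical names, and setting `removed = True` merely stands in for `fuse_session_remove_chan()`. It counts how often the worker touches the channel after removal, which in the real C code would be the segfault.

```python
import queue
import threading
import time


class FakeChannel:
    """Hypothetical stand-in for the FUSE channel (not part of llfuse)."""

    def __init__(self):
        self.removed = False
        self.use_after_remove = 0

    def send(self, req):
        if self.removed:
            # In the real C code this access would be the crash.
            self.use_after_remove += 1
        time.sleep(0.001)  # pretend the kernel call takes a moment


def notify_loop(q, chan):
    """Consume requests until the None sentinel arrives."""
    while True:
        req = q.get()
        if req is None:
            break
        chan.send(req)


def run_session(pending, wait_for_thread):
    """Queue `pending` events, shut down, and report use-after-remove count."""
    q = queue.Queue()
    chan = FakeChannel()
    t = threading.Thread(target=notify_loop, args=(q, chan), daemon=True)
    t.start()
    for i in range(pending):
        q.put(i)
    q.put(None)  # sentinel asking the loop to stop
    if wait_for_thread:
        t.join()  # the fix: the thread must be gone before removal
    chan.removed = True  # stands in for fuse_session_remove_chan()
    t.join()  # only so the sketch can report a final count
    return chan.use_after_remove


if __name__ == "__main__":
    print("with join:   ", run_session(50, wait_for_thread=True))
    print("without join:", run_session(50, wait_for_thread=False))
```

With the join in place the count is always zero. Without it the count is typically non-zero when events are pending, but since this is a race the exact number varies from run to run.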

ThomasWaldmann · 2024-03-15T14:41:54Z

@tetron Thanks for the PR!

I never noticed an issue with this - how did you find this?

tetron · 2024-03-15T15:30:48Z

My best guess is that if you are not sending a lot of invalidation notifications back to the kernel, then the notify queue will be empty and you don't have a problem.

My filesystem sits in front of a web service and does asynchronous cleanup to control memory usage (because storing all filesystem metadata would take far more than a reasonable amount of RAM, never mind the file data itself). That cleanup happens independently of the handlers and sends lots of invalidation notifications back to the kernel, so that it correctly forgets about filesystem entries that I want to purge from RAM. As a result, the invalidation notification queue sees a lot of activity. Doing an unmount with pending events results in the crash.

This is also happening in a test suite where it mounts and unmounts the filesystem repeatedly so it has lots of opportunities to hit the right conditions to trigger the crash.

I debugged it by looking at the core dump with "gdb". Even without a full set of debugging symbols, I had enough of a stack trace to read the code and figure out what was going on.

All that said, our test suite was passing without this fix, but started crashing when I made some other changes. I did have a bit of a "how did this ever work" moment, because the bug has always been there.

ThomasWaldmann · 2024-03-15T15:58:52Z

Interesting. Could there be an easy test in our testsuite that fails without your changes and succeeds after them?

I assume you tested the changes in this PR in production, where you experienced the issues?

tetron · 2024-03-15T16:47:24Z

I can look at adding a test. I believe filling the queue with kernel invalidation notifications and then immediately trying to unmount the filesystem should produce the crash, but since it is a race condition it may not happen every time.

We saw these issues in our test suite. I tried very hard to work around it on our side but in the end fixing llfuse was the only thing that worked. With this fix, our test suite no longer randomly segfaults.

tetron · 2024-03-15T16:48:13Z

(Our FUSE driver is here: https://github.com/arvados/arvados/tree/main/services/fuse)

ThomasWaldmann · 2024-08-31T12:46:14Z

@Nikratio can you review this please?

Nikratio · 2024-08-31T13:34:36Z

src/misc.pxi

@@ -267,7 +267,11 @@ def _notify_loop():
    while True:
        req = _notify_queue.get()
        if req is None:
-            return
+            break


I don't think this change has any semantic effect...?

Nikratio · 2024-08-31T13:34:50Z

src/misc.pxi

+
+        if _notify_queue_shutdown.is_set():
+            # Just drain the queue
+            continue


Why not break/return?
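
For reference, the behavioural difference between `continue` and `break` here can be seen in a small pure-Python sketch. This is not the llfuse source: `notify_loop` and `demo` below are hypothetical, and the `processed` list stands in for whatever the real loop does with a request. It only shows what the snippet above does once the event is set: entries are consumed and discarded until the sentinel arrives.

```python
import queue
import threading


def notify_loop(q, shutdown, processed):
    """Process requests; once `shutdown` is set, discard them instead."""
    while True:
        req = q.get()
        if req is None:  # sentinel: leave the loop
            break
        if shutdown.is_set():
            continue  # drain: consume the entry without acting on it
        processed.append(req)  # stands in for the real notification call


def demo(pending):
    """Return (number processed, thread still alive?, entries left in queue)."""
    q = queue.Queue()
    shutdown = threading.Event()
    processed = []
    for i in range(pending):
        q.put(i)
    shutdown.set()  # shutdown requested while events are still pending
    q.put(None)
    t = threading.Thread(
        target=notify_loop, args=(q, shutdown, processed), daemon=True
    )
    t.start()
    t.join(timeout=5)
    return len(processed), t.is_alive(), q.qsize()


if __name__ == "__main__":
    print(demo(100))
```

With `continue`, the thread exits only after the sentinel and leaves the queue empty. With `break` in the same place it would exit sooner but leave the remaining entries (and the sentinel) in the queue. Whether leftover entries matter depends on how the queue is reused, which this sketch does not settle.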

Nikratio · 2024-08-31T13:37:09Z

I don't see any harm in this patch, but it also seems like it's doing much more than needed. Are you sure that it isn't sufficient to add the join call (with no other changes)?

Nikratio · 2024-08-31T13:38:24Z

src/fuse_api.pxi

        t = threading.Thread(target=_notify_loop)
        t.daemon = True
        t.start()
+        on_exit.callback(_notify_queue_shutdown.set)


Why do we need this in addition to the sentinel object? And if the sentinel method doesn't work, shouldn't we replace it with the Event instead of doing both?

Nikratio · 2024-08-31T13:39:02Z

src/fuse_api.pxi

@@ -313,6 +315,7 @@ def main(workers=None, handle_signals=True):
            session_loop_single()
        else:
            session_loop_mt(workers)
+    t.join()


Would be a good idea to leave a comment here (in the code) on why this is needed, or someone may remove this again at some point as "not needed".

tetron added 2 commits March 12, 2024 20:07

Wait for notify thread to end before llfuse.main() completes.

7b6a47a

When fuse_session_remove_chan is called, the _notify_loop thread _must not_ be running or it may try to use the deleted channel and cause a segfault.

Use an event variable to tell the notify queue to drain

db9a169

Intended to speed up the notify queue shut down if there are pending events when running tests that repeatedly create and destroy llfuse sessions.
