Fix: Prevent session manager shutdown on individual session crash #841

soby · 2025-05-29T17:29:31Z

Previously, an unhandled exception within a single MCP session's MCPServer.run() task could propagate to the StreamableHTTPSessionManager's main task group. This would cause the entire task group to cancel, effectively shutting down the session manager and terminating all active sessions.

This commit addresses the issue by:

Wrapping the self.app.run(...) call within the run_server (for stateful requests) and run_stateless_server (for stateless requests) inner functions in StreamableHTTPSessionManager with a try...except Exception block.
Logging any caught exceptions along with the session ID (for stateful requests) to aid in debugging the crashed session.

This change ensures that if a single session encounters an unexpected error and crashes, it only affects that specific session. The StreamableHTTPSessionManager will continue to run, and other active sessions will remain operational. This significantly improves the robustness and availability of the server.
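
A minimal sketch of the containment idea described above, using only the standard library. The names `run_session` and `run_session_guarded` are hypothetical, and `asyncio.gather` stands in for the session manager's task group; this is not the SDK's actual code.

```python
import asyncio
import logging

logger = logging.getLogger("session_manager_sketch")


async def run_session(session_id: str, fail: bool) -> str:
    """Hypothetical stand-in for one session's app.run(...) call."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"session {session_id} crashed")
    return f"session {session_id} finished"


async def run_session_guarded(session_id: str, fail: bool):
    """Keep any non-cancellation error inside the session's own task."""
    try:
        return await run_session(session_id, fail)
    except Exception:
        # On Python 3.8+ CancelledError derives from BaseException, so
        # cancellation still propagates and shutdown is not blocked.
        logger.warning("Session %s crashed", session_id, exc_info=True)
        return None


async def main() -> list:
    # Without the guard, the error from session "b" would reach the caller.
    return await asyncio.gather(
        run_session_guarded("a", fail=False),
        run_session_guarded("b", fail=True),
        run_session_guarded("c", fail=False),
    )


results = asyncio.run(main())
```

In this sketch the crashed session yields `None` while the other two sessions complete normally.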

Motivation and Context

Unhandled exceptions, such as a network error, would render the server unusable until restart.

How Has This Been Tested?

Yes, specifically by generating client disconnects and observing that the server logs the unhandled error but remains running and stable.

Breaking Changes

No breaking changes

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update

Checklist

I have read the MCP Documentation
My code follows the repository's style guidelines
New and existing tests pass locally
I have added appropriate error handling
I have added or updated documentation as needed

Additional context

soby · 2025-05-29T18:05:15Z

To address a test failure, I updated the cleanup logic to only remove error sessions and not those that have been explicitly terminated. I see that the terminated sessions are retained and used for the 404 vs 400 return code logic. However, I do not see any place where those are ever removed, so I imagine they just accumulate until server restart. It's not within the scope of this PR to address, but that's likely not desirable behavior.
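
A rough sketch of that cleanup rule, assuming a plain dict of session instances. `FakeTransport`, its `terminated` attribute, and `cleanup_after_crash` are hypothetical names for illustration, not the SDK's API.

```python
class FakeTransport:
    """Hypothetical stand-in for the per-session HTTP transport."""

    def __init__(self, session_id: str) -> None:
        self.mcp_session_id = session_id
        self.terminated = False  # set when the client explicitly ends the session


def cleanup_after_crash(instances: dict, transport: FakeTransport) -> None:
    """Drop a crashed session, but keep explicitly terminated ones."""
    sid = transport.mcp_session_id
    if sid and sid in instances and not transport.terminated:
        del instances[sid]


instances = {"crashed": object(), "terminated": object()}

crashed = FakeTransport("crashed")
ended = FakeTransport("terminated")
ended.terminated = True

cleanup_after_crash(instances, crashed)
cleanup_after_crash(instances, ended)

remaining = sorted(instances)
```

Only the terminated session is left in `instances`, which also illustrates the accumulation concern: nothing in this sketch ever removes it.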

UYousafzai · 2025-06-04T12:14:45Z

any update on this @soby? maybe they implemented this in some other issue perhaps?

soby · 2025-06-04T13:36:28Z

@UYousafzai it's ready to merge IMO. I've been running it in a fork and it works as intended to keep the server up and stable. Hopefully it can make it into tomorrow's release. @ihrpr ?

NAVNAV221 · 2025-06-05T16:39:54Z

src/mcp/server/streamable_http_manager.py

-                    self.app.create_initialization_options(),
-                    stateless=True,
-                )
+                try:


Did we look into .run function to see how to handle the error there?

@NAVNAV221 That should also happen, but there should be no scenarios in which a per-session error is allowed to destabilize the entire server until reboot. This is a catch-all to make sure that the server as a whole survives any errors where proper error handling was missed.

Sillocan · 2025-06-08T18:13:38Z

Fairly certain this is a duplicate of #820 and #822 which is associated with a hackerone issue

soby · 2025-06-10T19:50:31Z

@Sillocan Those are for handling specific exception sources within a connection (what @NAVNAV221 was referencing). This PR implements catch-all exception handling for all streams/sessions, so that any future unhandled exceptions within a stream/session are logged but do not corrupt the whole server until it is restarted.

soby · 2025-06-15T23:46:57Z

@NAVNAV221 Is this good to merge?

bendavis78 · 2025-06-22T15:58:52Z

src/mcp/server/streamable_http_manager.py

+                                stateless=False,  # Stateful mode
+                            )
+                        except Exception as e:
+                            logger.warning(


Why are you using log.warning for the crash exception? Seems like it should be log.error?

I originally logged it at error but it was lighting up my bug reporting system with uncaught exceptions related to trivial things like the client closing the tcp connection unexpectedly. "Be usable in the real world" has to take priority.

felixweinberger

Thank you for working on this and adding comprehensive tests! Wrapping the app.run() calls in try/except makes sense to me to stop errors propagating to other concurrent requests.

For the changes to logging severity - did you see any specific errors that were flooding your terminal that we might be able to isolate?

I think we may miss important errors this way if we hide them. Could we instead do something like this:

  except anyio.ClosedResourceError:
      # Expected when client disconnects - this is normal
      logger.warning("Client disconnected from SSE stream")
  except asyncio.CancelledError:
      # Expected during shutdown
      logger.warning("SSE writer cancelled")
      raise  # Must re-raise cancellation!
  except Exception as e:
      # Unexpected error - this is a real problem!
      logger.exception(f"Unexpected error in SSE writer: {e}")

That would help us clearly see the difference between expected errors and actual errors that we should look into.
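
A self-contained sketch of this tiered approach, for illustration only. It uses the built-in `ConnectionError` as a stand-in for `anyio.ClosedResourceError`, and the `guarded`, `disconnect`, and `bug` helpers are hypothetical.

```python
import asyncio
import logging


class ListHandler(logging.Handler):
    """Collect log records in memory so the levels can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("tiered_logging_sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


async def guarded(coro_fn) -> None:
    try:
        await coro_fn()
    except ConnectionError:
        # Expected: the client went away (stand-in for ClosedResourceError).
        logger.warning("Client disconnected")
    except asyncio.CancelledError:
        # Expected during shutdown; cancellation must be re-raised.
        logger.warning("Session task cancelled")
        raise
    except Exception:
        # Unexpected: logged at ERROR level with the traceback attached.
        logger.exception("Unexpected error in session")


async def disconnect() -> None:
    raise ConnectionResetError("peer closed the connection")


async def bug() -> None:
    raise ValueError("bad internal state")


async def main() -> None:
    await guarded(disconnect)
    await guarded(bug)


asyncio.run(main())
levels = [record.levelname for record in handler.records]
```

Here the disconnect is recorded as a warning and the genuine bug as an error, so an error-reporting system keyed on ERROR would only see the second.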

felixweinberger · 2025-06-23T15:27:54Z

tests/server/test_streamable_http_manager.py

+    async def mock_send(message):
+        sent_messages.append(message)
+
+    scope = {"type": "http", "method": "POST", "path": "/mcp", "headers": []}


I think you need the headers here to make CI pass:

  scope = {
      "type": "http",
      "method": "POST",
      "path": "/mcp",
      "headers": [(b"content-type", b"application/json")],
  }

felixweinberger · 2025-06-23T15:29:01Z

tests/server/test_streamable_http_manager.py

+        if message["type"] == "http.response.start" and message["status"] >= 500:
+            pass  # Expected if TestException propagates that far up the transport
+
+    scope = {"type": "http", "method": "POST", "path": "/mcp", "headers": []}


I think you need the headers here to make CI pass:

  scope = {
      "type": "http",
      "method": "POST",
      "path": "/mcp",
      "headers": [(b"content-type", b"application/json")],
  }

felixweinberger · 2025-06-23T16:01:58Z

src/mcp/server/streamable_http_manager.py

+                                http_transport.mcp_session_id
+                                and http_transport.mcp_session_id in self._server_instances
+                                and not (
+                                    hasattr(http_transport, "_terminated") and http_transport._terminated  # pyright: ignore


To avoid having to override the linter, should we have http_transport expose a public property for checking termination state rather than directly access a private property?

  # streamable_http.py
  @property
  def is_terminated(self) -> bool:
      """Check if this transport has been explicitly terminated."""
      return getattr(self, '_terminated', False)

  # here
  if (
      http_transport.mcp_session_id
      and http_transport.mcp_session_id in self._server_instances
      and not http_transport.is_terminated
  ):

felixweinberger · 2025-06-23T16:21:40Z

tests/server/test_streamable_http_manager.py

+    # It's possible handle_request itself might raise an error if the TestException
+    # isn't caught by the transport layer before propagating.
+    # The key is that the session manager's internal task for MCPServer.run
+    # encounters the exception.
+    try:
+        await manager.handle_request(scope, mock_receive, mock_send)
+    except TestException:
+        # This might be caught here if not handled by StreamableHTTPServerTransport's
+        # error handling
+        pass


Do we actually need this try/except and explanation? I ran this test locally 50x without the try/except and it worked fine. Can we remove the try/except and the comment?

When would the exception not be caught by the app.run try-catch you added in the manager?

soby added 2 commits May 29, 2025 11:25

Fix: Prevent session manager shutdown on individual session crash

694cabc

only remove error sessions and not those that have not been terminated

a804442

lint

4026edf

soby mentioned this pull request May 29, 2025

RuntimeError: Task group is not initialized in the simplest possible MCP server with default config #838

Open

soby and others added 2 commits June 2, 2025 16:02

step down logging from error to warning

7671963

Merge branch 'main' into fix/session-manager-resilience

2a27e65

soby added 2 commits June 4, 2025 13:19

step down logging for other places subject to network closures/discon…

b12b370

…nects

Merge branch 'fix/session-manager-resilience' of github.com:soby/mcp-…

8067bc9

…python-sdk into fix/session-manager-resilience

NAVNAV221 reviewed Jun 5, 2025

Merge branch 'main' into fix/session-manager-resilience

ec85fa3

soby and others added 3 commits June 15, 2025 17:32

Merge branch 'main' into fix/session-manager-resilience

ab0c809

ruff

131a58d

linter differences of opinion

6165a30

Merge branch 'main' into fix/session-manager-resilience

e25081b

ihrpr added this to the HPR milestone Jun 16, 2025

soby added 2 commits June 17, 2025 11:27

Merge branch 'main' into fix/session-manager-resilience

61f3dc2

Merge branch 'main' into fix/session-manager-resilience

208f13b

bendavis78 reviewed Jun 22, 2025

felixweinberger requested changes Jun 23, 2025
