Fix: Prevent session manager shutdown on individual session crash #841
To address a test failure, I updated the cleanup logic to remove only error sessions, not those that have been explicitly terminated. I see that terminated sessions are retained and used for the 404 vs 400 return code logic. However, I do not see any place where they are ever removed, so I imagine they just accumulate until server restart. Addressing that is outside the scope of this PR, but it's likely not desirable behavior.
Any update on this, @soby? Maybe this was implemented in some other issue?
@UYousafzai it's ready to merge IMO. I've been running it in a fork and it works as intended to keep the server up and stable. Hopefully it can make it into tomorrow's release. @ihrpr ?
…python-sdk into fix/session-manager-resilience
self.app.create_initialization_options(),
stateless=True,
)
try:
Did we look into the .run function to see how to handle the error there?
@NAVNAV221 That should also happen, but there should be no scenario in which a per-session error is allowed to destabilize the entire server until reboot. This is a catch-all to make sure that the server as a whole survives any errors where proper error handling was missed.
@Sillocan Those handle specific exception sources within a connection (what @NAVNAV221 was referencing). This PR implements catch-all exception handling for all streams/sessions, so that any future unhandled exceptions within a stream/session are logged but do not corrupt the whole server until it is restarted.
@NAVNAV221 Is this good to merge?
stateless=False,  # Stateful mode
)
except Exception as e:
    logger.warning(
Why are you using log.warning for the crash exception? Seems like it should be log.error?
I originally logged it at error level, but it was lighting up my bug reporting system with uncaught exceptions related to trivial things like the client closing the TCP connection unexpectedly. "Be usable in the real world" has to take priority.
Thank you for working on this and adding comprehensive tests! Try-catching the app.run() calls makes sense to me to stop errors propagating to other concurrent requests.
For the changes to logging severity - did you see any specific errors that were flooding your terminal that we might be able to isolate?
I think we may miss important errors this way if we hide them. Could we instead do something like this:
except anyio.ClosedResourceError:
    # Expected when client disconnects - this is normal
    logger.warning("Client disconnected from SSE stream")
except asyncio.CancelledError:
    # Expected during shutdown
    logger.warning("SSE writer cancelled")
    raise  # Must re-raise cancellation!
except Exception as e:
    # Unexpected error - this is a real problem!
    logger.exception(f"Unexpected error in SSE writer: {e}")
That would help us clearly see the difference between expected errors and actual errors that we should look into.
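For readers who want to try the pattern outside the SDK, here is a minimal, self-contained sketch of the same three-way split. It is an illustration under stated assumptions, not the SDK's code: ClientDisconnected is a hypothetical stand-in for anyio.ClosedResourceError (so the example needs only the standard library), and write_guarded, dropped, and broken are made-up names.

```python
import asyncio
import logging

logger = logging.getLogger("sse_writer_sketch")


class ClientDisconnected(Exception):
    """Hypothetical stand-in for anyio.ClosedResourceError."""


async def write_guarded(writer) -> str:
    """Await writer() and report which handling branch was taken."""
    try:
        await writer()
        return "ok"
    except ClientDisconnected:
        # Expected when the client goes away: log quietly.
        logger.warning("Client disconnected from SSE stream")
        return "disconnected"
    except asyncio.CancelledError:
        # Expected during shutdown: log, then re-raise so cancellation completes.
        logger.warning("SSE writer cancelled")
        raise
    except Exception:
        # Anything else is unexpected: log with traceback.
        logger.exception("Unexpected error in SSE writer")
        return "error"


async def dropped() -> None:
    """Simulates a client closing the connection."""
    raise ClientDisconnected()


async def broken() -> None:
    """Simulates a genuine bug in the writer."""
    raise ValueError("bad payload")
```

Calling asyncio.run(write_guarded(dropped)) takes the quiet branch, while asyncio.run(write_guarded(broken)) takes the logger.exception branch, so only the second kind of failure would reach an error-level alerting system.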
async def mock_send(message):
    sent_messages.append(message)

scope = {"type": "http", "method": "POST", "path": "/mcp", "headers": []}
I think you need the headers here to make CI pass:
scope = {
    "type": "http",
    "method": "POST",
    "path": "/mcp",
    "headers": [(b"content-type", b"application/json")],
}
if message["type"] == "http.response.start" and message["status"] >= 500:
    pass  # Expected if TestException propagates that far up the transport

scope = {"type": "http", "method": "POST", "path": "/mcp", "headers": []}
I think you need the headers here to make CI pass:
scope = {
    "type": "http",
    "method": "POST",
    "path": "/mcp",
    "headers": [(b"content-type", b"application/json")],
}
http_transport.mcp_session_id
and http_transport.mcp_session_id in self._server_instances
and not (
    hasattr(http_transport, "_terminated") and http_transport._terminated  # pyright: ignore
To avoid having to override the linter, should we have http_transport expose a public property for checking termination state rather than directly accessing a private attribute?
# streamable_http.py
@property
def is_terminated(self) -> bool:
    """Check if this transport has been explicitly terminated."""
    return getattr(self, '_terminated', False)

# here
if (
    http_transport.mcp_session_id
    and http_transport.mcp_session_id in self._server_instances
    and not http_transport.is_terminated
):
# It's possible handle_request itself might raise an error if the TestException
# isn't caught by the transport layer before propagating.
# The key is that the session manager's internal task for MCPServer.run
# encounters the exception.
try:
    await manager.handle_request(scope, mock_receive, mock_send)
except TestException:
    # This might be caught here if not handled by StreamableHTTPServerTransport's
    # error handling
    pass
Do we actually need this try catch and explanation? I ran this test locally 50x without the Try/Catch and the test worked fine without it. Can we remove the try catch and comment?
When would the exception not be caught by the app.run try-catch you added in the manager?
Previously, an unhandled exception within a single MCP session's MCPServer.run() task could propagate to the StreamableHTTPSessionManager's main task group. This would cause the entire task group to cancel, effectively shutting down the session manager and terminating all active sessions.
This commit addresses the issue by wrapping the self.app.run(...) call within the run_server (for stateful requests) and run_stateless_server (for stateless requests) inner functions in StreamableHTTPSessionManager with a try...except Exception block.
This change ensures that if a single session encounters an unexpected error and crashes, it only affects that specific session. The StreamableHTTPSessionManager will continue to run, and other active sessions will remain operational. This significantly improves the robustness and availability of the server.
Motivation and Context
Unhandled exceptions, such as a network error, would render the server unusable until restart.
How Has This Been Tested?
Yes, specifically by generating client disconnects and observing the server log the unhandled error but remain running and stable
Breaking Changes
No breaking changes