bugfix: prevent Netty I/O thread blocking by async channel release via reconnectExecutor #7505
Conversation
Codecov Report

Attention: patch coverage is 18.18% of the diff (project target 60.50%).

@@             Coverage Diff             @@
##                2.x    #7505     +/-  ##
============================================
+ Coverage     60.50%   60.61%   +0.11%
  Complexity      658      658
============================================
  Files          1307     1307
  Lines         49359    49377      +18
  Branches       5805     5805
============================================
+ Hits          29865    29932      +67
+ Misses        16848    16791      -57
- Partials       2646     2654       +8
Force-pushed from e24feeb to b60f6d1.
            new NamedThreadFactory(getThreadPrefix(), MAX_MERGE_SEND_THREAD));
        mergeSendExecutorService.submit(new MergedSendRunnable());
    }
    reconnectExecutor = new ThreadPoolExecutor(
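For context, a minimal self-contained sketch of the pattern this diff introduces — a dedicated executor built with a naming thread factory so reconnect work is easy to spot in thread dumps — might look like the following. The prefix, pool size, and queue capacity here are illustrative assumptions, not Seata's actual values, and `named` is a hypothetical stand-in for Seata's `NamedThreadFactory`:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ReconnectExecutorSketch {
    // Hypothetical stand-in for NamedThreadFactory: gives worker threads a
    // recognizable prefix so reconnect activity is easy to trace.
    static ThreadFactory named(String prefix) {
        AtomicInteger seq = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, prefix + "_" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService reconnectExecutor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(2000),
                named("rpcReconnect"));
        // Verify that submitted work runs on a recognizably named thread.
        Future<String> workerName = reconnectExecutor.submit(
                () -> Thread.currentThread().getName());
        System.out.println(workerName.get().startsWith("rpcReconnect"));
        reconnectExecutor.shutdown();
    }
}
```

The single-threaded pool also serializes releases, which avoids concurrent release attempts on the same channel.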
Why not use the timerExecutor thread pool directly?
Hi @funky-eyes
I added a separate thread pool mainly to give it a clear thread name, so it’s easier to trace when reconnect-related issues happen. But if that feels unnecessary here, I’m happy to switch to using the existing timerExecutor. Let me know what you think!
IMO, creating a new thread pool is generally preferred when there’s a high-priority task or when real-time processing is critical. Also, how frequently the issue occurs can be an important factor in deciding whether to create a new thread pool.
From a simple traceability standpoint, the current timeoutExecutor is used in several places — such as handling reconnections and removing timed-out messages.
So, it might make sense to update the prefix to something more general that fits well across all these usages, and reuse it accordingly.
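The reuse the reviewer suggests — one shared scheduled pool under a more general thread-name prefix, serving both timed-out message cleanup and reconnect handling — could be sketched as follows. The prefix `rpcDispatch` is an illustrative assumption, not a name Seata actually uses:

```java
import java.util.concurrent.*;

public class SharedTimerSketch {
    public static void main(String[] args) throws Exception {
        // One scheduled pool with a general-purpose thread name serves both
        // periodic cleanup and reconnect tasks; "rpcDispatch" is illustrative.
        ScheduledExecutorService timerExecutor =
                Executors.newScheduledThreadPool(1,
                        r -> new Thread(r, "rpcDispatch_1"));
        // A reconnect-style task scheduled on the shared pool; the returned
        // thread name shows which pool actually ran it.
        ScheduledFuture<String> ranOn = timerExecutor.schedule(
                () -> Thread.currentThread().getName(),
                10, TimeUnit.MILLISECONDS);
        System.out.println(ranOn.get());
        timerExecutor.shutdown();
    }
}
```

The trade-off matches the comment above: a shared pool is leaner, while a dedicated pool isolates high-priority or latency-sensitive work.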
> Hi @funky-eyes I added a separate thread pool mainly to give it a clear thread name, so it’s easier to trace when reconnect-related issues happen. But if that feels unnecessary here, I’m happy to switch to using the existing timerExecutor. Let me know what you think!

I wrote very clearly in the issue why we should reuse the timeoutExecutor. I suggest you take a look at the reasoning there.
Thanks for pointing that out. I revisited the issue and I now see your point more clearly — reusing timeoutExecutor makes sense, especially since it’s already used for similar tasks like handling reconnections and timeouts. Avoiding an extra thread pool also helps keep things lean and easier to manage.
I’ll go ahead and update the code to reuse timeoutExecutor accordingly. Appreciate the feedback!
I’ve finished the changes. Would you mind taking a look when you have time?
@funky-eyes @YongGoose
Force-pushed from 6a1dfca to 0489f84.
funky-eyes left a comment:
LGTM
    handler.exceptionCaught(mockCtx, new IllegalArgumentException("test"));

    Thread.sleep(500);
    verify(spyManager).releaseChannel(eq(channel), anyString());
Is there a specific reason you used anyString()?
It seems like serverAddress will always be 127.0.0.1:8091, so I’m wondering if matching the exact value would be more appropriate.
You’re right — that makes sense to me as well. I’ll go ahead and update it to match the exact value instead of using anyString().
This modification cannot solve the problem. I think you should reproduce the problem first, then analyze the cause, and retest after the modification.

To reproduce: start three TC nodes on ports 8091/8092/8093, start a business-xa application, and observe the corresponding TCP connection ports. Then kill the node on 8091 and observe whether the TCP connection is disconnected and reconnected.

After the disconnect/reconnect behavior is resolved, run a stress test to check whether TPS drops to 0 during the shutdown window after killing one of the TC nodes.
funky-eyes left a comment:
All places in ChannelHandler that call the clientChannelManager.releaseChannel method need to be made asynchronous.
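A hedged sketch of that pattern — each call site handing the potentially blocking release off to an executor instead of running it inline on the Netty I/O thread. `releaseChannelBlocking` and `onChannelInactive` are hypothetical stand-ins for `clientChannelManager.releaseChannel` and a `ChannelHandler` callback, used only to show where the hand-off happens:

```java
import java.util.concurrent.*;

public class AsyncReleaseSketch {
    static final ExecutorService reconnectExecutor =
            Executors.newSingleThreadExecutor();

    // Stand-in for clientChannelManager.releaseChannel: may block (locks,
    // reconnect attempts), which is why it must not run on the I/O thread.
    static void releaseChannelBlocking(String serverAddress) {
        try { Thread.sleep(100); } catch (InterruptedException ignored) {}
        System.out.println("released " + serverAddress + " on "
                + Thread.currentThread().getName());
    }

    // Pattern suggested in the review: every ChannelHandler call site
    // submits the release instead of invoking it inline.
    static void onChannelInactive(String serverAddress) {
        reconnectExecutor.execute(() -> releaseChannelBlocking(serverAddress));
    }

    public static void main(String[] args) throws Exception {
        onChannelInactive("127.0.0.1:8091");
        reconnectExecutor.shutdown();
        reconnectExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The printed thread name confirms the release ran on the executor's worker thread, not on the caller.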
Force-pushed from 15a1116 to e31f5ab.
18.18% of diff hit (target 60.50%)

Sure, I’ll take care of it!
Ⅰ. Describe what this PR did

Fixes a potential Netty I/O thread blocking issue by executing releaseChannel() asynchronously via a dedicated reconnectExecutor thread pool. Also ensures proper shutdown of reconnectExecutor to avoid thread leaks.

Ⅱ. Does this pull request fix one issue?

fixes #7497

Ⅲ. Why don't you add test cases (unit test/integration test)?

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

A dedicated thread pool (rpcReconnectExecutor) is aligned with the existing merge thread patterns. reconnectExecutor is now managed with proper init and destroy lifecycle hooks.
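The destroy side of that lifecycle could be sketched as below: drain the executor, then force-stop if it does not terminate in time. This is a minimal sketch of the standard shutdown idiom, not the exact code in the PR; the timeout value is an illustrative assumption:

```java
import java.util.concurrent.*;

public class ExecutorLifecycleSketch {
    // Graceful-then-forced shutdown: stop accepting work, wait briefly for
    // in-flight tasks, then interrupt anything still running.
    static void destroyExecutor(ExecutorService executor) {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(2, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService reconnectExecutor = Executors.newSingleThreadExecutor();
        reconnectExecutor.execute(() -> System.out.println("reconnect task ran"));
        destroyExecutor(reconnectExecutor);
        System.out.println("terminated=" + reconnectExecutor.isTerminated());
    }
}
```

Without such a destroy hook, a non-daemon worker thread would outlive the client and leak, which is the thread-leak concern the description mentions.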