bugfix: prevent Netty I/O thread blocking by async channel release via reconnectExecutor #7505
Conversation
Codecov Report

Attention: patch coverage is 18.18% of the diff (project target 60.50%).

@@             Coverage Diff             @@
##                2.x    #7505     +/-  ##
============================================
+ Coverage     60.50%   60.61%   +0.11%
  Complexity      658      658
============================================
  Files          1307     1307
  Lines         49359    49377      +18
  Branches       5805     5805
============================================
+ Hits          29865    29932      +67
+ Misses        16848    16791      -57
- Partials       2646     2654       +8
Force-pushed from e24feeb to b60f6d1.
            new NamedThreadFactory(getThreadPrefix(), MAX_MERGE_SEND_THREAD));
        mergeSendExecutorService.submit(new MergedSendRunnable());
    }
    reconnectExecutor = new ThreadPoolExecutor(
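For context, a minimal self-contained sketch of the pattern this diff introduces — a dedicated executor built with a naming thread factory so reconnect work is easy to spot in thread dumps — might look like the following. The prefix, pool size, and queue capacity here are illustrative assumptions, not Seata's actual values, and `named` is a hypothetical stand-in for Seata's `NamedThreadFactory`:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ReconnectExecutorSketch {
    // Hypothetical stand-in for NamedThreadFactory: gives worker threads a
    // recognizable prefix so reconnect activity is easy to trace.
    static ThreadFactory named(String prefix) {
        AtomicInteger seq = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, prefix + "_" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService reconnectExecutor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(2000),
                named("rpcReconnect"));
        // Verify that submitted work runs on a recognizably named thread.
        Future<String> workerName = reconnectExecutor.submit(
                () -> Thread.currentThread().getName());
        System.out.println(workerName.get().startsWith("rpcReconnect"));
        reconnectExecutor.shutdown();
    }
}
```

The single-threaded pool also serializes releases, which avoids concurrent release attempts on the same channel.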
Why not use the timerExecutor thread pool directly?
Hi @funky-eyes
I added a separate thread pool mainly to give it a clear thread name, so it’s easier to trace when reconnect-related issues happen. But if that feels unnecessary here, I’m happy to switch to using the existing timerExecutor. Let me know what you think!
IMO, creating a new thread pool is generally preferred when there’s a high-priority task or when real-time processing is critical. Also, how frequently the issue occurs can be an important factor in deciding whether to create a new thread pool.
From a simple traceability standpoint, the current timeoutExecutor is used in several places — such as handling reconnections and removing timed-out messages.
So, it might make sense to update the prefix to something more general that fits well across all these usages, and reuse it accordingly.
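The reuse the reviewer suggests — one shared scheduled pool under a more general thread-name prefix, serving both timed-out message cleanup and reconnect handling — could be sketched as follows. The prefix `rpcDispatch` is an illustrative assumption, not a name Seata actually uses:

```java
import java.util.concurrent.*;

public class SharedTimerSketch {
    public static void main(String[] args) throws Exception {
        // One scheduled pool with a general-purpose thread name serves both
        // periodic cleanup and reconnect tasks; "rpcDispatch" is illustrative.
        ScheduledExecutorService timerExecutor =
                Executors.newScheduledThreadPool(1,
                        r -> new Thread(r, "rpcDispatch_1"));
        // A reconnect-style task scheduled on the shared pool; the returned
        // thread name shows which pool actually ran it.
        ScheduledFuture<String> ranOn = timerExecutor.schedule(
                () -> Thread.currentThread().getName(),
                10, TimeUnit.MILLISECONDS);
        System.out.println(ranOn.get());
        timerExecutor.shutdown();
    }
}
```

The trade-off matches the comment above: a shared pool is leaner, while a dedicated pool isolates high-priority or latency-sensitive work.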
> Hi @funky-eyes I added a separate thread pool mainly to give it a clear thread name, so it’s easier to trace when reconnect-related issues happen. But if that feels unnecessary here, I’m happy to switch to using the existing timerExecutor. Let me know what you think!

I wrote very clearly in the issue why we should reuse the timeoutExecutor. I suggest you take a look at the reasoning there.
Thanks for pointing that out. I revisited the issue and I now see your point more clearly — reusing timeoutExecutor makes sense, especially since it’s already used for similar tasks like handling reconnections and timeouts. Avoiding an extra thread pool also helps keep things lean and easier to manage.
I’ll go ahead and update the code to reuse timeoutExecutor accordingly. Appreciate the feedback!
I’ve finished the changes. Would you mind taking a look when you have time?
@funky-eyes @YongGoose
Force-pushed from 6a1dfca to 0489f84.
funky-eyes left a comment:
LGTM
    handler.exceptionCaught(mockCtx, new IllegalArgumentException("test"));

    Thread.sleep(500);
    verify(spyManager).releaseChannel(eq(channel), anyString());
Is there a specific reason you used anyString()?
It seems like serverAddress will always be 127.0.0.1:8091, so I’m wondering if matching the exact value would be more appropriate.
You’re right — that makes sense to me as well. I’ll go ahead and update it to match the exact value instead of using anyString().
This modification cannot solve the problem. I think you should reproduce the problem first, then analyze the cause, and retest after the modification.

To reproduce: start three TC nodes on ports 8091/8092/8093, start a business-xa application, and observe the corresponding TCP connection ports. Then kill the node on 8091 and observe whether the TCP connection is disconnected and reconnected.

After the disconnect/reconnect behavior is resolved, run a stress test to check whether TPS drops to 0 during the shutdown window after killing one of the TC nodes.
funky-eyes left a comment:
All places in ChannelHandler that call the clientChannelManager.releaseChannel method need to be made asynchronous.
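A hedged sketch of that pattern — each call site handing the potentially blocking release off to an executor instead of running it inline on the Netty I/O thread. `releaseChannelBlocking` and `onChannelInactive` are hypothetical stand-ins for `clientChannelManager.releaseChannel` and a `ChannelHandler` callback, used only to show where the hand-off happens:

```java
import java.util.concurrent.*;

public class AsyncReleaseSketch {
    static final ExecutorService reconnectExecutor =
            Executors.newSingleThreadExecutor();

    // Stand-in for clientChannelManager.releaseChannel: may block (locks,
    // reconnect attempts), which is why it must not run on the I/O thread.
    static void releaseChannelBlocking(String serverAddress) {
        try { Thread.sleep(100); } catch (InterruptedException ignored) {}
        System.out.println("released " + serverAddress + " on "
                + Thread.currentThread().getName());
    }

    // Pattern suggested in the review: every ChannelHandler call site
    // submits the release instead of invoking it inline.
    static void onChannelInactive(String serverAddress) {
        reconnectExecutor.execute(() -> releaseChannelBlocking(serverAddress));
    }

    public static void main(String[] args) throws Exception {
        onChannelInactive("127.0.0.1:8091");
        reconnectExecutor.shutdown();
        reconnectExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The printed thread name confirms the release ran on the executor's worker thread, not on the caller.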
Force-pushed from 15a1116 to e31f5ab.
18.18% of diff hit (target 60.50%)

Sure, I’ll take care of it!
Ⅰ. Describe what this PR did

Fixes a potential Netty I/O thread blocking issue by executing releaseChannel() asynchronously via a dedicated reconnectExecutor thread pool. Also ensures proper shutdown of reconnectExecutor to avoid thread leaks.

Ⅱ. Does this pull request fix one issue?

fixes #7497

Ⅲ. Why don't you add test cases (unit test/integration test)?

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

A dedicated thread pool (rpcReconnectExecutor) is aligned with the existing merge thread patterns. reconnectExecutor is now managed with proper init and destroy lifecycle hooks.
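The destroy side of that lifecycle could be sketched as below: drain the executor, then force-stop if it does not terminate in time. This is a minimal sketch of the standard shutdown idiom, not the exact code in the PR; the timeout value is an illustrative assumption:

```java
import java.util.concurrent.*;

public class ExecutorLifecycleSketch {
    // Graceful-then-forced shutdown: stop accepting work, wait briefly for
    // in-flight tasks, then interrupt anything still running.
    static void destroyExecutor(ExecutorService executor) {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(2, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService reconnectExecutor = Executors.newSingleThreadExecutor();
        reconnectExecutor.execute(() -> System.out.println("reconnect task ran"));
        destroyExecutor(reconnectExecutor);
        System.out.println("terminated=" + reconnectExecutor.isTerminated());
    }
}
```

Without such a destroy hook, a non-daemon worker thread would outlive the client and leak, which is the thread-leak concern the description mentions.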