Randomize cluster startup node order during topology refresh#4060

Merged
petyaslavova merged 13 commits into master from ps_randomize_startup_nodes_on_cluster_topology_reinitialization on May 20, 2026

@petyaslavova petyaslavova commented May 12, 2026

Collaborator

Randomizes the startup node iteration order during cluster topology initialization for both sync and async clients. This prevents many clients from consistently querying the same first startup node when reinitializing cluster state.

The implementation copies startup_nodes to a list, shuffles it when multiple nodes are available, and then proceeds with the existing initialization flow. Sync behavior still includes any additional startup nodes after the shuffled startup node list, preserving the existing MOVED refresh path behavior.
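The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ordered_startup_nodes` is a hypothetical helper, and plain strings stand in for whatever node objects the client actually tracks.

```python
import random


def ordered_startup_nodes(startup_nodes, additional_startup_nodes=()):
    """Return the order in which nodes would be tried (hypothetical helper)."""
    # Copy first so the caller's collection is never reordered.
    nodes = list(startup_nodes)
    # Shuffling a single node is pointless, so only shuffle when there is a choice.
    if len(nodes) > 1:
        random.shuffle(nodes)
    # Additional startup nodes keep their position after the shuffled list.
    return nodes + list(additional_startup_nodes)
```

With three startup nodes and two additional ones, the first three entries of the result come back in a random order while the last two always stay at the end.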

Adds sync and async cluster tests that use the real cluster fixture and mock only random.shuffle to make the order deterministic. The tests verify that initialization queries the node that becomes first after shuffling.
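A rough sketch of the mocking idea, using only the standard library: `random.shuffle` is replaced with a function that reverses the list in place, so the "shuffled" order is known in advance. The `first_node_to_query` function and the node names are hypothetical stand-ins; the real tests run against the cluster fixture.

```python
import random
from unittest import mock


def first_node_to_query(startup_nodes):
    """Hypothetical stand-in for the initialization step that picks a node."""
    nodes = list(startup_nodes)
    if len(nodes) > 1:
        random.shuffle(nodes)
    return nodes[0]


def reverse_in_place(seq):
    # Deterministic replacement for random.shuffle: reverse the list in place.
    seq.reverse()


with mock.patch("random.shuffle", side_effect=reverse_in_place):
    chosen = first_node_to_query(["node-a:7000", "node-b:7001", "node-c:7002"])

# After the mocked shuffle the last node has become the first one.
assert chosen == "node-c:7002"
```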

Fixes #4049


Note

Medium Risk
Changes cluster topology refresh/initialization ordering (sync + asyncio) and retry behavior by deferring the last failed node, which could affect failover/reconnect paths in production clusters. Test updates reduce flakiness but new shuffle/defer logic may surface edge cases with dynamic startup nodes and additional startup nodes.

Overview
Randomizes Redis Cluster topology refresh startup node selection for both sync and asyncio clients by shuffling the startup_nodes iteration order when multiple nodes exist, reducing the chance that many clients hammer the same first node.

Propagates a last_failed_node_name hint through retry/refresh paths so the node that just errored is tried after other startup and additional_startup_nodes during reinitialization (sync redis/cluster.py, async redis/asyncio/cluster.py).
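One way to picture the deferral is the sketch below. It is an assumption-laden illustration rather than the PR's code: `refresh_order` is a hypothetical helper, node names are plain strings, and only the `last_failed_node_name` parameter name is taken from the summary above.

```python
import random


def refresh_order(startup_nodes, additional_startup_nodes=(), last_failed_node_name=None):
    """Return candidate nodes with the last failed node moved to the end."""
    nodes = list(startup_nodes)
    if len(nodes) > 1:
        random.shuffle(nodes)
    candidates = nodes + list(additional_startup_nodes)
    # Split the candidates so the node that just errored is tried last.
    others = [n for n in candidates if n != last_failed_node_name]
    deferred = [n for n in candidates if n == last_failed_node_name]
    return others + deferred
```

For example, `refresh_order(["a", "b", "c"], ["d"], last_failed_node_name="b")` returns a list that always ends with `"b"`, whatever order the shuffle produced for the others.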

Adds/adjusts tests to deterministically validate the shuffle behavior (mocking random.shuffle), updates maintenance-notification tests to pin startup-node ordering, and makes lock blocking-timeout tests deterministic by monkeypatching time/asyncio sleep.
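The idea of making a blocking-timeout test deterministic can be sketched with a fake clock. This example uses `unittest.mock` rather than pytest's `monkeypatch`, and `FakeClock` and `acquire_with_timeout` are hypothetical; they are not the lock implementation or the tests from this PR.

```python
import time
from unittest import mock


class FakeClock:
    """Hypothetical clock whose sleep() advances time instantly."""

    def __init__(self):
        self.now = 0.0

    def monotonic(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def acquire_with_timeout(try_acquire, blocking_timeout, sleep_interval):
    """Simplified polling loop: retry until the next sleep would pass the deadline."""
    deadline = time.monotonic() + blocking_timeout
    while True:
        if try_acquire():
            return True
        if time.monotonic() + sleep_interval > deadline:
            return False
        time.sleep(sleep_interval)


clock = FakeClock()
attempts = []


def never_acquire():
    # Record the fake time of each attempt; the lock is never free.
    attempts.append(clock.now)
    return False


with mock.patch("time.monotonic", clock.monotonic), mock.patch("time.sleep", clock.sleep):
    acquired = acquire_with_timeout(never_acquire, blocking_timeout=1.0, sleep_interval=0.25)

assert acquired is False
assert attempts == [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because no real sleeping happens, the number of attempts and the elapsed fake time are the same on every run, which is the property a flaky timing test lacks.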

Reviewed by Cursor Bugbot for commit caba14d. Bugbot is set up for automated code reviews on this repo.

@petyaslavova petyaslavova added the maintenance Maintenance (CI, Releases, etc) label May 12, 2026
@jit-ci

jit-ci Bot commented May 12, 2026


🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

…ceive the last failed node as argument and it is moved to be the last option for topology refresh
Comment thread redis/asyncio/cluster.py Outdated
Comment thread redis/asyncio/cluster.py
…etter maint notifications behaviour the randomization is mocked to keep the original order
@petyaslavova petyaslavova requested a review from vladvildanov May 15, 2026 13:01
Comment thread redis/asyncio/cluster.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8fcae90. Configure here.

Comment thread tests/test_lock.py
@petyaslavova petyaslavova merged commit 32be6a5 into master May 20, 2026
679 of 680 checks passed
@petyaslavova petyaslavova deleted the ps_randomize_startup_nodes_on_cluster_topology_reinitialization branch May 20, 2026 19:18
petyaslavova added a commit that referenced this pull request May 26, 2026
* Randomize cluster startup node order during topology refresh

* Fixing failing tests and adding randomization improvement - the last failed node is now received as an argument and moved to be the last option for topology refresh

* Fixing linters

* Applying review comment

* Fixing flaky test - flakiness appeared after the randomization. For better maint notifications behaviour the randomization is mocked to keep the original order

* Fix flaky tests by introducing mocked timer

* Fixing tests after bad conflict resolution

Development

Successfully merging this pull request may close these issues.

CLUSTER SLOTS re-initialization cascade: deterministic startup_nodes ordering causes network saturation on first-slot node in large clusters