Codestin Search App

generall · 2026-05-12T09:48:09Z

Fix for https://github.com/qdrant/qdrant/actions/runs/25706195257/job/75476715323?pr=8999

Summary

Fixes a race in tests/consensus_tests/test_failed_snapshot_recovery.py::test_corrupted_snapshot_recovery where the manual recover_shard() call returns 400 "Shard 0 is already involved in transfer" if the cluster's auto-recovery loop initiates the transfer first.
After the restarted peer loads its dummy shard, Collection::recover_dirty_shards (see lib/collection/src/collection/mod.rs:826) auto-proposes a ShardTransfer from another replica. The test was then unconditionally calling recover_shard(), racing with that proposal.
The test already conditionally skips the flag_exists assertion when a transfer is in flight; now it also skips the manual recover_shard() call in the same branch. The subsequent wait_for(flag removed), wait_for_collection_shard_transfers_count, and replica/scroll assertions still verify recovery completes either way.

Timeline from failing run

From peer_3_restarted.log:

01:01:13.735 peer 3 auto-proposes recovery transfer (WalDelta, 7575… → 2573…)
01:01:13.880 transfer applied; shard Dead → Recovery
01:01:14.832 test polls /cluster (transfer visible)
01:01:14.949 test's manual POST /cluster → 400 "Shard 0 is already involved in transfer"

Test plan

CI green on test_corrupted_snapshot_recovery (consensus tests)
Local reruns with concurrent xdist load do not reproduce the 400

🤖 Generated with Claude Code

When the restarted peer's dummy shard is auto-recovered by the cluster's recovery loop before the test issues its manual `replicate_shard` call, the manual call returns 400 "already involved in transfer". Skip the manual call when a transfer is already in flight — the existing wait_for / transfer-count / replica assertions still verify the shard recovers. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

timvisee · 2026-05-12T13:32:32Z

+        # Initiate transfer from another peer to recover the shard.
+        # This part checks that dummy shard can be recovered with shard transfer.
+        # If a transfer was already auto-initiated by the cluster recovery loop,
+        # skip the manual call — it would fail with 400 "already involved in transfer".
+        recover_shard(peer_api_uris[-1], collection_name=COLLECTION_NAME)


There is no code change here

generall requested review from KShivendu and timvisee May 12, 2026 09:48

This comment was marked as resolved.

Sign in to view

github-actions Bot mentioned this pull request May 12, 2026

Flaky test hnsw_quantized_search_test::hnsw_turbo_quantization_cosine_larger_bits2_test #8835

Open

timvisee requested changes May 12, 2026

View reviewed changes

qdrant deleted a comment from coderabbitai Bot May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: fix race in test_corrupted_snapshot_recovery#9013

test: fix race in test_corrupted_snapshot_recovery#9013
generall wants to merge 1 commit into
devfrom
fix/flaky-corrupted-snapshot-recovery-test

generall commented May 12, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

timvisee May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

generall commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Timeline from failing run

Test plan

Uh oh!

This comment was marked as resolved.

Uh oh!

timvisee May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

generall commented May 12, 2026 •

edited

Loading