Depends on #3371.
Is your feature request related to a problem? Please describe.
In #2840 we added checksums for snapshot files. The implementation is somewhat limited, however, and requires further integration to make full use of it.
Since Qdrant 1.7, different shard transfer methods are supported. Snapshot transfers were added to make these transfers more capable by utilizing snapshots.
One problem with this approach is that we have no integrity checks for the actual snapshot files. If such a file becomes corrupted, Qdrant will happily restore it, possibly resulting in a broken shard.
Describe the solution you'd like
When a shard snapshot transfer happens, we should check the integrity of the snapshot file by verifying the attached checksum. Since #2840, the checksum is attached to the `SnapshotDescription` object.
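A minimal sketch of what such a verification step could look like, assuming the checksum is exchanged as a hex string. The function names are hypothetical, and the 64-bit FNV-1a digest is only a stand-in to keep the sketch dependency-free; a real implementation would have to use the same hash that #2840 uses to produce the checksum.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Stand-in digest (64-bit FNV-1a), NOT the hash used by Qdrant.
/// It streams the file so large snapshots are not loaded into memory at once.
fn file_digest_hex(path: &Path) -> io::Result<String> {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = [0u8; 8192];
    let mut hash = FNV_OFFSET;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        for &byte in &buf[..n] {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(FNV_PRIME);
        }
    }
    Ok(format!("{hash:016x}"))
}

/// Hypothetical helper: true if the file's digest equals `expected_hex`,
/// ignoring ASCII case and surrounding whitespace.
fn checksum_matches(path: &Path, expected_hex: &str) -> io::Result<bool> {
    Ok(file_digest_hex(path)?.eq_ignore_ascii_case(expected_hex.trim()))
}
```

The comparison is deliberately kept separate from the digest computation, so the hash function can be swapped without touching the callers.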
#3371 will implement a `checksum` field in snapshot recovery endpoints. We'll have to wait for this to be implemented so that we can use it in the snapshot transfer process.
The right approach is probably to pass the checksum along in the recovery call here:
qdrant/lib/collection/src/shards/transfer/snapshot.rs, lines 215 to 228 in c6a351c:

```rust
log::trace!("Transferring and recovering shard {shard_id} snapshot on peer {remote_peer_id}");
remote_shard
    .recover_shard_snapshot_from_url(
        collection_name,
        shard_id,
        &shard_download_url,
        SnapshotPriority::ShardTransfer,
    )
    .await
    .map_err(|err| {
        CollectionError::service_error(format!(
            "Failed to recover shard snapshot on remote: {err}"
        ))
    })?;
```
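To illustrate the idea of passing the checksum along, here is a rough sketch of how the recovery request could carry it. All types below are simplified, hypothetical stand-ins rather than Qdrant's actual definitions; the real shape of the `checksum` field depends on what #3371 ends up implementing.

```rust
/// Simplified stand-in for the snapshot description; since #2840 it carries a checksum.
#[derive(Debug, Clone)]
struct SnapshotDescription {
    name: String,
    checksum: Option<String>,
}

/// Simplified stand-in for the priority enum used in the call above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotPriority {
    ShardTransfer,
}

/// Hypothetical recovery request that includes an optional checksum.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ShardSnapshotRecover {
    location: String,
    priority: SnapshotPriority,
    checksum: Option<String>,
}

/// Builds the recovery request, forwarding the checksum when one is attached.
fn build_recover_request(
    snapshot: &SnapshotDescription,
    download_url: &str,
) -> ShardSnapshotRecover {
    ShardSnapshotRecover {
        location: download_url.to_string(),
        priority: SnapshotPriority::ShardTransfer,
        // `None` lets the remote skip verification for snapshots without a checksum.
        checksum: snapshot.checksum.clone(),
    }
}
```

Keeping the field optional means snapshots created before checksums existed can still be transferred, at the cost of skipping verification for them.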
If checksum verification on the remote node fails, we should clean up the snapshot file and return with an error. Cleaning up on the remote is probably already handled with #3371.
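The failure path could be sketched roughly as follows. The error type and function name are hypothetical, and the digest function is passed in as a parameter so the sketch does not assume a particular hash algorithm.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type for this sketch.
#[derive(Debug)]
enum RecoverError {
    Io(io::Error),
    ChecksumMismatch { expected: String, actual: String },
}

impl fmt::Display for RecoverError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RecoverError::Io(err) => write!(f, "I/O error: {err}"),
            RecoverError::ChecksumMismatch { expected, actual } => {
                write!(f, "checksum mismatch: expected {expected}, got {actual}")
            }
        }
    }
}

impl std::error::Error for RecoverError {}

impl From<io::Error> for RecoverError {
    fn from(err: io::Error) -> Self {
        RecoverError::Io(err)
    }
}

/// Verifies the downloaded snapshot and deletes it if the checksum does not match.
/// `digest` computes the hex digest of a file; which hash it uses is left open here.
fn verify_or_cleanup<F>(path: &Path, expected: &str, digest: F) -> Result<(), RecoverError>
where
    F: Fn(&Path) -> io::Result<String>,
{
    let actual = digest(path)?;
    if actual.eq_ignore_ascii_case(expected) {
        return Ok(());
    }
    // Remove the corrupted file before reporting the error.
    // Note: if removal itself fails, that I/O error is returned instead of the mismatch.
    fs::remove_file(path)?;
    Err(RecoverError::ChecksumMismatch {
        expected: expected.to_string(),
        actual,
    })
}
```

The file is left untouched when the digest cannot be computed at all; whether that case should also trigger cleanup is a design choice this sketch leaves open.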
Additional context
There's other work to be done to properly integrate checksums, but that will be handled in different issues/PRs.