Better support for storing points in multiple lists

Since 2df6a428cf6bcc4e4a08f15e3f7caef9ce5f4f61 it is possible to store every datapoint in `n` lists by building with `ivf.build(n_probes=n)`.
This increases performance recall/qps quite a lot, but only when going from `n=1` to `n=2`, as seen in the attached figure.
![Figure_1](https://user-images.githubusercontent.com/946355/231255348-a75439d9-a66d-4613-b194-9f51dde8574b.png)

The problem is probably that duplicate matches aren't handled well.
When calling `ctop` from ivf, we should somehow tell it about the indices we have already collected so the distance table can focus on telling us about alternative interesting candidates.

One option is to even reuse the `query_pq_sse(transformed_data, self.tables, indices, values, True)` call by calling it on multiple `(transformed_data, tables)` pairs while keeping `(indices, values)` fixed. That way we also would only do rescoring/pass-2 a single time on all candidates retrieved from different lists.

The issue is that
(1) the binary heap data structure we use can't recognize duplicates, and
(2) the `query_pq_sse` function only knows the "local" id of a point in a list, not the global id.

To solve (2) we could pass a list with the global ids of all the points considered. This would be some extra overhead for `query_pq_sse` to pass around, but perhaps not that much. And we wouldn't have to "relabel" the returned ids afterwards.

For (1) we could switch back to using insertion sort, or just try heuristically to remove some of the duplicates the heap is able to find.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Better support for storing points in multiple lists #5

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Better support for storing points in multiple lists #5

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions