Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Better support for storing points in multiple lists #5

@thomasahle

Description

@thomasahle

Since 2df6a42 it is possible to store every datapoint in n lists by building with ivf.build(n_probes=n).
This increases performance recall/qps quite a lot, but only when going from n=1 to n=2, as seen in the attached figure.
Figure_1

The problem is probably that duplicate matches aren't handled well.
When calling ctop from ivf, we should somehow tell it about the indices we have already collected so the distance table can focus on telling us about alternative interesting candidates.

One option is to even reuse the query_pq_sse(transformed_data, self.tables, indices, values, True) call by calling it on multiple (transformed_data, tables) pairs while keeping (indices, values) fixed. That way we also would only do rescoring/pass-2 a single time on all candidates retrieved from different lists.

The issue is that
(1) the binary heap data structure we use can't recognize duplicates, and
(2) the query_pq_sse function only knows the "local" id of a point in a list, not the global id.

To solve (2) we could pass a list with the global ids of all the points considered. This would be some extra overhead for query_pq_sse to pass around, but perhaps not that much. And we wouldn't have to "relabel" the returned ids afterwards.

For (1) we could switch back to using insertion sort, or just try heuristically to remove some of the duplicates the heap is able to find.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions