fix: fts match query on column without inverted index #4859
Conversation
I'll rebase this PR after #4752 is merged.
Codecov Report
❌ Patch coverage is

@@           Coverage Diff            @@
##             main    #4859      +/-  ##
==========================================
+ Coverage   81.62%   81.64%    +0.01%
==========================================
  Files         333      333
  Lines      131388   131556      +168
  Branches   131388   131556      +168
==========================================
+ Hits       107242   107405      +163
  Misses      20551    20551
- Partials     3595     3600        +5
Force-pushed from 62f02c0 to 6287c7e
Hi @jackye1995 @BubbleCal, could you help review this PR when you have time? Thanks very much!
Force-pushed from 77a6cb8 to 5ce17b3
It's exciting that we'll have this soon.
My main concern is understanding if and why into_df_repeat_exec is necessary.
rust/lance/src/io/exec/fts.rs (outdated)
    .into_df_repeat_exec(RowCount::from(15), BatchCount::from(2))
    .unwrap();
Why was this change necessary?
I don't see it being re-used below. Unless it means the query itself has to scan the input data twice. If that's the case, it seems worrying to me.
Yes, the query itself has to scan the input data twice. When we compute the score for a document in FTS, we need some statistics over all documents:
- number of documents
- average document length
- number of documents containing the query term.
These statistics are stored in the BM25Scorer (https://github.com/lancedb/lance/pull/4859/files#diff-60d4f0739e08b11e907ca92afb4b5308809e2a4f71f53418074de3b5f3dbda46R25).
Since we want to support full text search when there is no index or the index is empty, we can't rely on the index-backed BM25Scorer. We have to scan once to build the BM25Scorer, and then scan a second time to score each document.
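To make the role of those three statistics concrete, here is a minimal sketch of a standard BM25 scoring function over them. This is a hypothetical illustration, not Lance's actual BM25Scorer; the struct, field names, and the k1/b constants are assumptions:

```rust
// Hypothetical corpus statistics, mirroring the three quantities listed above.
struct Bm25Stats {
    num_docs: u64,    // number of documents
    avg_doc_len: f32, // average document length
    doc_freq: u64,    // number of documents containing the query term
}

// Common BM25 defaults; real implementations often make these tunable.
const K1: f32 = 1.2;
const B: f32 = 0.75;

// Standard BM25 score of one document for one query term.
fn bm25_score(stats: &Bm25Stats, term_freq: u32, doc_len: u32) -> f32 {
    let idf = (((stats.num_docs as f32 - stats.doc_freq as f32 + 0.5)
        / (stats.doc_freq as f32 + 0.5))
        + 1.0)
        .ln();
    let tf = term_freq as f32;
    let norm = K1 * (1.0 - B + B * doc_len as f32 / stats.avg_doc_len);
    idf * tf * (K1 + 1.0) / (tf + norm)
}
```

Note that every term of the formula depends on the corpus-wide statistics, which is exactly why a scorer cannot emit final scores until the whole corpus has been seen.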
Thanks @wjones127 for pointing out the double-scan issue. Scanning unindexed data twice is indeed a problem that can't be ignored, so I came up with an alternative solution:
we can incrementally build the BM25Scorer's statistics while scanning the unindexed data, and combine that incremental BM25Scorer with the index BM25Scorer at scoring time. By sacrificing some score accuracy, we only need to scan the unindexed data once. Our original implementation already uses the index BM25Scorer to score unindexed data, which sacrifices some accuracy anyway, so this solution might be feasible.
@wjones127 what do you think?
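Since the BM25 statistics are all sums and counts, combining an index-derived scorer with an incrementally built one reduces to adding the counters. A minimal sketch of that idea, with hypothetical names (not the PR's actual types):

```rust
// Hypothetical accumulator for the BM25 corpus statistics.
#[derive(Default)]
struct Bm25Stats {
    num_docs: u64,      // number of documents seen
    total_doc_len: u64, // sum of document lengths (avg = total / count)
    doc_freq: u64,      // documents containing the query term
}

impl Bm25Stats {
    // Update incrementally while scanning one unindexed document.
    fn observe(&mut self, doc_len: u32, contains_term: bool) {
        self.num_docs += 1;
        self.total_doc_len += doc_len as u64;
        if contains_term {
            self.doc_freq += 1;
        }
    }

    // Combine index-derived statistics with the incremental scan statistics.
    fn merge(&self, other: &Bm25Stats) -> Bm25Stats {
        Bm25Stats {
            num_docs: self.num_docs + other.num_docs,
            total_doc_len: self.total_doc_len + other.total_doc_len,
            doc_freq: self.doc_freq + other.doc_freq,
        }
    }

    fn avg_doc_len(&self) -> f32 {
        self.total_doc_len as f32 / self.num_docs as f32
    }
}
```

The accuracy loss comes from the ordering, not the merge itself: documents scanned early are scored against statistics that don't yet include documents scanned later.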
If we only have to scan twice when there is no index data, then I think that's acceptable.
Hi @wjones127, I updated this PR with the new approach, which only scans once. Please help review when you have time, thanks very much!
Force-pushed from 751e4ea to 71c21f0
Force-pushed from 71c21f0 to 00ecf52
This seems like a good solution for now.
One thing we could consider for doing this in a single pass is having the scorer collect all unindexed documents in memory until all of them have been processed, and then assigning scores afterwards. We could spill to disk if the buffer becomes large.
But I think we can leave that for a future enhancement.
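The suggested single-pass shape can be sketched as: buffer the unindexed rows while accumulating statistics from the same stream, then score the buffer once statistics are complete. This is a hypothetical illustration (the struct, names, and the simplified scoring formula are all assumptions, and spilling is omitted):

```rust
// Hypothetical buffered row: enough per-document state to score later.
struct BufferedDoc {
    row_id: u64,
    term_freq: u32,
    doc_len: u32,
}

// Single pass over the input stream: buffer rows and accumulate statistics,
// then score the in-memory buffer once the statistics are complete.
fn buffer_then_score(docs: impl Iterator<Item = BufferedDoc>) -> Vec<(u64, f32)> {
    let mut buffer = Vec::new();
    let (mut num_docs, mut total_len) = (0u64, 0u64);
    for d in docs {
        num_docs += 1;
        total_len += d.doc_len as u64;
        buffer.push(d);
    }
    let avg_len = total_len as f32 / num_docs.max(1) as f32;
    buffer
        .into_iter()
        .map(|d| {
            // Simplified length-normalized tf; a real scorer would use BM25.
            let score = d.term_freq as f32 / (d.doc_len as f32 / avg_len);
            (d.row_id, score)
        })
        .collect()
}
```

Unlike the two-scan approach, this reads the input once but holds all unindexed rows in memory until scoring, which is why spilling would be needed to bound memory on large unindexed segments.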
Closes #4780