fix: fts match query on column without inverted index #4859
Conversation
I'll rebase this PR after #4752 is merged.
Codecov Report

```
@@           Coverage Diff            @@
##             main    #4859    +/-   ##
==========================================
+ Coverage   80.87%   80.89%    +0.01%
==========================================
  Files         332      333        +1
  Lines      131687   132457      +770
  Branches   131687   132457      +770
==========================================
+ Hits       106507   107150      +643
- Misses      21430    21549      +119
- Partials     3750     3758        +8
```
Hi @jackye1995 @BubbleCal, could you help review this PR when you have time? Thank you very much!
It's exciting that we'll have this soon.

My main concern is understanding if and why `into_df_repeat_exec` is necessary.
```rust
    .unwrap();
let schema = batch.schema();
let batches = RecordBatchIterator::new(vec![batch].into_iter().map(Ok), schema);
let mut dataset = Dataset::write(batches, &test_uri, None).await.unwrap();
```
I'd recommend for most tests to just use the in-memory store:

```diff
-let mut dataset = Dataset::write(batches, &test_uri, None).await.unwrap();
+let mut dataset = Dataset::write(batches, "memory://", None).await.unwrap();
```
```rust
let title_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
let content_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
```
This could just be `StringArray`:

```diff
-let title_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
-let content_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
+let title_col = StringArray::from(Vec::<&str>::new());
+let content_col = StringArray::from(Vec::<&str>::new());
```
```rust
    .into_df_repeat_exec(RowCount::from(15), BatchCount::from(2))
    .unwrap();
```
Why was this change necessary?
I don't see it being re-used below. Unless it means the query itself has to scan the input data twice. If that's the case, it seems worrying to me.
Yes, the query itself has to scan the input data twice. When we compute the score for a document in FTS, we need some statistics over all documents:
- the number of documents
- the average document length
- the number of documents containing the query term

These statistics are stored in the `BM25Scorer` (https://github.com/lancedb/lance/pull/4859/files#diff-60d4f0739e08b11e907ca92afb4b5308809e2a4f71f53418074de3b5f3dbda46R25).

Since we want to support full text search when there is no index, or the index is empty, we can't rely on the index-based `BM25Scorer`. We have to scan once to build the `BM25Scorer`, and then scan a second time to score each document.

Thanks @wjones127 for pointing out the double-scan issue. Scanning unindexed data twice is indeed a problem that can't be ignored, so I came up with an alternative: we can build a `BM25Scorer` incrementally while scanning the unindexed data, and at scoring time combine this incremental `BM25Scorer` with the index's `BM25Scorer`. By sacrificing some score accuracy, we only need to scan the unindexed data once. Our original implementation already sacrifices some accuracy by using the index `BM25Scorer` to score unindexed data, so this solution might be feasible.
@wjones127 what do you think?
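To make the incremental idea concrete, here is a rough, self-contained sketch. All names (`Bm25Stats`, `add_doc`, `merge`, `score`) are hypothetical illustrations, not the PR's actual `BM25Scorer` API; the constants k1 = 1.2 and b = 0.75 are the usual BM25 defaults.

```rust
use std::collections::{HashMap, HashSet};

/// Corpus-level statistics BM25 needs (hypothetical sketch).
#[derive(Default, Clone)]
struct Bm25Stats {
    num_docs: usize,                  // total number of documents
    total_doc_length: usize,          // sum of document lengths, in tokens
    doc_freq: HashMap<String, usize>, // docs containing each token
}

impl Bm25Stats {
    /// Accumulate one (tokenized) document while scanning.
    fn add_doc(&mut self, tokens: &[&str]) {
        self.num_docs += 1;
        self.total_doc_length += tokens.len();
        let mut seen = HashSet::new();
        for t in tokens {
            if seen.insert(*t) {
                *self.doc_freq.entry((*t).to_string()).or_insert(0) += 1;
            }
        }
    }

    fn avg_doc_length(&self) -> f32 {
        if self.num_docs == 0 {
            return 0.0;
        }
        self.total_doc_length as f32 / self.num_docs as f32
    }

    /// Number of documents containing `token` (the `nq` from the PR).
    fn nq(&self, token: &str) -> usize {
        self.doc_freq.get(token).copied().unwrap_or(0)
    }

    /// Combine index-side stats with stats gathered incrementally while
    /// scanning unindexed data, so each row is only scanned once.
    fn merge(&mut self, other: &Bm25Stats) {
        self.num_docs += other.num_docs;
        self.total_doc_length += other.total_doc_length;
        for (t, n) in &other.doc_freq {
            *self.doc_freq.entry(t.clone()).or_insert(0) += n;
        }
    }

    /// Standard BM25 contribution of one query term to one document.
    fn score(&self, token: &str, term_freq: f32, doc_len: f32) -> f32 {
        let (k1, b) = (1.2_f32, 0.75_f32);
        let n = self.nq(token) as f32;
        let idf = (((self.num_docs as f32 - n + 0.5) / (n + 0.5)) + 1.0).ln();
        idf * term_freq * (k1 + 1.0)
            / (term_freq + k1 * (1.0 - b + b * doc_len / self.avg_doc_length()))
    }
}

fn main() {
    // Stats the index already holds.
    let mut indexed = Bm25Stats::default();
    indexed.add_doc(&["lance", "columnar", "format"]);
    indexed.add_doc(&["full", "text", "search"]);

    // Stats accumulated during a single pass over unindexed rows.
    let mut unindexed = Bm25Stats::default();
    unindexed.add_doc(&["lance", "search"]);

    indexed.merge(&unindexed);
    assert_eq!(indexed.num_docs, 3);
    assert_eq!(indexed.nq("lance"), 2);
    println!("avgdl = {}", indexed.avg_doc_length());
    println!("score = {}", indexed.score("lance", 1.0, 2.0));
}
```

The accuracy loss mentioned above shows up here: documents scored early in the scan would use a less complete `Bm25Stats` than documents scored late, unless scoring is deferred until the merge.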
```rust
    }
}

pub fn avgdl(&self) -> f32 {
```
This method name is very hard to read; I'm not even sure I got the name right. Could you make it full words?

```diff
-pub fn avgdl(&self) -> f32 {
+pub fn avg_doc_length(&self) -> f32 {
```
```rust
}

// the number of documents that contain the token
pub fn nq(&self, token: &str) -> usize {
```
Same thing here about the method name. I like that you have a comment describing it though.
```diff
-let mut tokenizer = Box::new(index.tokenizer.clone());
+let mut tokenizer = match index {
+    Some(index) => index.tokenizer(),
+    // TODO: allow users to specify a tokenizer when querying columns without an inverted index.
```
I think creating an index is how they should specify the tokenizer. Even if they just pass `train=False`, we at least get the configuration.
Closes #4780