Conversation

wojiaodoubao
Contributor

Close #4780

@wojiaodoubao wojiaodoubao marked this pull request as draft September 30, 2025 12:51
@github-actions github-actions bot added bug Something isn't working python labels Sep 30, 2025
@wojiaodoubao
Contributor Author

I'll rebase this PR after #4752 is merged.

@codecov-commenter

codecov-commenter commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 88.00000% with 72 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.89%. Comparing base (492e773) to head (5ce17b3).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                          Patch %   Lines
rust/lance-datafusion/src/exec.rs                 39.68%    38 Missing ⚠️
rust/lance-index/src/scalar/inverted/scorer.rs    77.96%    13 Missing ⚠️
rust/lance/src/dataset/scanner.rs                 86.41%    4 Missing and 7 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs     97.48%    3 Missing and 2 partials ⚠️
rust/lance-datafusion/src/datagen.rs              85.71%    1 Missing and 2 partials ⚠️
rust/lance/src/io/exec/fts.rs                     91.66%    0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4859      +/-   ##
==========================================
+ Coverage   80.87%   80.89%   +0.01%     
==========================================
  Files         332      333       +1     
  Lines      131687   132457     +770     
  Branches   131687   132457     +770     
==========================================
+ Hits       106507   107150     +643     
- Misses      21430    21549     +119     
- Partials     3750     3758       +8     
Flag Coverage Δ
unittests 80.89% <88.00%> (+0.01%) ⬆️

@wojiaodoubao wojiaodoubao force-pushed the fts-flat-match-fix branch 4 times, most recently from 62f02c0 to 6287c7e Compare October 1, 2025 10:09
@wojiaodoubao wojiaodoubao marked this pull request as ready for review October 1, 2025 11:58
@wojiaodoubao
Contributor Author

Hi @jackye1995 @BubbleCal, could you help review this PR when you have time? Thank you very much!

@wjones127 wjones127 self-assigned this Oct 1, 2025
Contributor

@wjones127 wjones127 left a comment

It's exciting that we'll have this soon.

My main concern is understanding if and why into_df_repeat_exec is necessary.

.unwrap();
let schema = batch.schema();
let batches = RecordBatchIterator::new(vec![batch].into_iter().map(Ok), schema);
let mut dataset = Dataset::write(batches, &test_uri, None).await.unwrap();
Contributor

I'd recommend for most tests to just use the in-memory store:

Suggested change
- let mut dataset = Dataset::write(batches, &test_uri, None).await.unwrap();
+ let mut dataset = Dataset::write(batches, "memory://", None).await.unwrap();

Comment on lines +5777 to +5778
let title_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
let content_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
Contributor

This could just be StringArray:

Suggested change
- let title_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
- let content_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
+ let title_col = StringArray::from(Vec::<&str>::new());
+ let content_col = StringArray::from(Vec::<&str>::new());

Comment on lines +1139 to +1140
.into_df_repeat_exec(RowCount::from(15), BatchCount::from(2))
.unwrap();
Contributor

Why was this change necessary?

I don't see it being reused below, unless it means the query itself has to scan the input data twice. If that's the case, it seems worrying to me.

Contributor Author

Yes, the query itself has to scan the input data twice. When we compute the score for a document in FTS, we need some statistics over all documents:

  • number of documents
  • average document length
  • number of documents containing the query term.

These statistics are stored in BM25Scorer (https://github.com/lancedb/lance/pull/4859/files#diff-60d4f0739e08b11e907ca92afb4b5308809e2a4f71f53418074de3b5f3dbda46R25).
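
For reference, here is how the three statistics enter the textbook Okapi BM25 formula. This is the standard form, not necessarily the exact variant used in this PR; the constants k_1 and b and the smoothing inside the IDF term are assumptions. N is the number of documents, avgdl is the average document length, and n(q) is the number of documents containing the query term q.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```

Here f(q, D) is the number of times q occurs in document D and |D| is the length of D in tokens.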

Since we want to support full-text search when there is no index or the index is empty, we can't rely on a BM25Scorer built from the index. We have to scan once to build the BM25Scorer, and then scan a second time to score each document.
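
A minimal sketch of the two-pass approach described above, using only the standard library. `CorpusStats`, `two_pass_scores`, whitespace tokenization, and the constants `K1` and `B` are all hypothetical stand-ins for illustration; this is not the actual `BM25Scorer` code from the PR.

```rust
use std::collections::{HashMap, HashSet};

/// Corpus-wide statistics needed for BM25 (hypothetical, not Lance's BM25Scorer).
#[derive(Debug, Default)]
struct CorpusStats {
    num_docs: usize,
    total_tokens: usize,
    /// token -> number of documents containing it
    doc_freq: HashMap<String, usize>,
}

impl CorpusStats {
    /// Fold one document into the statistics (whitespace tokenizer is an assumption).
    fn observe(&mut self, doc: &str) {
        let tokens: Vec<&str> = doc.split_whitespace().collect();
        self.num_docs += 1;
        self.total_tokens += tokens.len();
        // Count each distinct token once per document.
        let unique: HashSet<&str> = tokens.into_iter().collect();
        for t in unique {
            *self.doc_freq.entry(t.to_string()).or_insert(0) += 1;
        }
    }

    fn avg_doc_length(&self) -> f32 {
        if self.num_docs == 0 {
            0.0
        } else {
            self.total_tokens as f32 / self.num_docs as f32
        }
    }

    fn num_docs_containing(&self, token: &str) -> usize {
        self.doc_freq.get(token).copied().unwrap_or(0)
    }

    /// Textbook BM25 score of one document; K1 and B are assumed defaults.
    fn score(&self, doc: &str, query: &[&str]) -> f32 {
        const K1: f32 = 1.2;
        const B: f32 = 0.75;
        let tokens: Vec<&str> = doc.split_whitespace().collect();
        let avgdl = self.avg_doc_length();
        if avgdl == 0.0 {
            return 0.0;
        }
        let len_norm = 1.0 - B + B * tokens.len() as f32 / avgdl;
        query
            .iter()
            .map(|q| {
                let tf = tokens.iter().filter(|t| *t == q).count() as f32;
                let n = self.num_docs_containing(q) as f32;
                let idf = ((self.num_docs as f32 - n + 0.5) / (n + 0.5) + 1.0).ln();
                idf * tf * (K1 + 1.0) / (tf + K1 * len_norm)
            })
            .sum()
    }
}

/// Pass 1 gathers statistics over all documents; pass 2 scores each document.
fn two_pass_scores(docs: &[&str], query: &[&str]) -> Vec<f32> {
    let mut stats = CorpusStats::default();
    for d in docs {
        stats.observe(d);
    }
    docs.iter().map(|d| stats.score(d, query)).collect()
}
```

For example, `two_pass_scores(&["rust is fast", "python is slow"], &["rust"])` gives a positive score for the first document and `0.0` for the second. The cost is that every document is read twice, which is the concern raised in this thread.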

Thanks @wjones127 for pointing out the scan-twice issue. Scanning unindexed data twice is indeed an issue that cannot be ignored, so I came up with an alternative solution:
We can incrementally compute the BM25Scorer while scanning the unindexed data. When scoring, we can combine the incremental BM25Scorer with the index BM25Scorer. By sacrificing some score accuracy, we only need to scan the unindexed data once. In our original implementation, we use the index BM25Scorer for scoring unindexed data, which already sacrifices some accuracy. Therefore, this solution might be feasible.
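
One way the single-pass alternative could look, as a rough sketch under stated assumptions: each unindexed document is folded into a running set of statistics and then scored immediately against the sum of the index statistics and the statistics seen so far. `Bm25Stats`, `single_pass_scores`, the per-document update granularity, whitespace tokenization, and the constants are all hypothetical; the PR may combine the scorers differently (for example per batch).

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical container for BM25 statistics (not Lance's BM25Scorer).
#[derive(Clone, Debug, Default)]
struct Bm25Stats {
    num_docs: usize,
    total_tokens: usize,
    /// token -> number of documents containing it
    doc_freq: HashMap<String, usize>,
}

impl Bm25Stats {
    /// Fold one tokenized document into the statistics.
    fn observe(&mut self, tokens: &[&str]) {
        self.num_docs += 1;
        self.total_tokens += tokens.len();
        let unique: HashSet<&str> = tokens.iter().copied().collect();
        for t in unique {
            *self.doc_freq.entry(t.to_string()).or_insert(0) += 1;
        }
    }

    fn df(&self, token: &str) -> usize {
        self.doc_freq.get(token).copied().unwrap_or(0)
    }
}

/// Score one document against the combined (index + seen-so-far) statistics.
fn score(index: &Bm25Stats, seen: &Bm25Stats, tokens: &[&str], query: &[&str]) -> f32 {
    const K1: f32 = 1.2; // assumed default
    const B: f32 = 0.75; // assumed default
    let num_docs = (index.num_docs + seen.num_docs) as f32;
    let total_tokens = (index.total_tokens + seen.total_tokens) as f32;
    if num_docs == 0.0 || total_tokens == 0.0 {
        return 0.0;
    }
    let avgdl = total_tokens / num_docs;
    let len_norm = 1.0 - B + B * tokens.len() as f32 / avgdl;
    query
        .iter()
        .map(|q| {
            let tf = tokens.iter().filter(|t| *t == q).count() as f32;
            let n = (index.df(q) + seen.df(q)) as f32;
            let idf = ((num_docs - n + 0.5) / (n + 0.5) + 1.0).ln();
            idf * tf * (K1 + 1.0) / (tf + K1 * len_norm)
        })
        .sum()
}

/// Single scan: update the incremental statistics, then score right away.
/// Earlier documents see fewer statistics than later ones, which is the
/// accuracy trade-off mentioned in the comment.
fn single_pass_scores(index: &Bm25Stats, unindexed: &[&str], query: &[&str]) -> Vec<f32> {
    let mut seen = Bm25Stats::default();
    unindexed
        .iter()
        .map(|doc| {
            let tokens: Vec<&str> = doc.split_whitespace().collect();
            seen.observe(&tokens);
            score(index, &seen, &tokens, query)
        })
        .collect()
}
```

With an empty index (`Bm25Stats::default()`), the scores depend only on the documents scanned so far, so the same document can score differently depending on where it appears in the scan.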

@wjones127 what do you think?

}
}

pub fn avgdl(&self) -> f32 {
Contributor

This method name is very hard to read. Not even sure if I got the name right? Could you make it full words?

Suggested change
- pub fn avgdl(&self) -> f32 {
+ pub fn avg_doc_length(&self) -> f32 {

}

// the number of documents that contain the token
pub fn nq(&self, token: &str) -> usize {
Contributor

Same thing here about the method name. I like that you have a comment describing it though.

let mut tokenizer = Box::new(index.tokenizer.clone());
let mut tokenizer = match index {
Some(index) => index.tokenizer(),
// TODO: allow users to specify a tokenizer when querying columns without an inverted index.
Contributor

I think creating an index is how they should specify the tokenizer. Even if they just pass train=False, we at least get the configuration.

Labels
bug Something isn't working python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow querying empty FTS index
3 participants