fix: fts match query on column without inverted index #4859
Conversation
I'll rebase this PR after #4752 is merged.
Codecov Report

```
@@           Coverage Diff            @@
##             main    #4859    +/-   ##
==========================================
+ Coverage   80.87%   80.89%    +0.01%
==========================================
  Files         332      333        +1
  Lines      131687   132457      +770
  Branches   131687   132457      +770
==========================================
+ Hits       106507   107150      +643
- Misses      21430    21549      +119
- Partials     3750     3758        +8
```
Hi @jackye1995 @BubbleCal, could you help review this PR when you have time? Thank you very much!
It's exciting that we'll have this soon.

My main concern is understanding if and why `into_df_repeat_exec` is necessary.
```rust
    .unwrap();
let schema = batch.schema();
let batches = RecordBatchIterator::new(vec![batch].into_iter().map(Ok), schema);
let mut dataset = Dataset::write(batches, &test_uri, None).await.unwrap();
```
I'd recommend for most tests to just use the in-memory store:

```diff
-let mut dataset = Dataset::write(batches, &test_uri, None).await.unwrap();
+let mut dataset = Dataset::write(batches, "memory://", None).await.unwrap();
```
```rust
let title_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
let content_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
```
This could just be `StringArray`:

```diff
-let title_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
-let content_col = GenericStringArray::<i32>::from(Vec::<&str>::new());
+let title_col = StringArray::from(Vec::<&str>::new());
+let content_col = StringArray::from(Vec::<&str>::new());
```
```rust
    .into_df_repeat_exec(RowCount::from(15), BatchCount::from(2))
    .unwrap();
```
Why was this change necessary?
I don't see it being re-used below. Unless it means the query itself has to scan the input data twice. If that's the case, it seems worrying to me.
Yes, the query itself has to scan the input data twice. When we compute the score for a document in FTS, we need some statistics over all documents:
- the number of documents
- the average document length
- the number of documents containing the query term

These statistics are stored in the `BM25Scorer` (https://github.com/lancedb/lance/pull/4859/files#diff-60d4f0739e08b11e907ca92afb4b5308809e2a4f71f53418074de3b5f3dbda46R25).

Since we want to support full text search when there is no index, or the index is empty, we can't rely on the index-based `BM25Scorer`. We have to scan once to build the `BM25Scorer`, and then scan a second time to score each document.

Thanks @wjones127 for pointing out the double-scan issue. Scanning unindexed data twice is indeed a problem that can't be ignored, so I came up with an alternative: we can build a `BM25Scorer` incrementally while scanning the unindexed data, and at scoring time combine this incremental `BM25Scorer` with the index's `BM25Scorer`. By sacrificing some score accuracy, we only need to scan the unindexed data once. Our original implementation already sacrifices some accuracy by using the index `BM25Scorer` to score unindexed data, so this solution might be feasible.
@wjones127 what do you think?
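To make the incremental idea concrete, here is a rough, self-contained sketch. All names (`Bm25Stats`, `add_doc`, `merge`, `score`) are hypothetical illustrations, not the PR's actual `BM25Scorer` API; the constants k1 = 1.2 and b = 0.75 are the usual BM25 defaults.

```rust
use std::collections::{HashMap, HashSet};

/// Corpus-level statistics BM25 needs (hypothetical sketch).
#[derive(Default, Clone)]
struct Bm25Stats {
    num_docs: usize,                  // total number of documents
    total_doc_length: usize,          // sum of document lengths, in tokens
    doc_freq: HashMap<String, usize>, // docs containing each token
}

impl Bm25Stats {
    /// Accumulate one (tokenized) document while scanning.
    fn add_doc(&mut self, tokens: &[&str]) {
        self.num_docs += 1;
        self.total_doc_length += tokens.len();
        let mut seen = HashSet::new();
        for t in tokens {
            if seen.insert(*t) {
                *self.doc_freq.entry((*t).to_string()).or_insert(0) += 1;
            }
        }
    }

    fn avg_doc_length(&self) -> f32 {
        if self.num_docs == 0 {
            return 0.0;
        }
        self.total_doc_length as f32 / self.num_docs as f32
    }

    /// Number of documents containing `token` (the `nq` from the PR).
    fn nq(&self, token: &str) -> usize {
        self.doc_freq.get(token).copied().unwrap_or(0)
    }

    /// Combine index-side stats with stats gathered incrementally while
    /// scanning unindexed data, so each row is only scanned once.
    fn merge(&mut self, other: &Bm25Stats) {
        self.num_docs += other.num_docs;
        self.total_doc_length += other.total_doc_length;
        for (t, n) in &other.doc_freq {
            *self.doc_freq.entry(t.clone()).or_insert(0) += n;
        }
    }

    /// Standard BM25 contribution of one query term to one document.
    fn score(&self, token: &str, term_freq: f32, doc_len: f32) -> f32 {
        let (k1, b) = (1.2_f32, 0.75_f32);
        let n = self.nq(token) as f32;
        let idf = (((self.num_docs as f32 - n + 0.5) / (n + 0.5)) + 1.0).ln();
        idf * term_freq * (k1 + 1.0)
            / (term_freq + k1 * (1.0 - b + b * doc_len / self.avg_doc_length()))
    }
}

fn main() {
    // Stats the index already holds.
    let mut indexed = Bm25Stats::default();
    indexed.add_doc(&["lance", "columnar", "format"]);
    indexed.add_doc(&["full", "text", "search"]);

    // Stats accumulated during a single pass over unindexed rows.
    let mut unindexed = Bm25Stats::default();
    unindexed.add_doc(&["lance", "search"]);

    indexed.merge(&unindexed);
    assert_eq!(indexed.num_docs, 3);
    assert_eq!(indexed.nq("lance"), 2);
    println!("avgdl = {}", indexed.avg_doc_length());
    println!("score = {}", indexed.score("lance", 1.0, 2.0));
}
```

The accuracy loss mentioned above shows up here: documents scored early in the scan would use a less complete `Bm25Stats` than documents scored late, unless scoring is deferred until the merge.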
```rust
    }
}

pub fn avgdl(&self) -> f32 {
```
This method name is very hard to read; I'm not even sure I got the name right. Could you make it full words?

```diff
-pub fn avgdl(&self) -> f32 {
+pub fn avg_doc_length(&self) -> f32 {
```
```rust
}

// the number of documents that contain the token
pub fn nq(&self, token: &str) -> usize {
```
Same thing here about the method name. I like that you have a comment describing it though.
```diff
-let mut tokenizer = Box::new(index.tokenizer.clone());
+let mut tokenizer = match index {
+    Some(index) => index.tokenizer(),
+    // TODO: allow users to specify a tokenizer when querying columns without an inverted index.
```
I think creating an index is how they should specify the tokenizer. Even if they just pass `train=False`, we at least get the configuration.
Closes #4780