fix: fts match query on column without inverted index #4859
Conversation
I'll rebase this PR after #4752 is merged.
Codecov Report
❌ Patch coverage is

@@           Coverage Diff            @@
##             main    #4859      +/-  ##
==========================================
+ Coverage   81.62%   81.64%    +0.01%
==========================================
  Files         333      333
  Lines      131388   131556      +168
  Branches   131388   131556      +168
==========================================
+ Hits       107242   107405      +163
  Misses      20551    20551
- Partials     3595     3600        +5
Force-pushed from 62f02c0 to 6287c7e
Hi @jackye1995 @BubbleCal, could you help review this PR when you have time? Thanks very much!
Force-pushed from 77a6cb8 to 5ce17b3
It's exciting that we'll have this soon.
My main concern is understanding if and why into_df_repeat_exec is necessary.
rust/lance/src/io/exec/fts.rs (outdated)
    .into_df_repeat_exec(RowCount::from(15), BatchCount::from(2))
    .unwrap();
Why was this change necessary?
I don't see it being re-used below. Unless it means the query itself has to scan the input data twice. If that's the case, it seems worrying to me.
Yes, the query itself has to scan the input data twice. When we compute the score for a document in FTS, we need some statistics over all documents:
- number of documents
- average document length
- number of documents containing the query term.
These statistics are stored in the BM25Scorer (https://github.com/lancedb/lance/pull/4859/files#diff-60d4f0739e08b11e907ca92afb4b5308809e2a4f71f53418074de3b5f3dbda46R25).
Since we want to support full text search when there is no index or the index is empty, we can't rely on the index-backed BM25Scorer. We have to scan once to build the BM25Scorer, and then scan a second time to score each document.
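To make the role of those three statistics concrete, here is a minimal sketch of a standard BM25 scoring function over them. This is a hypothetical illustration, not Lance's actual BM25Scorer; the struct, field names, and the k1/b constants are assumptions:

```rust
// Hypothetical corpus statistics, mirroring the three quantities listed above.
struct Bm25Stats {
    num_docs: u64,    // number of documents
    avg_doc_len: f32, // average document length
    doc_freq: u64,    // number of documents containing the query term
}

// Common BM25 defaults; real implementations often make these tunable.
const K1: f32 = 1.2;
const B: f32 = 0.75;

// Standard BM25 score of one document for one query term.
fn bm25_score(stats: &Bm25Stats, term_freq: u32, doc_len: u32) -> f32 {
    let idf = (((stats.num_docs as f32 - stats.doc_freq as f32 + 0.5)
        / (stats.doc_freq as f32 + 0.5))
        + 1.0)
        .ln();
    let tf = term_freq as f32;
    let norm = K1 * (1.0 - B + B * doc_len as f32 / stats.avg_doc_len);
    idf * tf * (K1 + 1.0) / (tf + norm)
}
```

Note that every term of the formula depends on the corpus-wide statistics, which is exactly why a scorer cannot emit final scores until the whole corpus has been seen.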
Thanks @wjones127 for pointing out the double-scan issue. Scanning unindexed data twice is indeed a problem that can't be ignored, so I came up with an alternative solution:
we can incrementally build the BM25Scorer's statistics while scanning the unindexed data, and combine that incremental BM25Scorer with the index BM25Scorer at scoring time. By sacrificing some score accuracy, we only need to scan the unindexed data once. Our original implementation already uses the index BM25Scorer to score unindexed data, which sacrifices some accuracy anyway, so this solution might be feasible.
@wjones127 what do you think?
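Since the BM25 statistics are all sums and counts, combining an index-derived scorer with an incrementally built one reduces to adding the counters. A minimal sketch of that idea, with hypothetical names (not the PR's actual types):

```rust
// Hypothetical accumulator for the BM25 corpus statistics.
#[derive(Default)]
struct Bm25Stats {
    num_docs: u64,      // number of documents seen
    total_doc_len: u64, // sum of document lengths (avg = total / count)
    doc_freq: u64,      // documents containing the query term
}

impl Bm25Stats {
    // Update incrementally while scanning one unindexed document.
    fn observe(&mut self, doc_len: u32, contains_term: bool) {
        self.num_docs += 1;
        self.total_doc_len += doc_len as u64;
        if contains_term {
            self.doc_freq += 1;
        }
    }

    // Combine index-derived statistics with the incremental scan statistics.
    fn merge(&self, other: &Bm25Stats) -> Bm25Stats {
        Bm25Stats {
            num_docs: self.num_docs + other.num_docs,
            total_doc_len: self.total_doc_len + other.total_doc_len,
            doc_freq: self.doc_freq + other.doc_freq,
        }
    }

    fn avg_doc_len(&self) -> f32 {
        self.total_doc_len as f32 / self.num_docs as f32
    }
}
```

The accuracy loss comes from the ordering, not the merge itself: documents scanned early are scored against statistics that don't yet include documents scanned later.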
If we only have to scan twice when there is no index data, then I think that's acceptable.
Hi @wjones127, I updated this PR with the new approach, which only scans once. Please help review when you have time, thanks very much!
Force-pushed from 751e4ea to 71c21f0
Force-pushed from 71c21f0 to 00ecf52
This seems like a good solution for now.
One thing we could consider for doing this in a single pass is having the scorer collect all unindexed documents in memory until all of them have been processed, and then assigning scores afterwards. We could spill to disk if the buffer becomes large.
But I think we can leave that for a future enhancement.
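The suggested single-pass shape can be sketched as: buffer the unindexed rows while accumulating statistics from the same stream, then score the buffer once statistics are complete. This is a hypothetical illustration (the struct, names, and the simplified scoring formula are all assumptions, and spilling is omitted):

```rust
// Hypothetical buffered row: enough per-document state to score later.
struct BufferedDoc {
    row_id: u64,
    term_freq: u32,
    doc_len: u32,
}

// Single pass over the input stream: buffer rows and accumulate statistics,
// then score the in-memory buffer once the statistics are complete.
fn buffer_then_score(docs: impl Iterator<Item = BufferedDoc>) -> Vec<(u64, f32)> {
    let mut buffer = Vec::new();
    let (mut num_docs, mut total_len) = (0u64, 0u64);
    for d in docs {
        num_docs += 1;
        total_len += d.doc_len as u64;
        buffer.push(d);
    }
    let avg_len = total_len as f32 / num_docs.max(1) as f32;
    buffer
        .into_iter()
        .map(|d| {
            // Simplified length-normalized tf; a real scorer would use BM25.
            let score = d.term_freq as f32 / (d.doc_len as f32 / avg_len);
            (d.row_id, score)
        })
        .collect()
}
```

Unlike the two-scan approach, this reads the input once but holds all unindexed rows in memory until scoring, which is why spilling would be needed to bound memory on large unindexed segments.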
Closes #4780