
Conversation

wojiaodoubao
Contributor

Close #4780

@wojiaodoubao wojiaodoubao marked this pull request as draft September 30, 2025 12:51
@github-actions github-actions bot added bug Something isn't working python labels Sep 30, 2025
@wojiaodoubao
Contributor Author

I'll rebase this PR after #4752 is merged.

@codecov-commenter

codecov-commenter commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 95.69536% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.64%. Comparing base (ff69239) to head (00ecf52).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/scanner.rs | 86.41% | 4 Missing and 7 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 97.22% | 0 Missing and 1 partial ⚠️ |
| rust/lance/src/io/exec/fts.rs | 88.88% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4859      +/-   ##
==========================================
+ Coverage   81.62%   81.64%   +0.01%     
==========================================
  Files         333      333              
  Lines      131388   131556     +168     
  Branches   131388   131556     +168     
==========================================
+ Hits       107242   107405     +163     
  Misses      20551    20551              
- Partials     3595     3600       +5     
Flag Coverage Δ
unittests 81.64% <95.69%> (+0.01%) ⬆️


@wojiaodoubao wojiaodoubao force-pushed the fts-flat-match-fix branch 4 times, most recently from 62f02c0 to 6287c7e Compare October 1, 2025 10:09
@wojiaodoubao wojiaodoubao marked this pull request as ready for review October 1, 2025 11:58
@wojiaodoubao
Contributor Author

Hi @jackye1995 @BubbleCal, could you help review this PR when you have time? Thanks very much!

@wjones127 wjones127 self-assigned this Oct 1, 2025
@wojiaodoubao wojiaodoubao force-pushed the fts-flat-match-fix branch 2 times, most recently from 77a6cb8 to 5ce17b3 Compare October 2, 2025 14:10
Contributor

@wjones127 wjones127 left a comment

It's exciting that we'll have this soon.

My main concern is understanding if and why into_df_repeat_exec is necessary.

Comment on lines 1139 to 1140
.into_df_repeat_exec(RowCount::from(15), BatchCount::from(2))
.unwrap();
Contributor

Why was this change necessary?

I don't see it being re-used below. Unless it means the query itself has to scan the input data twice. If that's the case, it seems worrying to me.

Contributor Author

Yes, the query itself has to scan the input data twice. When we compute the score for a document in FTS, we need some statistics over all documents:

  • number of documents
  • average document length
  • number of documents containing the query term.

These statistics are stored in BM25Scorer (https://github.com/lancedb/lance/pull/4859/files#diff-60d4f0739e08b11e907ca92afb4b5308809e2a4f71f53418074de3b5f3dbda46R25).

Since we want to support full text search when there is no index or the index is empty, we can't rely on a BM25Scorer built from the index. We have to scan once to build the BM25Scorer, and then scan a second time to score each document.
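For reference, a minimal sketch of how the three statistics listed above feed a BM25 score. The `CorpusStats` struct, its field names, the "plus one" IDF variant, and the constants `K1 = 1.2` and `B = 0.75` are assumptions chosen for illustration; they are not taken from the `BM25Scorer` in this PR, which may differ.

```rust
/// Corpus-level statistics needed to score one query term with BM25.
/// Hypothetical struct for illustration; not the `BM25Scorer` from this PR.
#[derive(Debug, Clone, Copy)]
pub struct CorpusStats {
    /// Total number of documents.
    pub num_docs: u64,
    /// Total number of tokens across all documents.
    pub total_tokens: u64,
    /// Number of documents containing the query term.
    pub docs_with_term: u64,
}

impl CorpusStats {
    /// Average document length in tokens (0.0 for an empty corpus).
    pub fn avg_doc_len(&self) -> f32 {
        if self.num_docs == 0 {
            0.0
        } else {
            self.total_tokens as f32 / self.num_docs as f32
        }
    }

    /// Inverse document frequency, in a form that is never negative.
    pub fn idf(&self) -> f32 {
        let n = self.num_docs as f32;
        let df = self.docs_with_term as f32;
        ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
    }

    /// BM25 score of one document for the query term.
    pub fn score(&self, term_freq: u32, doc_len: u32) -> f32 {
        // Assumed defaults; a real implementation may use other values.
        const K1: f32 = 1.2;
        const B: f32 = 0.75;
        let avgdl = self.avg_doc_len();
        if avgdl == 0.0 {
            return 0.0;
        }
        let tf = term_freq as f32;
        let norm = 1.0 - B + B * doc_len as f32 / avgdl;
        self.idf() * tf * (K1 + 1.0) / (tf + K1 * norm)
    }
}
```

None of the three fields is known until every document has been seen, which is why scoring without an index needs either a second scan or an approximation.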

Thanks @wjones127 for pointing out the double-scan issue. Scanning unindexed data twice is indeed a problem that can't be ignored, so I came up with an alternative:
We can compute the BM25Scorer incrementally while scanning the unindexed data. When scoring, we combine the incremental BM25Scorer with the index BM25Scorer. By sacrificing some score accuracy, we only need to scan the unindexed data once. Our original implementation used the index BM25Scorer to score unindexed data, which already sacrifices some accuracy, so this solution might be feasible.
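A rough sketch of the single-scan idea, under stated assumptions: `TermStats`, `observe`, `merged`, and `score_single_pass` are hypothetical names, tokenization is plain whitespace splitting, and the BM25 constants (k1 = 1.2, b = 0.75) are assumed defaults. It is not the code in this PR.

```rust
/// Hypothetical running statistics for one query term (illustration only).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct TermStats {
    pub num_docs: u64,
    pub total_tokens: u64,
    pub docs_with_term: u64,
}

impl TermStats {
    /// Fold one scanned document into the running statistics.
    pub fn observe(&mut self, doc_len: u32, term_freq: u32) {
        self.num_docs += 1;
        self.total_tokens += u64::from(doc_len);
        if term_freq > 0 {
            self.docs_with_term += 1;
        }
    }

    /// Combine two sets of statistics, e.g. index stats and scan stats.
    pub fn merged(self, other: TermStats) -> TermStats {
        TermStats {
            num_docs: self.num_docs + other.num_docs,
            total_tokens: self.total_tokens + other.total_tokens,
            docs_with_term: self.docs_with_term + other.docs_with_term,
        }
    }

    /// BM25 score with assumed k1 = 1.2 and b = 0.75.
    pub fn score(&self, term_freq: u32, doc_len: u32) -> f32 {
        if self.num_docs == 0 || self.total_tokens == 0 {
            return 0.0;
        }
        let (n, df) = (self.num_docs as f32, self.docs_with_term as f32);
        let avgdl = self.total_tokens as f32 / n;
        let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
        let tf = term_freq as f32;
        let norm = 0.25 + 0.75 * doc_len as f32 / avgdl;
        idf * tf * 2.2 / (tf + 1.2 * norm)
    }
}

/// Score unindexed documents in one pass. Each document is scored with the
/// index statistics plus whatever the scan has seen so far, so early
/// documents see less of the corpus than later ones.
pub fn score_single_pass(index: TermStats, docs: &[&str], term: &str) -> Vec<f32> {
    let mut seen = TermStats::default();
    docs.iter()
        .map(|doc| {
            let tokens: Vec<&str> = doc.split_whitespace().collect();
            let tf = tokens.iter().filter(|&&t| t == term).count() as u32;
            let len = tokens.len() as u32;
            seen.observe(len, tf);
            index.merged(seen).score(tf, len)
        })
        .collect()
}
```

Passing `TermStats::default()` as `index` models the empty-index case: scores then depend only on what the scan has accumulated so far.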

@wjones127 what do you think?

Contributor

If we only have to scan twice when there is no index data, then I think that's acceptable.

Contributor Author

Hi @wjones127, I updated this PR with the new approach, which only scans once. Please help review when you have time. Thanks very much!

@wojiaodoubao wojiaodoubao force-pushed the fts-flat-match-fix branch 3 times, most recently from 751e4ea to 71c21f0 Compare October 4, 2025 13:25
Contributor

@wjones127 wjones127 left a comment

This seems like a good solution for now.

One thing we could consider for doing this in a single pass is making the scoring collect all unindexed documents in memory until all of them have been processed, and then assigning scores after. Could use spilling if this becomes large.
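One way to picture the buffered single-pass variant, as a sketch only: the `Pending` struct and `scan_then_score` function are hypothetical, tokenization is whitespace splitting, k1 = 1.2 and b = 0.75 are assumed, and spilling to disk is not shown. The sketch also assumes that buffering the per-document term frequency and length is enough, rather than the full document text.

```rust
/// Per-document data buffered until the scan finishes (hypothetical).
struct Pending {
    row_id: u64,
    term_freq: u32,
    doc_len: u32,
}

/// Read every document once, buffer what scoring needs, then score all
/// buffered documents with the final corpus statistics.
pub fn scan_then_score(docs: &[&str], term: &str) -> Vec<(u64, f32)> {
    const K1: f32 = 1.2; // assumed default
    const B: f32 = 0.75; // assumed default

    let mut pending = Vec::with_capacity(docs.len());
    let (mut total_tokens, mut docs_with_term) = (0u64, 0u64);

    // Single pass: accumulate statistics and buffer per-document counts.
    for (row_id, doc) in docs.iter().enumerate() {
        let (mut doc_len, mut term_freq) = (0u32, 0u32);
        for token in doc.split_whitespace() {
            doc_len += 1;
            if token == term {
                term_freq += 1;
            }
        }
        total_tokens += u64::from(doc_len);
        if term_freq > 0 {
            docs_with_term += 1;
        }
        pending.push(Pending { row_id: row_id as u64, term_freq, doc_len });
    }

    // No tokens at all: nothing can match, so every score is zero.
    if total_tokens == 0 {
        return pending.iter().map(|p| (p.row_id, 0.0)).collect();
    }

    // Statistics are now complete, so every document gets an exact score.
    let n = docs.len() as f32;
    let df = docs_with_term as f32;
    let avgdl = total_tokens as f32 / n;
    let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
    pending
        .into_iter()
        .map(|p| {
            let tf = p.term_freq as f32;
            let norm = 1.0 - B + B * p.doc_len as f32 / avgdl;
            (p.row_id, idf * tf * (K1 + 1.0) / (tf + K1 * norm))
        })
        .collect()
}
```

The trade-off relative to the incremental approach is memory: the buffer grows with the number of unindexed documents, which is where spilling would come in.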

But I think we can leave that for a future enhancement.

@wjones127 wjones127 merged commit b54b3fe into lancedb:main Oct 9, 2025
34 of 40 checks passed

Labels

bug Something isn't working python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow querying empty FTS index

3 participants