Merge blevesearch master changes #1

cmrajan · 2025-03-07T00:25:05Z

No description provided.

Normalization of accented letters currently happens only when the input is longer than 5 characters, a condition that neither `guía` nor `fría` meets, for example. The solution is to always run the accented-character normalization by moving it to a separate file, as is done in the German analyzer. Fixes: #1956
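As a rough illustration of the fix described above (not the actual bleve filter), the sketch below folds accents for every term regardless of its length. The helper name `normalizeAccents` and the exact set of folded characters are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// accentFold maps accented Spanish vowels to their unaccented forms.
// The character set is an assumption for illustration; the real analyzer
// may cover a different set.
var accentFold = strings.NewReplacer(
	"á", "a", "é", "e", "í", "i", "ó", "o", "ú", "u", "ü", "u",
)

// normalizeAccents is a hypothetical helper: it folds accents for every
// input, with no minimum-length check in front of it.
func normalizeAccents(term string) string {
	return accentFold.Replace(term)
}

func main() {
	for _, w := range []string{"guía", "fría", "canción"} {
		fmt.Println(w, "->", normalizeAccents(w))
	}
}
```

Short inputs such as `guía` and `fría` take the same path as longer ones, which is the behaviour the commit message asks for.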

Co-authored-by: Rahul Rampure <[email protected]> Co-authored-by: Aditi Ahuja <[email protected]> Co-authored-by: Likith B <[email protected]> Co-authored-by: Mohd Shaad Khan <[email protected]> Co-authored-by: Thejas-bhat <[email protected]>

…#2000)

- Instead of merging all the PreSearch results in one shot at the main coordinator node of an alias tree, merge them incrementally at each level of the tree. This balances the reduction process across all the indexes in a distributed Bleve index, leading to a more even memory distribution. --------- Co-authored-by: Abhinav Dangeti <[email protected]>
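The incremental reduction can be pictured with a small tree model. This is only a sketch under stated assumptions: `preSearchResult`, `node`, and `leaf` are hypothetical types and helpers, and a plain document count stands in for whatever the real presearch phase returns.

```go
package main

import "fmt"

// preSearchResult is a hypothetical stand-in for a presearch result;
// here it only carries a document count.
type preSearchResult struct {
	totalDocs int
}

// merge folds another partial result into r.
func (r *preSearchResult) merge(other preSearchResult) {
	r.totalDocs += other.totalDocs
}

// node models one level of an alias tree: either a leaf index holding its
// own result, or an alias over child nodes.
type node struct {
	local    preSearchResult
	children []*node
}

// preSearch reduces the subtree rooted at n. Each alias merges its
// children's results before handing a single value upward, so the root
// only ever merges one result per direct child.
func (n *node) preSearch() preSearchResult {
	if len(n.children) == 0 {
		return n.local
	}
	var merged preSearchResult
	for _, c := range n.children {
		merged.merge(c.preSearch())
	}
	return merged
}

// leaf builds a leaf index with the given document count.
func leaf(docs int) *node {
	return &node{local: preSearchResult{totalDocs: docs}}
}

func main() {
	root := &node{children: []*node{
		{children: []*node{leaf(10), leaf(20)}},
		{children: []*node{leaf(5)}},
	}}
	fmt.Println(root.preSearch().totalDocs)
}
```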

…2004) This change is intended to save compute in determining "vectorIdsToExclude" during search(..). Requires: - blevesearch/scorch_segment_api#40 - blevesearch/zapx#229

- There is a chance that the merger sees no segments eligible for merging in the current iteration, which leaves the task list empty. In this situation the merger, which did not update the root snapshot, would still notify the persister.
- If the persister was napping at that point, and mutations were coming into the system (so the root snapshot would be updated by the introducer), the persister would be woken up and start flushing the in-memory segments to disk.
- Perhaps a better behaviour would be to not let the merger notify the persister when there is no change in the root snapshot. This would help the persister perform healthier in-memory merging of the segments before persisting them to disk, thereby helping to control the number of IO ops scorch performs.

Some numbers from local testing (~4.18M dataset with lorem ipsum content)

With patch

```
"num_bytes_used_ram": 186097560,
"test_bucket:test_bucket._default.test:num_file_merge_ops": 7,
"test_bucket:test_bucket._default.test:num_mutations_to_index": 0,
"test_bucket:test_bucket._default.test:num_persister_nap_merger_break": 4,
"test_bucket:test_bucket._default.test:num_persister_nap_pause_completed": 80,
```

Without patch

```
"num_bytes_used_ram": 265234328,
"test_bucket:test_bucket._default.test:num_file_merge_ops": 10,
"test_bucket:test_bucket._default.test:num_mutations_to_index": 0,
"test_bucket:test_bucket._default.test:num_persister_nap_merger_break": 69,
"test_bucket:test_bucket._default.test:num_persister_nap_pause_completed": 45,
```

--------- Co-authored-by: Abhinav Dangeti <[email protected]>

…2003)" (#2005) This reverts commit 4fd8313.

Includes: * 9e2514f Aditi Ahuja | MB-60943 - Add a coarse quantiser to the IVF indexes (https://github.com/blevesearch/zapx/pulls/225)

@Thejas-bhat

…#2006) Authored-by: @Thejas-bhat Original: #2003

- There is a chance that the merger sees no segments eligible for merging in the current iteration, which leaves the task list empty. In this situation the merger, which did not update the root snapshot, would still notify the persister.
- If the persister was napping at that point, and mutations were coming into the system (so the root snapshot would be updated by the introducer), the persister would be woken up and start flushing the in-memory segments to disk.
- Perhaps a better behaviour would be to let the persister nap for the remaining duration and only then let it do some work. This would also let the merger wait for the notification reply (an interrupt-style wait) rather than doing something more expensive, such as letting the merger continue to do work (which the earlier commits of this PR did).

Some numbers from local testing (~4.18M dataset with lorem ipsum content)

With patch

```
"num_bytes_used_ram": 224100280,
"num_files_on_disk": 34,
"test_bucket:test_bucket._default.travel:num_file_merge_ops": 6,
"test_bucket:test_bucket._default.travel:num_file_merge_plan": 0,
"test_bucket:test_bucket._default.travel:num_files_on_disk": 34,
"test_bucket:test_bucket._default.travel:num_mem_merge_ops": 70,
"test_bucket:test_bucket._default.travel:num_persister_nap_merger_break": 4,
"test_bucket:test_bucket._default.travel:num_persister_nap_pause_completed": 69,
"test_bucket:test_bucket._default.travel:num_root_filesegments": 16,
"test_bucket:test_bucket._default.travel:num_root_memorysegments": 0,
"TotFileMergePlan": 45,
"TotFileMergePlanErr": 0,
"TotFileMergePlanNone": 1,
"TotFileMergePlanOk": 44,
```

Without patch

```
"num_bytes_used_ram": 252726152,
"num_files_on_disk": 45,
"test_bucket:test_bucket._default.travel:num_file_merge_ops": 11,
"test_bucket:test_bucket._default.travel:num_file_merge_plan": 0,
"test_bucket:test_bucket._default.travel:num_files_on_disk": 45,
"test_bucket:test_bucket._default.travel:num_mem_merge_ops": 129,
"test_bucket:test_bucket._default.travel:num_persister_nap_merger_break": 85,
"test_bucket:test_bucket._default.travel:num_persister_nap_pause_completed": 44,
"test_bucket:test_bucket._default.travel:num_root_filesegments": 33,
"test_bucket:test_bucket._default.travel:num_root_memorysegments": 0,
"TotFileMergePlan": 96,
"TotFileMergePlanErr": 0,
"TotFileMergePlanNone": 1,
"TotFileMergePlanOk": 95,
```

--------- Co-authored-by: Thejas-bhat <[email protected]>

…m merger" (#2010) This reverts commit 2d81bf0 ( #2006 ) on account of the regression highlighted with MB-61447.

Signed-off-by: mountcount <[email protected]>

* 91a5e17 Abhi Dangeti | Revert "MB-60943 - Add a coarse quantiser to the IVF indexes" (blevesearch/zapx#232)

…s` (#2014)

Includes: * eeb2336 Likith B | MB-61029: Caching Vec To DocID Map (blevesearch/zapx#231) * b2384fc Rahul Rampure | minor optimizations and bug fixes (blevesearch/zapx#233) * b56abea Thejas-bhat | MB-61029: Deferring the closing of vector index (blevesearch/zapx#226)

- Added a new field type called `vector_base64`.
- It acts like `vector` in most cases.
- When a new document arrives in the bleve layer, during the parsing of all its fields in processProperty, if the field mapping type is `vector_base64`, its value is decoded into a vector field and processed like a vector.
- The standard Go base64 library is used for the decode operation.

--------- Co-authored-by: Abhinav Dangeti <[email protected]>
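The decode step can be sketched with the standard library alone. The commit message does not state the byte layout of the payload, so this example assumes a sequence of little-endian float32 values; `decodeVector` and `encodeVector` are hypothetical helpers, not bleve functions.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"math"
)

// decodeVector is a hypothetical helper. It assumes the base64 payload is
// a sequence of little-endian float32 values (an assumption, not stated
// in the source).
func decodeVector(encoded string) ([]float32, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	if len(raw)%4 != 0 {
		return nil, fmt.Errorf("decoded length %d is not a multiple of 4", len(raw))
	}
	vec := make([]float32, len(raw)/4)
	for i := range vec {
		bits := binary.LittleEndian.Uint32(raw[i*4:])
		vec[i] = math.Float32frombits(bits)
	}
	return vec, nil
}

// encodeVector is the inverse operation, used here only to build sample input.
func encodeVector(vec []float32) string {
	raw := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(raw[i*4:], math.Float32bits(v))
	}
	return base64.StdEncoding.EncodeToString(raw)
}

func main() {
	encoded := encodeVector([]float32{0.5, -1.25, 3})
	vec, err := decodeVector(encoded)
	fmt.Println(encoded, vec, err)
}
```

Rejecting payloads whose decoded length is not a multiple of four keeps a truncated string from silently producing a shorter vector.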

- Fix unit tests that were racy and failed intermittently.

- Fix the BytesRead and BytesStored unit tests to pass when using the `vectors` tag. The following commands can be used to validate that all UTs now pass:
  - `go test -tags=vectors -race ./...`
  - `go test -race ./...`

--------- Co-authored-by: Abhinav Dangeti <[email protected]>

+ Allow calling application to change min/max. + Will require downstream changes to disallow any setting greater than 2048 in the event of a mixed version cluster.

+ Test(s) in this file hang on my computer and it seems they've started hanging on the workflow jobs as well.

) + Omitting the score from the document hits when score:"none" causes an older SDK to panic that was expecting to see the attribute regardless. + This commit reverts a portion of what was proposed with #1930.

blevesearch/go-faiss: * 693b06a Rahul Rampure | MB-61650: Release IDSelectorBatch's batchSelector to avoid memory leak

Includes: * d8f2ddf Abhi Dangeti | MB-60697: Windows requires nprobe to be of 'long long' type * 9bb55f8 Abhi Dangeti | Retain IDSelector's Delete() API for complete-ness

Includes: * 6fe4e6b Aditi Ahuja | MB-60943 - Reduce number of centroids for IVF indexes. * 8de5651 Rahul Rampure | add map capacity

…plan computation (#2002) The existing merge policy uses a segment's total and live doc counts as the measure for determining merge tasks. A doc may contain multiple vector fields (even chunked vectors) of varying dimensions, which usually means the index size is greater than the docs' size by orders of magnitude. Using doc count as the merge policy measure can therefore easily lead to the formation of huge segments, which is a problem at both merge and search time. This PR therefore uses the segment's fileSize as an additional limiting check, to ensure that we don't end up creating huge segments. Example: 2M docs, each with a vector field of 2048 dimensions: size := 2000000 * 2048 * 4 bytes ≈ 16GB
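The size estimate in the example can be checked with a few lines of Go. This is only the arithmetic from the message, assuming 4 bytes per dimension (float32); `estimateVectorBytes` is a hypothetical helper and ignores any index overhead.

```go
package main

import "fmt"

// estimateVectorBytes returns the raw size of the vector data alone:
// docs * dims * 4 bytes, assuming each dimension is a float32.
func estimateVectorBytes(docs, dims int64) int64 {
	const bytesPerFloat32 = 4
	return docs * dims * bytesPerFloat32
}

func main() {
	size := estimateVectorBytes(2_000_000, 2048)
	fmt.Printf("%d bytes (about %.1f GB)\n", size, float64(size)/1e9)
}
```

The exact product is 16,384,000,000 bytes, which the commit message rounds to 16GB.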

Addresses: #2028

- Refactor the presearch code path to make it more generic and extensible. - Add an ExtractFields API for obtaining the set of fields applicable to a generic query. - Add support for a new API to set the index mapping for an alias. This is useful when an alias contains partitions of the same index, as the index mapping would be consistent across all indexes and can be inferred directly from the alias.

…2117) + Setting the default to 0 on account of the panics caught in the MB. + Firstly, refactor `FieldTFRCacheThreshold` to `fieldTFRCacheThreshold` for _some_ naming consistency here. + This threshold can be used to toggle recycling of TermFieldReaders on/off. + Couchbase users will have the ability to provide this setting within a JSON payload, which when interpreted into a `map[string]interface{}` will need to be interpreted as a `float64`. + Should library users set it as an `int` within the index config - we'll honor that setting as well.
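The two accepted value types can be handled with a type switch. The sketch below is an assumption about how such a lookup might look, not the code from the PR: `thresholdFromConfig` is a hypothetical helper, and using `fieldTFRCacheThreshold` as the config key is inferred from the rename mentioned above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// thresholdKey is assumed to be the config key, based on the rename
// described in the commit message.
const thresholdKey = "fieldTFRCacheThreshold"

// thresholdFromConfig is a hypothetical helper. It returns 0 (the default
// named in the commit message) when the key is absent, and accepts both
// float64 (what encoding/json produces for numbers) and int.
func thresholdFromConfig(config map[string]interface{}) (int, error) {
	v, ok := config[thresholdKey]
	if !ok {
		return 0, nil
	}
	switch t := v.(type) {
	case float64:
		return int(t), nil
	case int:
		return t, nil
	default:
		return 0, fmt.Errorf("%s: unsupported type %T", thresholdKey, v)
	}
}

func main() {
	// A JSON payload decodes numbers into float64.
	var fromJSON map[string]interface{}
	if err := json.Unmarshal([]byte(`{"fieldTFRCacheThreshold": 256}`), &fromJSON); err != nil {
		panic(err)
	}
	fmt.Println(thresholdFromConfig(fromJSON))

	// A library user may set an int directly.
	fmt.Println(thresholdFromConfig(map[string]interface{}{thresholdKey: 128}))
}
```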

MB-64604: Merge remote-tracking branch 'origin/trinity-couchbase' into 'master'

go-faiss: * 2127bb0 Likith B | MB-64513: Removing modification of max codes based on filtered document size

- Allow setting up `synonym_sources` in the index mapping, which will follow its own ingest pipeline, ingesting special synonym definitions using the IndexSynonym API().
- A `synonym_source` can be set like an analyzer to a field mapping and can be set as a default option at the document mapping or the index mapping level.
- Each `synonym_source` can have its own analyzer, making it flexible to allow for compatibility with the language analyzer specified for its corresponding mapping.
- Compatibility with every term-based query where the term gets expanded to include its synonyms at query time.
- Dependencies:
  - blevesearch/[email protected]
    - blevesearch/bleve_index_api#57
  - blevesearch/[email protected]
    - blevesearch/scorch_segment_api#46
  - blevesearch/[email protected]
    - blevesearch/vellum#22
  - blevesearch/zapx@v16@latest
    - blevesearch/zapx#268

--------- Co-authored-by: Abhinav Dangeti <[email protected]>

**What this PR does:**
- Adds a test file for the Snowball Turkish stemmer.
- Tests various Turkish words to ensure proper stemming.

**Why this is useful:**
- Improves test coverage for Turkish language support in Bleve.
- Ensures the Snowball stemmer works as expected for Turkish words.

**Notes:**
- This PR only adds a test file and does not modify any existing functionality.

…2129)

1. Changed the weight of a kNN query to 1 to allow the boost value to kick in when computing the query score. Since only the weight of the kNN scorer is changed, this will not impact how boosting is calculated for other types of queries. To reduce the kNN score relative to the FTS query score, set boost to <1.
2. Added a unit test which demonstrates boost increasing scores even for pure kNN queries (kNN + match none query).

Introducing support for BM25 scoring.

Key stats necessary for the scoring:
- fieldLength - the number of terms in a field within a doc.
- avgDocLength - the average number of terms in a field across all the docs in the index.
- totalDocs - the total number of docs in an index.

Introduces a mechanism to maintain consistent scoring when the index is partitioned as a `bleve.IndexAlias`. This is achieved using the existing preSearch mechanism: the first phase of the search fetches the above-mentioned stats, aggregates them, and redistributes them back to the bleve indexes, which use them while calculating the score for a hit. To enable this global scoring mechanism, the user needs to set the `context` argument of SearchInContext with: `ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)`

Implementation-wise, the user needs to explicitly select BM25 as the scoring mechanism at the `indexMapping.ScoringModel` level to actually use it. This parameter is a global setting, i.e. when performing a search on multiple fields, all the fields are scored with the same scoring model.

The storage layer exposes an API that returns the number of terms in a field's term dictionary, which is used to compute `avgDocLength`. At the indexing layer, we check whether the queried field supports BM25 scoring and whether consistent scoring is enabled. The stats are then fetched either from the local bleve index or from a context (when consistent scoring is in use) to compute the actual score.

Note: The scoring is highly dependent on the size of an individual bleve index's termDictionary (specific to a field), so there can be some discrepancies, especially given that each index is further composed of multiple 'segments'. However, in large-scale use cases these discrepancies can be quite small and don't affect the order of the doc hits, in which case the user may choose to skip global scoring altogether.
--------- Co-authored-by: Aditi Ahuja <[email protected]> Co-authored-by: Abhinav Dangeti <[email protected]>
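To show how the three stats above feed a score, here is a sketch of the commonly cited BM25 formulation. The commit message does not spell out the formula or its constants, so `k1 = 1.2`, `b = 0.75`, the `docFreq` input (number of docs containing the term), and the function name `bm25` are all assumptions made for illustration rather than bleve's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed tuning constants; these are widely used defaults, not values
// taken from the source.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25 scores one term for one document field.
//   tf           - occurrences of the term in the field
//   docFreq      - docs containing the term (assumed input)
//   totalDocs    - total number of docs in the index
//   fieldLength  - number of terms in this doc's field
//   avgDocLength - average number of terms in the field across all docs
func bm25(tf, docFreq, totalDocs, fieldLength, avgDocLength float64) float64 {
	idf := math.Log(1 + (totalDocs-docFreq+0.5)/(docFreq+0.5))
	lengthNorm := 1 - b + b*fieldLength/avgDocLength
	return idf * tf * (k1 + 1) / (tf + k1*lengthNorm)
}

func main() {
	// Same term frequency, different field lengths: the shorter field
	// should score higher.
	short := bm25(2, 10, 1000, 50, 100)
	long := bm25(2, 10, 1000, 200, 100)
	fmt.Printf("short field: %.4f, long field: %.4f\n", short, long)
}
```

Because `totalDocs` and `avgDocLength` appear directly in the formula, partitions with different local values would score the same hit differently, which is why the aggregated stats matter for an alias.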

…examples (#2132) - Add docs/synonyms.md to provide an overview on synonym search.

Co-authored-by: Abhi Dangeti <[email protected]>

#2103) - Previously, the `PartialMatch` field was returned for every hit, but this caused confusion in complex queries involving disjunctions and match queries. As a result, we moved `PartialMatch` to the score explanation, where each subquery's explanation will include its own `PartialMatch`. This field is set only if the query uses the DisjunctionSearcher or scorer.

Requires: * blevesearch/scorch_segment_api#49 * blevesearch/zapx#296 * blevesearch/zapx#297 * blevesearch/zapx#298 * blevesearch/zapx#299 * blevesearch/zapx#300 * blevesearch/zapx#301

- IP fields were not returned in the search response, even when stored, because they were not handled in the `LoadAndHighlightFields` API. - Requires blevesearch/bleve_index_api#60 --------- Co-authored-by: Abhinav Dangeti <[email protected]>

- Add a Scorch counter stat `TotSynonymSearches` to track the number of synonym-enabled queries received by the index. This stat will be incremented by 1 each time the FieldTermSynonymMap is set, indicating that the query will use synonyms. --------- Co-authored-by: Abhinav Dangeti <[email protected]>

* a20efc1 Aditi Ahuja | MB-64883 - Avoid redundant computation of eligible IDs (#2143) --------- Co-authored-by: Aditi Ahuja <[email protected]>

…ate registration (#2151) - The `registry` package currently panics, crashing Bleve, when a duplicate component is registered. This makes it difficult to handle duplicate registrations gracefully and requires a recover statement to keep the application from crashing. - To improve error handling and maintainability, the refactored API changes `Register<registry-component>` to return an error instead of panicking when a duplicate registration is detected. This allows developers to check whether a `<registry-component>` is already registered and handle it accordingly without relying on recover. - Addresses #2125
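The error-returning pattern can be sketched with a toy registry. This is a simplified model rather than bleve's `registry` package: the `registry` and `constructor` types, the `Register` method, and `ErrAlreadyRegistered` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyRegistered is a hypothetical sentinel error for duplicates.
var ErrAlreadyRegistered = errors.New("already registered")

// constructor stands in for whatever builds a registered component.
type constructor func() interface{}

// registry is a toy name -> constructor table.
type registry map[string]constructor

// Register returns an error on a duplicate name instead of panicking,
// and leaves the existing entry untouched.
func (r registry) Register(name string, c constructor) error {
	if _, exists := r[name]; exists {
		return fmt.Errorf("component %q: %w", name, ErrAlreadyRegistered)
	}
	r[name] = c
	return nil
}

func main() {
	analyzers := registry{}
	build := func() interface{} { return "custom analyzer" }

	fmt.Println(analyzers.Register("custom", build))

	// The second registration fails, and the caller decides what to do;
	// no recover is needed.
	if err := analyzers.Register("custom", build); errors.Is(err, ErrAlreadyRegistered) {
		fmt.Println("skipping duplicate:", err)
	}
}
```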

* 355d2eb Abhi Dangeti | Use tags for bleve_index_api, scorch_segment_api * 6ba256e Abhi Dangeti | [faiss_vector_posting] Refactor searchWithFilter for readability * 8187adc Abhi Dangeti | [section_faiss_vector_index] Minor refactor removing unnecessary vars

`index_bgthreads_active` indicates whether the background routines that maintain the index are busy doing some work. This means that the index hasn't converged to a steady state and there could be potential file segment merges or in-memory segment flushes still remaining. This stat is beneficial for the application layer to get better insight as to whether the index is doing some background work (which can have implications on the system's resource utilisation) or not.

The http/ dir used _only_ by bleve-explorer has been relocated to blevesearch/bleve-explorer with: blevesearch/bleve-explorer#24 Fixes: #2155

…at qualify for a type mapping (#2157) - This stat tracks the number of mutations processed by the Bleve index that are searchable by the user. It helps monitor the progress of the index-building process. - The stat is incremented for each mutation processed by the Bleve index that is associated with a valid type mapping. --------- Co-authored-by: Abhinav Dangeti <[email protected]>

moshaad7 and others added 30 commits December 21, 2023 18:14

MB-60207 fix facets merge (#1946)

e26eace

Relocate vectors.md into docs/. (#1998)

c9edd4c

MB-61216: Interpret empty query (in search request) as a match_none (…

a1e4a0e

…#2000)

Update README: supported field types (#2001)

3a10df5

MB-60791: Conform to new interfaces for VectorIndex & VectorSegment (#…

24bcdcb

…2004) This change is intended to save compute in determining "vectorIdsToExclude" during search(..). Requires: - blevesearch/scorch_segment_api#40 - blevesearch/zapx#229

Revert "MB-60971: Avoiding unnecessary persister notifs from merger (#…

5939277

…2003)" (#2005) This reverts commit 4fd8313.

Bump up zapx/v16 & move GO version to 1.21 (#2008)

543064f

Includes: * 9e2514f Aditi Ahuja | MB-60943 - Add a coarse quantiser to the IVF indexes (https://github.com/blevesearch/zapx/pulls/225)

Revert "MB-60971: Avoiding work on persister side on no-op notifs fro…

8f3a29f

…m merger" (#2010) This reverts commit 2d81bf0 ( #2006 ) on account of the regression highlighted with MB-61447.

chore: fix function names in comment (#2011)

3b57c5f

Signed-off-by: mountcount <[email protected]>

MB-60943: Bump up zapx/v16 for MB-61470 (#2013)

6171178

* 91a5e17 Abhi Dangeti | Revert "MB-60943 - Add a coarse quantiser to the IVF indexes" (blevesearch/zapx#232)

ClientContextID is missing from SearchRequest when built with `vector…

f977418

…s` (#2014)

Fixed TestVectorBase64 (#2017)

24d85e7

Fix unit tests (#2018)

642b034

- Fix unit tests that were racy and failed intermittently.

MB-61009: Raise max supported dimensionality of vectors to 4096 (#2015)

63e9c3f

+ Allow calling application to change min/max. + Will require downstream changes to disallow any setting greater than 2048 in the event of a mixed version cluster.

Disable upside_down's boltdb tests on darwin arm64 (#2020)

e687956

+ Test(s) in this file hang on my computer and it seems they've started hanging on the workflow jobs as well.

MB-60719: Do not omit score from document hits, even when empty (#2022

c742605

) + Omitting the score from the document hits when score:"none" causes an older SDK to panic that was expecting to see the attribute regardless. + This commit reverts a portion of what was proposed with #1930.

chore: fix function name (#2021)

5ae70e3

MB-61650: Address memory leak in faiss query path (#2023)

d76da64

blevesearch/go-faiss: * 693b06a Rahul Rampure | MB-61650: Release IDSelectorBatch's batchSelector to avoid memory leak

MB-60697: Upgrade blevesearch/go-faiss for windows fix (#2024)

1482c12

Includes: * d8f2ddf Abhi Dangeti | MB-60697: Windows requires nprobe to be of 'long long' type * 9bb55f8 Abhi Dangeti | Retain IDSelector's Delete() API for complete-ness

MB-60943: Bump up zapx/v16 (#2025)

9a4c565

Includes: * 6fe4e6b Aditi Ahuja | MB-60943 - Reduce number of centroids for IVF indexes. * 8de5651 Rahul Rampure | add map capacity

Fix doc's json field interpretation in example (#2029)

c76f76d

Addresses: #2028

CascadingRadium and others added 29 commits December 17, 2024 14:22

MB-64604: Remove unnecessary second map lookup (#2121)

5c53634

Merge pull request #2120 from blevesearch/mb64604_forwardmerge

a8d41b3

MB-64604: Merge remote-tracking branch 'origin/trinity-couchbase' into 'master'

MB-64513: Upgrade blevesearch/go-faiss, zapx/v16 for fix (#2122)

3e63d1c

go-faiss: * 2127bb0 Likith B | MB-64513: Removing modification of max codes based on filtered document size

Update title in vectors.md and add place holder for synonyms README (#…

2ef6286

…2129)

Update READMEs, new placeholders (#2131)

bf28642

Add synonym search documentation with detailed indexing and querying …

4a8034b

…examples (#2132) - Add docs/synonyms.md to provide an overview on synonym search.

Update documentation around scoring (#2133)

b7b67d3

Co-authored-by: Abhi Dangeti <[email protected]>

Update to RoaringBitmap/roaring's v2.4.4 (#2135)

c30d0a1

Requires: * blevesearch/scorch_segment_api#49 * blevesearch/zapx#296 * blevesearch/zapx#297 * blevesearch/zapx#298 * blevesearch/zapx#299 * blevesearch/zapx#300 * blevesearch/zapx#301

Upgrade to [email protected] (#2139)

de678d6

MB-64766: Decoding geo distances from the sort results (#2137)

0cc82a6

Merge remote-tracking branch 'origin/7.6.x-couchbase' (#2148)

5bb86d4

* a20efc1 Aditi Ahuja | MB-64883 - Avoid redundant computation of eligible IDs (#2143) --------- Co-authored-by: Aditi Ahuja <[email protected]>

Update docs/vectors.md for v2.5.0 (#2145)

de9c49c

Typo in docs/vectors.md (#2149)

fdc60eb

Update README with new features available in v2.5.0 (#2154)

166dadb

Remove http/ from bleve for CVE-2022-31022 (#2156)

af9e311

The http/ dir used _only_ by bleve-explorer has been relocated to blevesearch/bleve-explorer with: blevesearch/bleve-explorer#24 Fixes: #2155

cmrajan merged commit 5238510 into cmrajan:master Mar 7, 2025
9 checks passed


13 participants
