forked from blevesearch/bleve
Merge blevesearch master changes #1
Merged
Conversation
Normalization of accented letters only happens if the input is longer than 5 characters, a condition that, for example, neither `guía` nor `fría` meets. The solution is to always execute the accented-character normalization, by moving it to a separate file just like it is done in the German analyzer. Fixes: #1956
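A quick way to see the effect is to run the analyzer directly against a short accented word; this is a minimal sketch, assuming the Spanish analyzer is registered under `es.AnalyzerName` in the `analysis/lang/es` package (the usual bleve layout):

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/lang/es"
)

func main() {
	m := bleve.NewIndexMapping()
	// With the fix, accent normalization applies to short inputs such as
	// "guía" and "fría" too, not only to tokens longer than 5 characters.
	tokens, err := m.AnalyzeText(es.AnalyzerName, []byte("guía fría"))
	if err != nil {
		panic(err)
	}
	for _, tok := range tokens {
		fmt.Println(string(tok.Term))
	}
}
```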
Co-authored-by: Rahul Rampure <[email protected]> Co-authored-by: Aditi Ahuja <[email protected]> Co-authored-by: Likith B <[email protected]> Co-authored-by: Mohd Shaad Khan <[email protected]> Co-authored-by: Thejas-bhat <[email protected]>
- Instead of merging all the preSearch results in one shot at the main coordinator node of an alias tree, merge them incrementally at each level of the tree. This balances the reduction work across all the indexes in a distributed bleve index, leading to a more even memory distribution. --------- Co-authored-by: Abhinav Dangeti <[email protected]>
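For context, a sketch of the kind of alias tree this affects; index paths are illustrative. With this change, each intermediate alias reduces its children's preSearch results before handing them up, instead of deferring all merging to the root:

```go
// Leaf indexes -> per-level aliases -> root alias.
idx1, _ := bleve.Open("shard1.bleve") // error handling elided for brevity
idx2, _ := bleve.Open("shard2.bleve")
idx3, _ := bleve.Open("shard3.bleve")
idx4, _ := bleve.Open("shard4.bleve")

levelA := bleve.NewIndexAlias(idx1, idx2)
levelB := bleve.NewIndexAlias(idx3, idx4)
root := bleve.NewIndexAlias(levelA, levelB)

// PreSearch results are now merged at levelA/levelB before reaching root.
res, err := root.Search(bleve.NewSearchRequest(bleve.NewMatchQuery("distributed search")))
_, _ = res, err
```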
…2004) This change is intended to save compute in determining "vectorIdsToExclude" during search(..). Requires: - blevesearch/scorch_segment_api#40 - blevesearch/zapx#229
- There are chances that the merger doesn't see any eligible segments to merge in the current iteration, which causes the tasks list to be empty. In this situation the merger, which didn't update the root snapshot, would still notify the persister.
- If the persister was napping at that point, and mutations were coming into the system (so the root snapshot would be updated by the introducer), the persister would be awoken and start flushing the in-memory segments to disk.
- A better behaviour is to not let the merger notify the persister when there is no change in the root snapshot. This lets the persister perform healthier in-memory merging of segments before persisting to disk, thereby helping control the number of IO ops scorch does.

Some numbers from local testing (~4.18M dataset with lorem ipsum content)

With patch
```
"num_bytes_used_ram": 186097560,
"test_bucket:test_bucket._default.test:num_file_merge_ops": 7,
"test_bucket:test_bucket._default.test:num_mutations_to_index": 0,
"test_bucket:test_bucket._default.test:num_persister_nap_merger_break": 4,
"test_bucket:test_bucket._default.test:num_persister_nap_pause_completed": 80,
```

Without patch
```
"num_bytes_used_ram": 265234328,
"test_bucket:test_bucket._default.test:num_file_merge_ops": 10,
"test_bucket:test_bucket._default.test:num_mutations_to_index": 0,
"test_bucket:test_bucket._default.test:num_persister_nap_merger_break": 69,
"test_bucket:test_bucket._default.test:num_persister_nap_pause_completed": 45,
```

---------
Co-authored-by: Abhinav Dangeti <[email protected]>
Includes: * 9e2514f Aditi Ahuja | MB-60943 - Add a coarse quantiser to the IVF indexes (https://github.com/blevesearch/zapx/pulls/225)
…#2006) Authored-by: @Thejas-bhat Original: #2003
- There are chances that the merger doesn't see any eligible segments to merge in the current iteration, which causes the tasks list to be empty. In this situation the merger, which didn't update the root snapshot, would still notify the persister.
- If the persister was napping at that point, and mutations were coming into the system (so the root snapshot would be updated by the introducer), the persister would be awoken and start flushing the in-memory segments to disk.
- A better behaviour is to let the persister nap for the remaining duration and only then do some work. This also lets the merger wait for the notification reply (an interrupt-style wait), rather than doing the more expensive thing of continuing to work (which the earlier commits of this PR were doing).

Some numbers from local testing (~4.18M dataset with lorem ipsum content)

With patch
```
"num_bytes_used_ram": 224100280,
"num_files_on_disk": 34,
"test_bucket:test_bucket._default.travel:num_file_merge_ops": 6,
"test_bucket:test_bucket._default.travel:num_file_merge_plan": 0,
"test_bucket:test_bucket._default.travel:num_files_on_disk": 34,
"test_bucket:test_bucket._default.travel:num_mem_merge_ops": 70,
"test_bucket:test_bucket._default.travel:num_persister_nap_merger_break": 4,
"test_bucket:test_bucket._default.travel:num_persister_nap_pause_completed": 69,
"test_bucket:test_bucket._default.travel:num_root_filesegments": 16,
"test_bucket:test_bucket._default.travel:num_root_memorysegments": 0,
"TotFileMergePlan": 45,
"TotFileMergePlanErr": 0,
"TotFileMergePlanNone": 1,
"TotFileMergePlanOk": 44,
```

Without patch
```
"num_bytes_used_ram": 252726152,
"num_files_on_disk": 45,
"test_bucket:test_bucket._default.travel:num_file_merge_ops": 11,
"test_bucket:test_bucket._default.travel:num_file_merge_plan": 0,
"test_bucket:test_bucket._default.travel:num_files_on_disk": 45,
"test_bucket:test_bucket._default.travel:num_mem_merge_ops": 129,
"test_bucket:test_bucket._default.travel:num_persister_nap_merger_break": 85,
"test_bucket:test_bucket._default.travel:num_persister_nap_pause_completed": 44,
"test_bucket:test_bucket._default.travel:num_root_filesegments": 33,
"test_bucket:test_bucket._default.travel:num_root_memorysegments": 0,
"TotFileMergePlan": 96,
"TotFileMergePlanErr": 0,
"TotFileMergePlanNone": 1,
"TotFileMergePlanOk": 95,
```

---------
Co-authored-by: Thejas-bhat <[email protected]>
Signed-off-by: mountcount <[email protected]>
* 91a5e17 Abhi Dangeti | Revert "MB-60943 - Add a coarse quantiser to the IVF indexes" (blevesearch/zapx#232)
Includes: * eeb2336 Likith B | MB-61029: Caching Vec To DocID Map (blevesearch/zapx#231) * b2384fc Rahul Rampure | minor optimizations and bug fixes (blevesearch/zapx#233) * b56abea Thejas-bhat | MB-61029: Deferring the closing of vector index (blevesearch/zapx#226)
- Added a new field type called vector_base64. - Acts similarly to vector in most cases. - When a new document arrives in the bleve layer, during the parsing of all its fields in processProperty, if the field mapping type is vector_base64, then its value is decoded into a vector field and processed like a vector. - The standard golang base64 library is used for the decode operation. --------- Co-authored-by: Abhinav Dangeti <[email protected]>
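A small client-side sketch of producing such a value with the standard library; the little-endian float32 byte layout is an assumption about what the decoder expects, not something stated in the commit message:

```go
import (
	"encoding/base64"
	"encoding/binary"
	"math"
)

// encodeVectorBase64 packs a []float32 into bytes and base64-encodes it,
// assuming a little-endian float32 layout on the wire.
func encodeVectorBase64(vec []float32) string {
	buf := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(v))
	}
	return base64.StdEncoding.EncodeToString(buf)
}
```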
- Fix unit tests that were racy and failed intermittently.
- Fix the BytesRead and BytesStored unit tests to pass when using the
`vectors` tag. The following commands can be used to validate that all
UTs now pass.
- go test -tags=vectors -race ./...
- go test -race ./...
---------
Co-authored-by: Abhinav Dangeti <[email protected]>
+ Allow calling application to change min/max. + Will require downstream changes to disallow any setting greater than 2048 in the event of a mixed version cluster.
+ Test(s) in this file hang on my computer and it seems they've started hanging on the workflow jobs as well.
) + Omitting the score from the document hits when score:"none" caused an older SDK, which was expecting to see the attribute regardless, to panic. + This commit reverts a portion of what was proposed with #1930.
blevesearch/go-faiss: * 693b06a Rahul Rampure | MB-61650: Release IDSelectorBatch's batchSelector to avoid memory leak
Includes: * d8f2ddf Abhi Dangeti | MB-60697: Windows requires nprobe to be of 'long long' type * 9bb55f8 Abhi Dangeti | Retain IDSelector's Delete() API for complete-ness
Includes: * 6fe4e6b Aditi Ahuja | MB-60943 - Reduce number of centroids for IVF indexes. * 8de5651 Rahul Rampure | add map capacity
…plan computation (#2002) The existing merge policy uses a segment's total and live doc counts as the measure for determining merge tasks. A doc may contain multiple vector fields (even chunked vectors) of varying vector dimensions, which usually means that the index size will be greater than the docs size by orders of magnitude. Thus, doc count as the merge-policy measure can easily lead to the formation of huge segments, which is a problem during merge and search time. This PR therefore utilizes segment fileSize as an additional limiting check, to ensure that we don't end up creating huge segments. Example: 2M docs, each with a vector field of 2048 dimensions: size := 2000000 * 2048 * 4 bytes ≈ 16GB
- Refactor the presearch code path to make it more generic and extensible. - Add an ExtractFields API for obtaining the set of fields applicable to a generic query. - Add support for a new API to set the index mapping for an alias. This is useful when an alias contains partitions of the same index, as the index mapping would be consistent across all indexes and can be inferred directly from the alias.
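A rough sketch of the alias-mapping usage described above; the method name `SetIndexMapping` is inferred from the description and its exact signature should be verified against the release:

```go
// All partitions share one mapping, so it can be attached to the alias itself
// (method name inferred from the description above; treat as an assumption).
alias := bleve.NewIndexAlias(part1, part2, part3) // part1..part3: open bleve.Index partitions
if err := alias.SetIndexMapping(indexMapping); err != nil {
	log.Fatalf("failed to set alias mapping: %v", err)
}
```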
…2117) + Setting the default to 0 on account of the panics caught in the MB. + Firstly, refactor `FieldTFRCacheThreshold` to `fieldTFRCacheThreshold` for _some_ naming consistency here. + This threshold can be used to toggle recycling of TermFieldReaders on/off. + Couchbase users will have the ability to provide this setting within a JSON payload, which, when decoded into a `map[string]interface{}`, will need to be interpreted as a `float64`. + Should library users set it as an `int` within the index config - we'll honor that setting as well.
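A minimal sketch of the number-interpretation rule spelled out above: JSON-decoded config arrives as `map[string]interface{}` with `float64` values, while Go callers may store an `int` directly. The config key name used here mirrors the refactored field name but is illustrative:

```go
// interpretTFRCacheThreshold accepts either a float64 (JSON-decoded payload)
// or an int (set directly by a library user) and normalizes it to an int.
func interpretTFRCacheThreshold(config map[string]interface{}) (int, bool) {
	v, ok := config["fieldTFRCacheThreshold"] // key name illustrative
	if !ok {
		return 0, false
	}
	switch t := v.(type) {
	case float64:
		return int(t), true
	case int:
		return t, true
	default:
		return 0, false
	}
}
```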
MB-64604: Merge remote-tracking branch 'origin/trinity-couchbase' into 'master'
go-faiss: * 2127bb0 Likith B | MB-64513: Removing modification of max codes based on filtered document size
- Allow setting up `synonym_sources` in the index mapping, which will follow their own ingest pipeline, ingesting special synonym definitions using the IndexSynonym() API. - A `synonym_source` can be set like an analyzer on a field mapping and can be set as a default option at the document mapping or the index mapping level. - Each `synonym_source` can have its own analyzer, making it flexible enough to stay compatible with the language analyzer specified for its corresponding mapping. - Compatibility with every term-based query where the term gets expanded to include its synonyms at query time. - Dependencies: - blevesearch/[email protected] - blevesearch/bleve_index_api#57 - blevesearch/[email protected] - blevesearch/scorch_segment_api#46 - blevesearch/[email protected] - blevesearch/vellum#22 - blevesearch/zapx@v16@latest - blevesearch/zapx#268 --------- Co-authored-by: Abhinav Dangeti <[email protected]>
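A loose sketch of how this wires together; the identifiers below (`AddSynonymSource`, `SynonymSource`, the option keys) are inferred from the description and should be treated as hypothetical rather than the exact API:

```go
// Hypothetical names throughout - see the note above.
imap := bleve.NewIndexMapping()
_ = imap.AddSynonymSource("en-synonyms", map[string]interface{}{
	"collection": "en-collection", // hypothetical option: collection holding synonym definitions
	"analyzer":   "en",            // each synonym source may carry its own analyzer
})

fm := bleve.NewTextFieldMapping()
fm.Analyzer = "en"
fm.SynonymSource = "en-synonyms" // assigned like an analyzer on the field mapping
```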
**What this PR does:** - Adds a test file for the Snowball Turkish stemmer. - Tests various Turkish words to ensure proper stemming. **Why this is useful:** - Improves test coverage for Turkish language support in Bleve. - Ensures the Snowball stemmer works as expected for Turkish words. **Notes:** - This PR only adds a test file and does not modify any existing functionality.
1. Changed the weight of a kNN query to 1 to allow the boost value to kick in when computing the query score. Since only the kNN scorer's weight is changed, this will not impact how boosting is calculated for other types of queries. To reduce the kNN score relative to the FTS query score, set boost to <1. 2. Added a unit test which demonstrates boost increasing scores even for pure kNN queries (kNN + a match-none query).
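A short sketch (requires a build with the `vectors` tag); `queryVector` and `idx` are assumed to already exist, and the `AddKNN` signature shown should be checked against the bleve version in use:

```go
// Pair a kNN request with a text query. With the kNN weight fixed at 1,
// boost < 1 de-emphasizes the vector score relative to the FTS score.
req := bleve.NewSearchRequest(bleve.NewMatchQuery("mountain cabin"))
req.AddKNN("embedding", queryVector, 5 /* k */, 0.5 /* boost */)
res, err := idx.Search(req)
if err != nil {
	log.Fatal(err)
}
fmt.Println(res.Hits)
```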
Introducing support for BM25 scoring. Key stats necessary for the scoring:
- fieldLength - the number of terms in a field within a doc.
- avgDocLength - the average number of terms in a field across all the docs in the index.
- totalDocs - total number of docs in an index.

Introduces a mechanism to maintain consistent scoring when the index is partitioned as a `bleve.IndexAlias`. This is achieved using the existing preSearch mechanism, where the first phase of the search fetches the above-mentioned stats, aggregates them and redistributes them back to the bleve indexes, which use them while calculating the score for a hit. In order to enable this global scoring mechanism, the user needs to set the `context` argument of SearchInContext with: `ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)`

Implementation-wise, the user needs to explicitly set BM25 as the scoring mechanism at the `indexMapping.ScoringModel` level to actually use it. This parameter is a global setting, i.e. when performing a search on multiple fields, all the fields are scored with the same scoring model. The storage layer exposes an API which returns the number of terms in a field's term dictionary, which is used to compute the `avgDocLength`. At the indexing layer, we check if the queried field supports BM25 scoring and if consistent scoring is availed. This is followed by fetching the stats either from the local bleve index or from the context (in the case where consistent scoring is availed) to compute the actual score.

Note: The scoring is highly dependent on the size of an individual bleve index's termDictionary (specific to a field), so there can be some discrepancies, especially given that each index is further composed of multiple 'segments'. However, in large-scale use cases these discrepancies can be quite small and don't affect the order of the doc hits - in which case the user may choose to avoid this altogether.

---------
Co-authored-by: Aditi Ahuja <[email protected]> Co-authored-by: Abhinav Dangeti <[email protected]>
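A compact sketch of opting in, using the context key named above; `part1`/`part2` are assumed open indexes built with this mapping, and the literal `"bm25"` for ScoringModel is an assumption (check the constant exported by the release you use):

```go
// imports: context, github.com/blevesearch/bleve/v2, github.com/blevesearch/bleve/v2/search

// Mapping-level opt-in to BM25 (string value assumed).
imap := bleve.NewIndexMapping()
imap.ScoringModel = "bm25"

// Alias over partitions built with the above mapping.
alias := bleve.NewIndexAlias(part1, part2)

// Global scoring: preSearch aggregates totalDocs/avgDocLength across partitions.
ctx := context.WithValue(context.Background(), search.SearchTypeKey, search.GlobalScoring)
req := bleve.NewSearchRequest(bleve.NewMatchQuery("consistent scoring"))
res, err := alias.SearchInContext(ctx, req)
_, _ = res, err
```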
…examples (#2132) - Add docs/synonyms.md to provide an overview on synonym search.
Co-authored-by: Abhi Dangeti <[email protected]>
#2103) - Previously, the `PartialMatch` field was returned for every hit, but this caused confusion in complex queries involving disjunctions and match queries. As a result, `PartialMatch` has moved into the score explanation, where each subquery's explanation includes its own `PartialMatch`. This field is set only if the query uses the DisjunctionSearcher or scorer.
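Callers that previously read `PartialMatch` off each hit now need to request explanations; a minimal sketch, assuming `q1`/`q2` are existing queries and that the flag appears on the disjunction subquery's explanation node:

```go
req := bleve.NewSearchRequest(bleve.NewDisjunctionQuery(q1, q2))
req.Explain = true // PartialMatch is now carried in the explanation tree
res, err := idx.Search(req)
if err != nil {
	log.Fatal(err)
}
for _, hit := range res.Hits {
	// Walk hit.Expl and its Children to locate the disjunction subquery's
	// PartialMatch (present only when the DisjunctionSearcher/scorer ran).
	_ = hit.Expl
}
```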
- IP fields were not returned in the search response, even when stored, because they were not handled in the `LoadAndHighlightFields` API. - Requires blevesearch/bleve_index_api#60 --------- Co-authored-by: Abhinav Dangeti <[email protected]>
- Add a Scorch counter stat `TotSynonymSearches` to track the number of synonym-enabled queries received by the index. This stat will be incremented by 1 each time the FieldTermSynonymMap is set, indicating that the query will use synonyms. --------- Co-authored-by: Abhinav Dangeti <[email protected]>
* a20efc1 Aditi Ahuja | MB-64883 - Avoid redundant computation of eligible IDs (#2143) --------- Co-authored-by: Aditi Ahuja <[email protected]>
…ate registration (#2151) - The `registry` package currently `panics`, crashing `Bleve`, when a duplicate component is registered. This behavior makes it difficult to handle duplicate registrations gracefully and requires using a recover statement to prevent the application from crashing. - To improve error handling and maintainability, the refactored API changes `Register<registry-component>` to return an error instead of panicking when a duplicate registration is detected. This change allows developers to check whether a `<registry-component>` is already registered and handle it accordingly without relying on recover. - Addresses #2125
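A minimal sketch of the error-returning flow, using `RegisterAnalyzer` as a stand-in for any `Register<registry-component>` and a placeholder constructor; the post-refactor signatures should be confirmed against the release:

```go
// imports: log, github.com/blevesearch/bleve/v2/registry
// myAnalyzerConstructor is a placeholder for a registry.AnalyzerConstructor.
if err := registry.RegisterAnalyzer("my_analyzer", myAnalyzerConstructor); err != nil {
	// Duplicate registration is now an error to handle, not a panic to recover from.
	log.Printf("skipping analyzer registration: %v", err)
}
```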
* 355d2eb Abhi Dangeti | Use tags for bleve_index_api, scorch_segment_api * 6ba256e Abhi Dangeti | [faiss_vector_posting] Refactor searchWithFilter for readability * 8187adc Abhi Dangeti | [section_faiss_vector_index] Minor refactor removing unnecessary vars
`index_bgthreads_active` indicates whether the background routines that maintain the index are busy doing some work. This means that the index hasn't converged to a steady state and there could be potential file segment merges or in-memory segment flushes still remaining. This stat is beneficial for the application layer to get better insight as to whether the index is doing some background work (which can have implications on the system's resource utilisation) or not.
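A small sketch of polling this stat from the application layer; where exactly `index_bgthreads_active` sits inside `StatsMap()` is an assumption here:

```go
// imports: fmt; idx is an open bleve.Index backed by scorch.
sm := idx.StatsMap()
if indexStats, ok := sm["index"].(map[string]interface{}); ok { // nesting assumed
	if active, found := indexStats["index_bgthreads_active"]; found {
		fmt.Println("index background threads active:", active)
	}
}
```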
The http/ dir used _only_ by bleve-explorer has been relocated to blevesearch/bleve-explorer with: blevesearch/bleve-explorer#24 Fixes: #2155
…at qualify for a type mapping (#2157) - This stat tracks the number of mutations processed by the Bleve index that are searchable by the user. It helps monitor the progress of the index-building process. - The stat is incremented for each mutation processed by the Bleve index that is associated with a valid type mapping. --------- Co-authored-by: Abhinav Dangeti <[email protected]>