@cmrajan
Owner

@cmrajan cmrajan commented Mar 7, 2025

No description provided.

moshaad7 and others added 30 commits December 21, 2023 18:14
Accented letters are normalized only when the input is longer than
5 characters, a condition that neither `guía` nor `fría`, for example,
meets.
The solution is to always run the accented-character normalization by
moving it to a separate file, just as is done in the German analyzer.

Fixes: #1956
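
A minimal sketch of length-independent accent folding, for illustration only. The `accentFold` table and the `normalizeAccents` helper are hypothetical names, not the analyzer's actual code, and the real mapping table may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// accentFold maps accented Spanish vowels to their plain forms.
// Hypothetical table for illustration; "ñ" is deliberately left alone.
var accentFold = strings.NewReplacer(
	"á", "a", "é", "e", "í", "i", "ó", "o", "ú", "u",
)

// normalizeAccents folds accents regardless of the token length,
// so short words such as "guía" are handled too.
func normalizeAccents(token string) string {
	return accentFold.Replace(token)
}

func main() {
	for _, w := range []string{"guía", "fría", "canción"} {
		fmt.Println(w, "->", normalizeAccents(w))
	}
}
```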
Co-authored-by: Rahul Rampure <[email protected]>
Co-authored-by: Aditi Ahuja <[email protected]>
Co-authored-by: Likith B <[email protected]>
Co-authored-by: Mohd Shaad Khan <[email protected]>
Co-authored-by: Thejas-bhat <[email protected]>
- Instead of merging all the PreSearch results in one shot at the main
coordinator node of an alias tree, merge them incrementally at each
level of the tree. This balances the reduction process across all the
indexes in a distributed Bleve index, leading to a more even memory
distribution.
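
The idea of reducing at each level can be sketched with a toy tree. The `node` type and its `reduce` method are hypothetical, and the `map[string]int` payload is only a stand-in for real PreSearch results:

```go
package main

import "fmt"

// node is a hypothetical stand-in for one member of an alias tree:
// a leaf models a single index, an inner node models an alias.
type node struct {
	local    map[string]int // presearch output of a leaf (illustrative shape)
	children []*node
}

// reduce merges results bottom-up, so each alias only ever holds the
// already-merged results of its direct children.
func (n *node) reduce() map[string]int {
	merged := make(map[string]int, len(n.local))
	for k, v := range n.local {
		merged[k] += v
	}
	for _, child := range n.children {
		for k, v := range child.reduce() {
			merged[k] += v
		}
	}
	return merged
}

func main() {
	leaf := func(count int) *node {
		return &node{local: map[string]int{"docs": count}}
	}
	root := &node{children: []*node{
		{children: []*node{leaf(3), leaf(4)}},
		{children: []*node{leaf(5)}},
	}}
	fmt.Println(root.reduce()["docs"])
}
```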

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
…2004)

This change is intended to save compute in determining
"vectorIdsToExclude" during search(..).

Requires:
- blevesearch/scorch_segment_api#40
- blevesearch/zapx#229
- The merger may not see any segments eligible for merging in the
current iteration, which leaves the tasks list empty. In this situation
the merger, which didn't update the root snapshot, would still notify
the persister.
- If the persister was napping at that point, and mutations were coming
into the system (so the introducer was updating the root snapshot), the
persister would be woken up and start flushing the in-memory segments
to disk.
- Perhaps a better behaviour would be to not let the merger notify the
persister when there is no change in the root snapshot. This would let
the persister perform healthier in-memory merging of the segments
before persisting to disk, thereby helping control the number of IO ops
scorch does.

Some numbers on local testing (~4.18M dataset with lorem ipsum content)

With patch
```
  "num_bytes_used_ram": 186097560,
  "test_bucket:test_bucket._default.test:num_file_merge_ops": 7,
  "test_bucket:test_bucket._default.test:num_mutations_to_index": 0,
  "test_bucket:test_bucket._default.test:num_persister_nap_merger_break": 4,
  "test_bucket:test_bucket._default.test:num_persister_nap_pause_completed": 80,
```

Without patch
```
  "num_bytes_used_ram": 265234328,
  "test_bucket:test_bucket._default.test:num_file_merge_ops": 10,
  "test_bucket:test_bucket._default.test:num_mutations_to_index": 0,
  "test_bucket:test_bucket._default.test:num_persister_nap_merger_break": 69,
  "test_bucket:test_bucket._default.test:num_persister_nap_pause_completed": 45,
```

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
Includes:
* 9e2514f Aditi Ahuja | MB-60943 - Add a coarse quantiser to the IVF
indexes (https://github.com/blevesearch/zapx/pulls/225)
…#2006)

Authored-by: @Thejas-bhat 
Original: #2003

- The merger may not see any segments eligible for merging in the
current iteration, which leaves the tasks list empty. In this situation
the merger, which didn't update the root snapshot, would still notify
the persister.
- If the persister was napping at that point, and mutations were coming
into the system (so the introducer was updating the root snapshot), the
persister would be woken up and start flushing the in-memory segments
to disk.
- Perhaps a better behaviour would be to let the persister nap for the
remaining duration and only then let it do some work. This also lets
the merger wait for the notification reply (an interrupt-style wait)
rather than doing something more expensive, namely letting the merger
continue to do work (which the earlier commits of this PR were doing).

Some numbers on local testing (~4.18M dataset with lorem ipsum content)

With patch
```
"num_bytes_used_ram": 224100280,
"num_files_on_disk": 34,
"test_bucket:test_bucket._default.travel:num_file_merge_ops": 6,
"test_bucket:test_bucket._default.travel:num_file_merge_plan": 0,
"test_bucket:test_bucket._default.travel:num_files_on_disk": 34,
"test_bucket:test_bucket._default.travel:num_mem_merge_ops": 70,
"test_bucket:test_bucket._default.travel:num_persister_nap_merger_break": 4,
"test_bucket:test_bucket._default.travel:num_persister_nap_pause_completed": 69,
"test_bucket:test_bucket._default.travel:num_root_filesegments": 16,
"test_bucket:test_bucket._default.travel:num_root_memorysegments": 0,

"TotFileMergePlan": 45,
"TotFileMergePlanErr": 0,
"TotFileMergePlanNone": 1,
"TotFileMergePlanOk": 44,
```

Without patch
```
"num_bytes_used_ram": 252726152,
"num_files_on_disk": 45,
"test_bucket:test_bucket._default.travel:num_file_merge_ops": 11,
"test_bucket:test_bucket._default.travel:num_file_merge_plan": 0,
"test_bucket:test_bucket._default.travel:num_files_on_disk": 45,
"test_bucket:test_bucket._default.travel:num_mem_merge_ops": 129,
"test_bucket:test_bucket._default.travel:num_persister_nap_merger_break": 85,
"test_bucket:test_bucket._default.travel:num_persister_nap_pause_completed": 44,
"test_bucket:test_bucket._default.travel:num_root_filesegments": 33,
"test_bucket:test_bucket._default.travel:num_root_memorysegments": 0,

"TotFileMergePlan": 96,
"TotFileMergePlanErr": 0,
"TotFileMergePlanNone": 1,
"TotFileMergePlanOk": 95,

```

---------

Co-authored-by: Thejas-bhat <[email protected]>
…m merger" (#2010)

This reverts commit 2d81bf0 (
#2006 ) on account of the
regression highlighted with MB-61447.
* 91a5e17 Abhi Dangeti | Revert "MB-60943 - Add a coarse quantiser to
the IVF indexes" (blevesearch/zapx#232)
Includes:
* eeb2336 Likith B | MB-61029: Caching Vec To DocID Map
(blevesearch/zapx#231)
* b2384fc Rahul Rampure | minor optimizations and bug fixes
(blevesearch/zapx#233)
* b56abea Thejas-bhat | MB-61029: Deferring the closing of vector index
(blevesearch/zapx#226)
- Added a new field type called `vector_base64`.
 - Acts similar to `vector` in most cases.
- When a new document arrives in the bleve layer, during the parsing of
all its fields in `processProperty`, if the field mapping type is
`vector_base64`, its value is decoded into a vector field and
processed like a vector.
 - The standard golang base64 library is used for the decode operation.
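
A minimal sketch of such a decode step, assuming the payload is a sequence of little-endian `float32` values encoded with the standard base64 alphabet. The byte layout and the helper names `decodeVector` and `encodeVector` are assumptions for illustration, not the library's API:

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"math"
)

// decodeVector turns a base64 string into a float32 vector.
// Assumption: 4 bytes per dimension, little-endian.
func decodeVector(encoded string) ([]float32, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	if len(raw)%4 != 0 {
		return nil, fmt.Errorf("decoded length %d is not a multiple of 4", len(raw))
	}
	vec := make([]float32, len(raw)/4)
	for i := range vec {
		bits := binary.LittleEndian.Uint32(raw[i*4:])
		vec[i] = math.Float32frombits(bits)
	}
	return vec, nil
}

// encodeVector is the inverse, useful for building test payloads.
func encodeVector(vec []float32) string {
	raw := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(raw[i*4:], math.Float32bits(v))
	}
	return base64.StdEncoding.EncodeToString(raw)
}

func main() {
	enc := encodeVector([]float32{0.5, -1.25, 3})
	vec, err := decodeVector(enc)
	fmt.Println(enc, vec, err)
}
```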

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
- Fix unit tests that were racy and failed intermittently.
- Fix the BytesRead and BytesStored unit tests to pass when using the
`vectors` tag. The following commands can be used to validate that all
UTs now pass.
    - go test -tags=vectors -race ./...
    - go test -race ./...

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
+ Allow calling application to change min/max.
+ Will require downstream changes to disallow any setting greater
  than 2048 in the event of a mixed version cluster.
+ Test(s) in this file hang on my computer and it seems they've started
hanging on the workflow jobs as well.
)

+ Omitting the score from the document hits when score:"none" causes an
older SDK to panic that was expecting to see the attribute regardless.
+ This commit reverts a portion of what was proposed with
#1930.
blevesearch/go-faiss:
* 693b06a Rahul Rampure | MB-61650: Release IDSelectorBatch's
batchSelector to avoid memory leak
Includes:
* d8f2ddf Abhi Dangeti | MB-60697: Windows requires nprobe to be of
'long long' type
* 9bb55f8 Abhi Dangeti | Retain IDSelector's Delete() API for
complete-ness
Includes:
* 6fe4e6b Aditi Ahuja | MB-60943 - Reduce number of centroids for IVF
indexes.
* 8de5651 Rahul Rampure | add map capacity
…plan computation (#2002)

The existing merge policy uses a segment's total and live doc counts as
the measure for determining the merge tasks.
A doc may contain multiple vector fields (even chunked vectors) of
varying vector dimensions, which usually means that the index size will
be greater than the docs' size by orders of magnitude.

Thus, using doc count as the merge policy measure can easily lead to
the formation of huge segments, which is a problem at both merge and
search time.

Thus, this PR intends to utilize segment fileSize as an additional
limiting check, to ensure that we don't end up creating huge segments.


example:
2M docs, each with a vector field of 2048 dimensions
size := 2000000 * 2048 * 4 bytes ≈ 16GB
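
The arithmetic above, together with a size-based guard, can be sketched as follows. `estimateVectorBytes` and `withinLimits` are hypothetical helpers and the limits passed in `main` are made-up values, not the merge policy's real options:

```go
package main

import "fmt"

// float32Bytes is the storage size of one vector dimension.
const float32Bytes = 4

// estimateVectorBytes returns the raw size of the vector data alone.
func estimateVectorBytes(numDocs, dims int64) int64 {
	return numDocs * dims * float32Bytes
}

// withinLimits applies both checks: the existing doc-count limit and
// the additional file-size limit.
func withinLimits(docs, sizeBytes, maxDocs, maxBytes int64) bool {
	return docs <= maxDocs && sizeBytes <= maxBytes
}

func main() {
	size := estimateVectorBytes(2_000_000, 2048)
	fmt.Printf("%d bytes (%.2f GB)\n", size, float64(size)/1e9)

	// A doc-count-only policy would accept this segment; the size check rejects it.
	fmt.Println(withinLimits(2_000_000, size, 5_000_000, 5_000_000_000))
}
```

With these inputs the raw vector data alone comes to 16,384,000,000 bytes, so a segment that looks modest by doc count can still exceed a byte-based cap.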
CascadingRadium and others added 29 commits December 17, 2024 14:22
- Refactor the presearch code path to make it more generic and
extensible.
- Add an ExtractFields API for obtaining the set of fields applicable to
a generic query.
- Add support for a new API to set the index mapping for an alias.
This is useful when an alias contains partitions of the same index, as
the index mapping would be consistent across all indexes and can be
inferred directly from the alias.
…2117)

+ Setting the default to 0 on account of the panics caught in the MB.
+ Firstly, rename `FieldTFRCacheThreshold` to `fieldTFRCacheThreshold`
for _some_ naming consistency here.
+ This threshold can be used to toggle recycling of TermFieldReaders
on/off.
+ Couchbase users will be able to provide this setting within a JSON
payload, which, when unmarshalled into a `map[string]interface{}`,
arrives as a `float64` and needs to be interpreted as such.
+ Should library users set it as an `int` within the index config -
we'll honor that setting as well.
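
A minimal sketch of accepting both representations. The helper name `thresholdFromConfig` is hypothetical; only the config key and the default of 0 come from the description above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// thresholdFromConfig reads "fieldTFRCacheThreshold" from a config map.
// JSON numbers decode to float64 in a map[string]interface{}, while
// library users may store a plain int, so both are accepted.
// The second return value reports whether the key held a usable number.
func thresholdFromConfig(config map[string]interface{}) (int, bool) {
	switch v := config["fieldTFRCacheThreshold"].(type) {
	case float64:
		return int(v), true
	case int:
		return v, true
	default:
		return 0, false // fall back to the default of 0
	}
}

func main() {
	var fromJSON map[string]interface{}
	if err := json.Unmarshal([]byte(`{"fieldTFRCacheThreshold": 10}`), &fromJSON); err != nil {
		panic(err)
	}
	fmt.Println(thresholdFromConfig(fromJSON))
	fmt.Println(thresholdFromConfig(map[string]interface{}{"fieldTFRCacheThreshold": 5}))
	fmt.Println(thresholdFromConfig(map[string]interface{}{}))
}
```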
…o 'master'

|\
| * 5c53634 Abhi Dangeti | MB-64604: Remove unnecessary second map lookup (#2121)
| * 78cf789 Abhi Dangeti | MB-64604: Fix interpreting scorch config: "fieldTFRCacheThreshold" (#2117)
| * 7d627b9 Aditi Ahuja | MB-64360 - Upgrade zapx v16 (#2107)
MB-64604: Merge remote-tracking branch 'origin/trinity-couchbase' into 'master'
go-faiss:
* 2127bb0 Likith B | MB-64513: Removing modification of max codes based
on filtered document size
- Allow setting up `synonym_sources` in the index mapping, which will
follow its own ingest pipeline, ingesting special synonym definitions
using the IndexSynonym API.
- A `synonym_source` can be set like an analyzer to a field mapping and
can be set as a default option at the document mapping or the index
mapping level.
- Each `synonym_source` can have its own analyzer, making it flexible to
allow for compatibility with the language analyzer specified for its
corresponding mapping.
- Compatible with every term-based query: the term is expanded to
include its synonyms at query time.
- Dependencies:
- blevesearch/[email protected] -
blevesearch/bleve_index_api#57
- blevesearch/[email protected] -
blevesearch/scorch_segment_api#46
- blevesearch/[email protected] -
blevesearch/vellum#22
- blevesearch/zapx@v16@latest -
blevesearch/zapx#268

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
**What this PR does:**
- Adds a test file for the Snowball Turkish stemmer.
- Tests various Turkish words to ensure proper stemming.

**Why this is useful:**
- Improves test coverage for Turkish language support in Bleve.
- Ensures the Snowball stemmer works as expected for Turkish words.

**Notes:**
- This PR only adds a test file and does not modify any existing
functionality.
1. Changed the weight of a kNN query to 1 so that the boost value takes
effect when computing the query score.
Since only the kNN scorer's weight is changed, this does not affect how
boosting is calculated for other types of queries.

To reduce the kNN score relative to the FTS query score, set boost to a
value below 1.

2. Added a unit test which demonstrates that boost increases scores even
for pure kNN queries (kNN + match none query).
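
A toy model of the effect, not bleve's scorer: it assumes the kNN contribution is `boost * weight * similarity` with the weight fixed at 1, and that a hybrid hit sums the full-text and kNN contributions. Both function names are hypothetical:

```go
package main

import "fmt"

// knnScore models a kNN contribution whose query weight is fixed at 1,
// so the boost is the only multiplier applied to the similarity.
func knnScore(similarity, boost float64) float64 {
	const weight = 1.0
	return boost * weight * similarity
}

// hybridScore adds the full-text and kNN contributions (illustrative only).
func hybridScore(ftsScore, similarity, knnBoost float64) float64 {
	return ftsScore + knnScore(similarity, knnBoost)
}

func main() {
	fmt.Println(hybridScore(2.0, 0.8, 1.0)) // kNN at full strength
	fmt.Println(hybridScore(2.0, 0.8, 0.5)) // kNN damped relative to FTS
}
```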
Introducing support for BM25 scoring 

Key stats necessary for the scoring
- fieldLength - the number of terms in a field within a doc.
- avgDocLength - the average number of terms in a field across all the
docs in the index.
- totalDocs - total number of docs in an index.

Introduces a mechanism to maintain consistent scoring in a situation
where the index is partitioned as a `bleve.IndexAlias`. This is achieved
using the existing preSearch mechanism where the first phase of the
entire search involves fetching the above mentioned stats, aggregating
them and redistributing back to the bleve indexes which would use them
while calculating the score for a hit. In order to enable this global
scoring mechanism, the user needs to set the `context` argument of the
SearchInContext with:
`ctx = context.WithValue(ctx, search.SearchTypeKey,
search.GlobalScoring)`

Implementation-wise, the user needs to explicitly set BM25 as the
scoring mechanism at the `indexMapping.ScoringModel` level to actually
use it. This parameter is a global setting, i.e. when performing a
search on multiple fields, all the fields are scored with the same
scoring model.
The storage layer exposes an API which returns the number of terms in a
field's term dictionary, which is used to compute `avgDocLength`. At
the indexing layer, we check whether the queried field supports BM25
scoring and whether consistent scoring is requested. The stats are then
fetched either from the local bleve index or from the context (when
consistent scoring is requested) to compute the actual score.
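
To show how the three stats feed into a score, here is the textbook BM25 formula as a sketch. The constants `k1 = 1.2` and `b = 0.75` and the IDF form are the conventional choices and are assumptions here; the values and exact formula used by this PR may differ:

```go
package main

import (
	"fmt"
	"math"
)

// Assumed constants: conventional BM25 defaults, not necessarily bleve's.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25 scores one term for one document.
//   tf           - occurrences of the term in the field
//   fieldLength  - number of terms in the field within the doc
//   avgDocLength - average number of terms in the field across all docs
//   docFreq      - number of docs containing the term
//   totalDocs    - total number of docs in the index
func bm25(tf, fieldLength, avgDocLength float64, docFreq, totalDocs int) float64 {
	n, df := float64(totalDocs), float64(docFreq)
	idf := math.Log(1 + (n-df+0.5)/(df+0.5))
	lengthNorm := 1 - b + b*fieldLength/avgDocLength
	return idf * tf * (k1 + 1) / (tf + k1*lengthNorm)
}

func main() {
	// Same term frequency, different field lengths.
	fmt.Println(bm25(2, 50, 100, 10, 1000))
	fmt.Println(bm25(2, 200, 100, 10, 1000))
}
```

Because `avgDocLength` and `totalDocs` appear directly in the formula, partitions that compute them locally can score the same hit differently, which is what the global scoring phase is meant to even out.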


Note: The scoring depends heavily on the size of an individual bleve
index's termDictionary (specific to a field), so there can be some
discrepancies, especially given that each index is further composed of
multiple 'segments'. However, in large-scale use cases these
discrepancies can be quite small and may not affect the order of the
doc hits, in which case the user may choose to skip global scoring
altogether.

---------

Co-authored-by: Aditi Ahuja <[email protected]>
Co-authored-by: Abhinav Dangeti <[email protected]>
…examples (#2132)

- Add docs/synonyms.md to provide an overview on synonym search.
#2103)

- Previously, the `PartialMatch` field was returned for every hit, but
this caused confusion in complex queries involving disjunctions and
match queries. As a result, we moved `PartialMatch` to the score
explanation, where each subquery's explanation will include its own
`PartialMatch`. This field is set only if the query uses the
DisjunctionSearcher or scorer.
- IP fields were not returned in the search response, even when stored,
because they were not handled in the `LoadAndHighlightFields` API.
- Requires blevesearch/bleve_index_api#60

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
- Add a Scorch counter stat `TotSynonymSearches` to track the number of
synonym-enabled queries received by the index. This stat will be
incremented by 1 each time the FieldTermSynonymMap is set, indicating
that the query will use synonyms.
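
A minimal sketch of such a counter using `sync/atomic`. The `stats` and `searchContext` types and the `setFieldTermSynonymMap` method are hypothetical stand-ins for illustration; only the stat name comes from the description above:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stats holds counters that may be updated from concurrent searches.
type stats struct {
	TotSynonymSearches uint64
}

// searchContext is a hypothetical holder for per-query synonym data.
type searchContext struct {
	fieldTermSynonymMap map[string]map[string][]string
}

// setFieldTermSynonymMap stores the map and bumps the counter by 1,
// marking the query as one that will use synonyms.
func (c *searchContext) setFieldTermSynonymMap(m map[string]map[string][]string, s *stats) {
	c.fieldTermSynonymMap = m
	atomic.AddUint64(&s.TotSynonymSearches, 1)
}

func main() {
	var s stats
	ctx := &searchContext{}
	ctx.setFieldTermSynonymMap(map[string]map[string][]string{
		"title": {"fast": {"quick", "rapid"}},
	}, &s)
	fmt.Println(atomic.LoadUint64(&s.TotSynonymSearches))
}
```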

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
|\
| * a20efc1 Aditi Ahuja | MB-64883 - Avoid redundant computation of
eligible IDs (#2143)

---------

Co-authored-by: Aditi Ahuja <[email protected]>
…ate registration (#2151)

- The `registry` package currently panics, crashing Bleve, when a
duplicate component is registered. This behavior makes it difficult to
handle duplicate registrations gracefully and requires a `recover`
statement to prevent the application from crashing.
- To improve error handling and maintainability, the refactored API
changes `Register<registry-component>` to return an error instead of
panicking when a duplicate registration is detected. This allows
developers to check whether a `<registry-component>` is already
registered and handle it accordingly without relying on `recover`.
- Addresses #2125
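
The error-returning style can be sketched as below. The `registry` type, the `constructor` signature and `ErrAlreadyRegistered` are hypothetical names for illustration and do not reproduce the package's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyRegistered is a hypothetical sentinel error for duplicates.
var ErrAlreadyRegistered = errors.New("already registered")

// constructor is a stand-in for a component constructor.
type constructor func() string

// registry maps component names to their constructors.
type registry map[string]constructor

// Register returns an error on a duplicate name instead of panicking,
// leaving the existing registration untouched.
func (r registry) Register(name string, c constructor) error {
	if _, exists := r[name]; exists {
		return fmt.Errorf("component %q: %w", name, ErrAlreadyRegistered)
	}
	r[name] = c
	return nil
}

func main() {
	r := registry{}
	build := func() string { return "standard" }

	fmt.Println(r.Register("standard", build)) // first registration succeeds
	err := r.Register("standard", build)       // duplicate is reported, not fatal
	fmt.Println(err, errors.Is(err, ErrAlreadyRegistered))
}
```

A caller can then decide whether a duplicate is fatal, ignorable, or worth logging, without wrapping the call in `recover`.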
* 355d2eb Abhi Dangeti | Use tags for bleve_index_api,
scorch_segment_api
* 6ba256e Abhi Dangeti | [faiss_vector_posting] Refactor
searchWithFilter for readability
* 8187adc Abhi Dangeti | [section_faiss_vector_index] Minor refactor
removing unnecessary vars
`index_bgthreads_active` indicates whether the background routines that
maintain the index are busy doing some work. This means that the index
hasn't converged to a steady state and there could be potential file
segment merges or in-memory segment flushes still remaining.
This stat is beneficial for the application layer to get better insight
as to whether the index is doing some background work (which can have
implications on the system's resource utilisation) or not.
The http/ dir used _only_ by bleve-explorer has been relocated to
blevesearch/bleve-explorer with:
blevesearch/bleve-explorer#24

Fixes: #2155
…at qualify for a type mapping (#2157)

- This stat tracks the number of mutations processed by the Bleve index
that are searchable by the user. It helps monitor the progress of the
index-building process.
- The stat is incremented for each mutation processed by the Bleve index
that is associated with a valid type mapping.

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
@cmrajan cmrajan merged commit 5238510 into cmrajan:master Mar 7, 2025
9 checks passed