tensorchord/VectorChord-bm25



VectorChord-BM25 is a PostgreSQL extension for the BM25 ranking algorithm. We implemented the Block-WeakAnd algorithm for BM25 ranking inside PostgreSQL. It is recommended to use it together with pg_tokenizer.rs for customized tokenization.

Getting Started

For new users, we recommend using the tensorchord/vchord-suite image to get started quickly. You can find more details in the VectorChord-images repository.

docker run \
  --name vchord-suite \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -d tensorchord/vchord-suite:pg18-latest
  # If you want to use the ghcr image, change it to `ghcr.io/tensorchord/vchord-suite:pg18-latest`.
  # If you want a specific version, use a tag like `pg17-20250414`; supported versions can be found in the support matrix.

Once everything's set up, you can connect to the database using the psql command line tool. The default username is postgres, and the default password is postgres. Here's how to connect:

psql -h localhost -p 5432 -U postgres

After connecting, run the following SQL to make sure the extension is enabled:

CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;  -- for tokenizer
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;   -- for bm25 ranking

Usage

The extension is mainly composed of three parts: the tokenizer, the bm25vector type, and the bm25vector index. The tokenizer converts text into a bm25vector; the bm25vector is similar to a sparse vector and stores vocabulary IDs and frequencies. The bm25vector index speeds up the search and ranking process.

To tokenize text, you can use the tokenize function. It takes two arguments: the text to tokenize and the tokenizer name.

Note

The tokenizer is provided by a separate extension, pg_tokenizer.rs; more details can be found here.

-- create a tokenizer
SELECT create_tokenizer('bert', $$
model = "bert_base_uncased"  # using pre-trained model
$$);
-- tokenize text with bert tokenizer
SELECT tokenize('A quick brown fox jumps over the lazy dog.', 'bert')::bm25vector;
-- Output: {1012:1, 1037:1, 1996:1, 2058:1, 2829:1, 3899:1, 4248:1, 4419:1, 13971:1, 14523:1}
-- The output is a bm25vector, 1012:1 means the word with id 1012 appears once in the text.

One thing special about the BM25 score is that it depends on global document frequency: the score of a word in a document depends on how often the word appears across all documents. To calculate the BM25 score between a bm25vector and a query, you need to have a document set first and then use the <&> operator.

-- Setup the document table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT,
    embedding bm25vector
);

INSERT INTO documents (passage) VALUES
('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.'),
('Full-text search is a technique for searching in plain-text documents or textual database fields. PostgreSQL supports this with tsvector.'),
('BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query.'),
('PostgreSQL provides many advanced features like full-text search, window functions, and more.'),
('Search and ranking in databases are important in building effective information retrieval systems.'),
('The BM25 ranking algorithm is derived from the probabilistic retrieval framework.'),
('Full-text search indexes documents to allow fast text queries. PostgreSQL supports this through its GIN and GiST indexes.'),
('The PostgreSQL community is active and regularly improves the database system.'),
('Relational databases such as PostgreSQL can handle both structured and unstructured data.'),
('Effective search ranking algorithms, such as BM25, improve search results by understanding relevance.');

Then tokenize the passages:

UPDATE documents SET embedding = tokenize(passage, 'bert');

Create the index on the bm25vector column so that we can collect the global document frequency.

CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

Now we can calculate the BM25 score between the query and the vectors. Note that the BM25 score in VectorChord-BM25 is negative, which means the more negative the score, the more relevant the document is. We intentionally make it negative so that you can use the default order by to get the most relevant documents first.

-- to_bm25query(index_name, query_vector)
-- <&> is the operator to compute the bm25 score
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('PostgreSQL', 'bert')) AS bm25_score FROM documents;

You can use ORDER BY to leverage the index and get the most relevant documents first, faster.

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('PostgreSQL', 'bert')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

More Examples

Using custom model

You can also build a custom model based on your own corpus easily.

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT,
    embedding bm25vector
);

-- create a text analyzer to generate tokens that can be used to train the model
SELECT create_text_analyzer('text_analyzer1', $$
pre_tokenizer = "unicode_segmentation"  # split texts according to the Unicode Standard Annex #29
[[character_filters]]
to_lowercase = {}                       # convert all characters to lowercase
[[character_filters]]
unicode_normalization = "nfkd"          # normalize the text to Unicode Normalization Form KD
[[token_filters]]
skip_non_alphanumeric = {}              # skip tokens in which no character is alphanumeric
[[token_filters]]
stopwords = "nltk_english"              # remove stopwords using the nltk dictionary
[[token_filters]]
stemmer = "english_porter2"             # stem tokens using the English Porter2 stemmer
$$);

-- create a model to generate embeddings from original passage
-- It'll train a model from the passage column and store the embeddings in the embedding column
SELECT create_custom_model_tokenizer_and_trigger(
    tokenizer_name => 'tokenizer1',
    model_name => 'model1',
    text_analyzer_name => 'text_analyzer1',
    table_name => 'documents',
    source_column => 'passage',
    target_column => 'embedding'
);

INSERT INTO documents (passage) VALUES 
('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.'),
('Full-text search is a technique for searching in plain-text documents or textual database fields. PostgreSQL supports this with tsvector.'),
('BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query.'),
('PostgreSQL provides many advanced features like full-text search, window functions, and more.'),
('Search and ranking in databases are important in building effective information retrieval systems.'),
('The BM25 ranking algorithm is derived from the probabilistic retrieval framework.'),
('Full-text search indexes documents to allow fast text queries. PostgreSQL supports this through its GIN and GiST indexes.'),
('The PostgreSQL community is active and regularly improves the database system.'),
('Relational databases such as PostgreSQL can handle both structured and unstructured data.'),
('Effective search ranking algorithms, such as BM25, improve search results by understanding relevance.');

CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('PostgreSQL', 'tokenizer1')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

Using jieba for Chinese text

For Chinese text, you can use the jieba pre-tokenizer to segment the text into words, and then train a custom model on the segmented words.

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT,
    embedding bm25vector
);

-- create a text analyzer which uses jieba pre-tokenizer
SELECT create_text_analyzer('text_analyzer1', $$
[pre_tokenizer.jieba]
$$);

SELECT create_custom_model_tokenizer_and_trigger(
    tokenizer_name => 'tokenizer1',
    model_name => 'model1',
    text_analyzer_name => 'text_analyzer1',
    table_name => 'documents',
    source_column => 'passage',
    target_column => 'embedding'
);

INSERT INTO documents (passage) VALUES 
('红海早过了,船在印度洋面上开驶着,但是太阳依然不饶人地迟落早起,侵占去大部分的夜。'),
('夜仿佛纸浸了油变成半透明体;它给太阳拥抱住了,分不出身来,也许是给太阳陶醉了,所以夕照晚霞褪后的夜色也带着酡红。'),
('到红消醉醒,船舱里的睡人也一身腻汗地醒来,洗了澡赶到甲板上吹海风,又是一天开始。'),
('这是七月下旬,合中国旧历的三伏,一年最热的时候。在中国热得更比常年利害,事后大家都说是兵戈之象,因为这就是民国二十六年【一九三七年】。'),
('这条法国邮船白拉日隆子爵号(VicomtedeBragelonne)正向中国开来。'),
('早晨八点多钟,冲洗过的三等舱甲板湿意未干,但已坐满了人,法国人、德国流亡出来的犹太人、印度人、安南人,不用说还有中国人。'),
('海风里早含着燥热,胖人身体给炎风吹干了,上一层汗结的盐霜,仿佛刚在巴勒斯坦的死海里洗过澡。'),
('毕竟是清晨,人的兴致还没给太阳晒萎,烘懒,说话做事都很起劲。'),
('那几个新派到安南或中国租界当警察的法国人,正围了那年轻善撒娇的犹太女人在调情。'),
('俾斯麦曾说过,法国公使大使的特点,就是一句外国话不会讲;这几位警察并不懂德文,居然传情达意,引得犹太女人格格地笑,比他们的外交官强多了。'),
('这女人的漂亮丈夫,在旁顾而乐之,因为他几天来,香烟、啤酒、柠檬水沾光了不少。'),
('红海已过,不怕热极引火,所以等一会甲板上零星果皮、纸片、瓶塞之外,香烟头定又遍处皆是。'),
('法国人的思想是有名的清楚,他的文章也明白干净,但是他的做事,无不混乱、肮脏、喧哗,但看这船上的乱糟糟。'),
('这船,倚仗人的机巧,载满人的扰攘,寄满人的希望,热闹地行着,每分钟把沾污了人气的一小方小面,还给那无情、无尽、无际的大海。');

CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('人', 'tokenizer1')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

Using lindera for Japanese text

For Japanese text, you can use a lindera model with its configuration.

It requires extra compile flags. We don't enable it by default, so you need to recompile the extension from source.

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT,
    embedding bm25vector
);

-- using lindera config to customize the tokenizer, see https://github.com/lindera/lindera
SELECT create_lindera_model('lindera_ipadic', $$
[segmenter]
mode = "normal"
  [segmenter.dictionary]
  kind = "ipadic"
[[character_filters]]
kind = "unicode_normalize"
  [character_filters.args]
  kind = "nfkc"
[[character_filters]]
kind = "japanese_iteration_mark"
  [character_filters.args]
  normalize_kanji = true
  normalize_kana = true
[[character_filters]]
kind = "mapping"
[character_filters.args.mapping]
"ใƒชใƒณใƒ‡ใƒฉ" = "Lindera"
[[token_filters]]
kind = "japanese_compound_word"
  [token_filters.args]
  kind = "ipadic"
  tags = [ "名詞,数", "名詞,接尾,助数詞" ]
  new_tag = "名詞,数"
[[token_filters]]
kind = "japanese_number"
  [token_filters.args]
  tags = [ "名詞,数" ]
[[token_filters]]
kind = "japanese_stop_tags"
  [token_filters.args]
  tags = [
  "ๆŽฅ็ถš่ฉž",
  "ๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,ๆ ผๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,ๆ ผๅŠฉ่ฉž,ไธ€่ˆฌ",
  "ๅŠฉ่ฉž,ๆ ผๅŠฉ่ฉž,ๅผ•็”จ",
  "ๅŠฉ่ฉž,ๆ ผๅŠฉ่ฉž,้€ฃ่ชž",
  "ๅŠฉ่ฉž,ไฟ‚ๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,ๅ‰ฏๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,้–“ๆŠ•ๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,ไธฆ็ซ‹ๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,็ต‚ๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,ๅ‰ฏๅŠฉ่ฉž๏ผไธฆ็ซ‹ๅŠฉ่ฉž๏ผ็ต‚ๅŠฉ่ฉž",
  "ๅŠฉ่ฉž,้€ฃไฝ“ๅŒ–",
  "ๅŠฉ่ฉž,ๅ‰ฏ่ฉžๅŒ–",
  "ๅŠฉ่ฉž,็‰นๆฎŠ",
  "ๅŠฉๅ‹•่ฉž",
  "่จ˜ๅท",
  "่จ˜ๅท,ไธ€่ˆฌ",
  "่จ˜ๅท,่ชญ็‚น",
  "่จ˜ๅท,ๅฅ็‚น",
  "่จ˜ๅท,็ฉบ็™ฝ",
  "่จ˜ๅท,ๆ‹ฌๅผง้–‰",
  "ใใฎไป–,้–“ๆŠ•",
  "ใƒ•ใ‚ฃใƒฉใƒผ",
  "้ž่จ€่ชž้Ÿณ"
]
[[token_filters]]
kind = "japanese_katakana_stem"
  [token_filters.args]
  min = 3
[[token_filters]]
kind = "remove_diacritical_mark"
  [token_filters.args]
  japanese = false
$$);

SELECT create_tokenizer('lindera_ipadic', $$
model = "lindera_ipadic"
$$);

INSERT INTO documents (passage) VALUES 
('どこで生れたかとんと見当けんとうがつかぬ。'),
('何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'),
('吾輩はここで始めて人間というものを見た。'),
('しかもあとで聞くとそれは書生という人間中で一番獰悪どうあくな種族であったそうだ。'),
('この書生というのは時々我々を捕つかまえて煮にて食うという話である。'),
('しかしその当時は何という考もなかったから別段恐しいとも思わなかった。'),
('ただ彼の掌てのひらに載せられてスーと持ち上げられた時何だかフワフワした感じがあったばかりである。'),
('掌の上で少し落ちついて書生の顔を見たのがいわゆる人間というものの見始みはじめであろう。'),
('この時妙なものだと思った感じが今でも残っている。'),
('第一毛をもって装飾されべきはずの顔がつるつるしてまるで薬缶やかんだ。'),
('その後ご猫にもだいぶ逢あったがこんな片輪かたわには一度も出会でくわした事がない。'),
('のみならず顔の真中があまりに突起している。'),
('そうしてその穴の中から時々ぷうぷうと煙けむりを吹く。'),
('どうも咽むせぽくて実に弱った。'),
('これが人間の飲む煙草たばこというものである事はようやくこの頃知った。');

UPDATE documents SET embedding = tokenize(passage, 'lindera_ipadic');

CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('書生', 'lindera_ipadic')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

Tokenizer

Tokenizer configuration is a critical aspect of effective text processing, significantly impacting search performance and accuracy. Here are some key considerations and options to help you choose the right tokenizer for your use case.

Tokenizer Options

Tokenizers can be configured in two primary ways:

  • Pre-Trained Models: Suitable for most standard use cases, these models are efficient and require minimal setup. They are ideal for general-purpose applications where the text aligns with the model's training data.
  • Custom Models: Offer flexibility and superior accuracy for specialized texts. These models are trained on specific corpora, making them suitable for domains with unique terminology, such as technical fields or industry-specific jargon.

Usage details can be found in the pg_tokenizer doc.

Key Considerations

  1. Language and Script:
  • Space-Separated Languages (e.g., English, Spanish, German): Simple tokenizers such as bert (for English) or unicode tokenizers are effective here.
  • Non-Space-Separated Languages (e.g., Chinese, Japanese): These require specialized pre-tokenizers that understand language structure beyond simple spaces. You can refer to the Chinese and Japanese examples above.
  • Multilingual Data: Handling multiple languages within a single index requires tokenizers designed for multilingual support, such as gemma2b or llmlingua2, which efficiently manage diverse scripts and languages (see the sketch after this list).
  2. Vocabulary Complexity:
  • Standard Language: For texts with common vocabulary, pre-trained models are sufficient. They handle everyday language efficiently without requiring extensive customization.
  • Specialized Texts: Technical terms, abbreviations (e.g., "k8s" for Kubernetes), or compound nouns may need custom models. Custom models can be trained to recognize domain-specific terms, ensuring accurate tokenization. Custom synonyms may also be necessary for precise results. See the custom model example above.
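
As a sketch of the pre-trained route, the snippet below creates a tokenizer backed by a pre-trained model, following the same create_tokenizer pattern as the bert example above. The exact model name string ("llmlingua2") is an assumption here; check the pg_tokenizer documentation for the names it actually accepts.

-- a minimal sketch: a tokenizer backed by a pre-trained multilingual model
-- NOTE: the model name "llmlingua2" is assumed; see the pg_tokenizer docs for supported names
SELECT create_tokenizer('multilingual', $$
model = "llmlingua2"
$$);

-- tokenize text with it, as in the bert example above
SELECT tokenize('VectorChord supports full-text search.', 'multilingual')::bm25vector;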

Preload (for performance)

For each connection, PostgreSQL loads the model the first time you use it, which may cause a delay for the first query. You can use the add_preload_model function to preload the model at server startup.

psql -c "SELECT add_preload_model('model1')"
# restart PostgreSQL for the change to take effect
sudo docker restart container_name         # for pg_tokenizer running with docker
sudo systemctl restart postgresql.service  # for pg_tokenizer running with systemd

The default preload model is llmlingua2. You can change it using the add_preload_model and remove_preload_model functions.
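
For example, a minimal sketch of switching the preloaded model, reusing the model names from earlier in this README:

SELECT remove_preload_model('llmlingua2');  -- stop preloading the default model
SELECT add_preload_model('model1');         -- preload the custom model created above
-- restart PostgreSQL afterwards for the change to take effect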

Note: A pre-trained model may take a lot of memory (100 MB for gemma2b, 200 MB for llmlingua2). If you have many models, consider the memory usage when preloading them.

Comparison to other solutions in Postgres

PostgreSQL supports full-text search using the tsvector data type and GIN indexes. Text is transformed into a tsvector, which tokenizes content into standardized lexemes, and a GIN index accelerates searches, even on large text fields. However, PostgreSQL lacks modern relevance scoring methods like BM25; it retrieves all matching documents and re-ranks them using ts_rank, which is inefficient and can obscure the most relevant results.

ParadeDB is a PostgreSQL-based alternative that functions as a full-featured replacement for Elasticsearch. It offloads full-text search and filtering operations to Tantivy and includes BM25 among its features, though it uses a different query and filter syntax than PostgreSQL's native indexes.

In contrast, VectorChord-BM25 focuses exclusively on BM25 ranking within PostgreSQL. We implemented the Block-WeakAnd ranking algorithm from scratch and built it as a custom operator and index (similar to pgvector) to accelerate queries. It is designed to be lightweight, with a more native and intuitive API for better full-text search and ranking in PostgreSQL.

Limitations

  • The index will return up to bm25_catalog.bm25_limit results to PostgreSQL. You need to increase bm25_catalog.bm25_limit when using larger LIMIT values or stricter filter conditions (see the example after this list).
  • We have currently only tested against English. Other languages can be supported out of the box with a BPE tokenizer that has a larger vocabulary, such as tiktoken. Feel free to talk to us or raise an issue if you need more language support.
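
For example, a session-level sketch of raising the limit before running a query with a large LIMIT, assuming the GUC can be set per session:

SET bm25_catalog.bm25_limit = 1000;  -- let the index return up to 1000 candidates in this session
-- then run the ORDER BY ... LIMIT query as usual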

Reference

Data Types

  • bm25vector: A specialized vector type for storing BM25 tokenized text. Structured as a sparse vector, it stores token IDs and their corresponding frequencies. For example, {1:2, 2:1} indicates that token ID 1 appears twice and token ID 2 appears once in the document.
  • bm25query: A query type for BM25 ranking.

Functions

  • to_bm25query(index_name regclass, query_vector bm25vector) RETURNS bm25query: Construct a BM25 query from the given index and tokenized query vector.

Operators

  • bm25vector = bm25vector RETURNS boolean: Check if two BM25 vectors are equal.
  • bm25vector <> bm25vector RETURNS boolean: Check if two BM25 vectors are not equal.
  • bm25vector <&> bm25query RETURNS float4: Calculate the negative BM25 score between the BM25 vector and query. The lower the score, the more relevant the document is. (This is intentionally designed to be negative for easier sorting.)

Casts

  • int[]::bm25vector (implicit): Cast an integer array to a BM25 vector. The integer array represents token IDs, and the cast aggregates duplicates into frequencies, ignoring token order. For example, {1, 2, 1} will be cast to {1:2, 2:1} (token ID 1 appears twice, token ID 2 appears once).
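
A minimal check of this cast together with the = operator above (token order is ignored, so both arrays collapse to {1:2, 2:1}):

SELECT ARRAY[1, 2, 1]::bm25vector = ARRAY[1, 1, 2]::bm25vector;  -- returns true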

GUCs

  • bm25_catalog.bm25_limit (integer): The maximum number of documents to return in a search. Default is 100, minimum is -1, and maximum is 65535. When set to -1, it will perform brute force search and return all documents with scores greater than 0.
  • bm25_catalog.enable_index (boolean): Whether to enable the bm25 index. Default is true.
  • bm25_catalog.segment_growing_max_page_size (integer): The maximum page count of the growing segment. When the size of the growing segment exceeds this value, the segment will be sealed into a read-only segment. Default is 4,096, minimum is 1, and maximum is 1,000,000.
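
For example, a sketch of toggling the index off for the current session to compare against a plain sequential scan (assuming the GUC is settable with SET):

SET bm25_catalog.enable_index = false;  -- disable the bm25 index for this session
-- ... run the query and compare results ...
RESET bm25_catalog.enable_index;        -- restore the default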

License

This software is licensed under a dual license model:

  1. GNU Affero General Public License v3 (AGPLv3): You may use, modify, and distribute this software under the terms of the AGPLv3.

  2. Elastic License v2 (ELv2): You may also use, modify, and distribute this software under the Elastic License v2, which has specific restrictions.

You may choose either license based on your needs. We welcome any commercial collaboration or support, so please email us at [email protected] with any questions or requests regarding the licenses.
