stem, a stemming algorithm in OCaml

A stemming algorithm attempts to reduce each word to its root. This library lets you "tokenize" a document and apply the stemming algorithm to the resulting tokens (treated as words). It then counts how often each stem occurs and produces a CSV document mapping the "stems" to their frequencies.

The purpose of stemming is to be able to treat several words (such as "tout", "toutes", and "tous") as a single root. This way, the resulting stems and their frequencies better reflect the information the document is trying to convey. The idea is then to enable document indexing based on these stems.
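The pipeline described above (tokenize, stem, count, emit CSV) can be sketched in a few lines of OCaml. This is only an illustration under stated assumptions: `tokenize`, `toy_stem`, `frequencies`, and `to_csv` are hypothetical names that are not part of the stem library, and `toy_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```ocaml
(* Toy pipeline: tokenize -> stem -> count -> CSV.
   Every name here is hypothetical; none of this is the stem library's API. *)

(* Split on single spaces only and drop empty tokens. *)
let tokenize (s : string) : string list =
  String.split_on_char ' ' s |> List.filter (fun w -> w <> "")

(* Crude stand-in for a real stemmer: strip a trailing "es" or "s",
   but only when at least three characters remain. *)
let toy_stem (w : string) : string =
  let strip suffix =
    let lw = String.length w and ls = String.length suffix in
    if lw > ls + 2 && String.sub w (lw - ls) ls = suffix
    then Some (String.sub w 0 (lw - ls))
    else None
  in
  match strip "es" with
  | Some root -> root
  | None -> (match strip "s" with Some root -> root | None -> w)

(* Count stems, then sort by decreasing frequency (ties broken by stem). *)
let frequencies (words : string list) : (string * int) list =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun w ->
      let stem = toy_stem w in
      let n = Option.value (Hashtbl.find_opt tbl stem) ~default:0 in
      Hashtbl.replace tbl stem (n + 1))
    words;
  Hashtbl.fold (fun stem n acc -> (stem, n) :: acc) tbl []
  |> List.sort (fun (s1, n1) (s2, n2) ->
         match compare n2 n1 with 0 -> compare s1 s2 | c -> c)

(* One "stem",count row per line; %S quotes and escapes the stem. *)
let to_csv (rows : (string * int) list) : string =
  List.map (fun (stem, n) -> Printf.sprintf "%S,%d" stem n) rows
  |> String.concat "\n"

let () =
  print_endline (to_csv (frequencies (tokenize "chat chats noir noirs noires")))
```

Here "noir", "noirs", and "noires" collapse into the single stem "noir" (3 occurrences) and "chat", "chats" into "chat" (2 occurrences), so the program prints `"noir",3` followed by `"chat",2`. Note that `%S` escapes non-ASCII bytes in decimal, so a stem such as "mêm" would be printed as `"m\195\170m"`; whether stem.ts formats its rows this way is an assumption.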

How to install and use it

stem is a package available through opam. It provides two tools: stemmer and stem.ts. The latter lets you specify multiple tokenizers, the language, and how the result is displayed in CSV format:

$ opam install stem
$ stem.ts -l french -a bert:isolate -a whitespace:remove file.txt
"est",14
"son",13
"tout",11
"Julien",11
"plus",9
"trouv",8
"dan",8
"bien",7
"m\195\170m",7
...

A little search engine (BM25)

The distribution also includes a small search engine based on the Okapi BM25 algorithm. It counts occurrences of stems rather than raw words, which avoids treating variants such as "should" and "shoulds" as separate terms.

A search tool, okapi, rates the relevance of each document in a given corpus to a query.

$ opam install bm25
$ okapi -d file0.txt -d file1.txt -d file2.txt -l french "un chat noir"
file0.txt: 1.356894
file2.txt: 0.439572
file1.txt: 0.000000
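To give an idea of how such scores can be computed, here is a minimal sketch of Okapi BM25 over documents that have already been reduced to lists of stems. It is not the bm25 package's implementation: the function names (`count`, `idf`, `score`), the parameters `k1 = 1.2` and `b = 0.75`, and the "+1" variant of the inverse document frequency are common textbook choices assumed here for illustration.

```ocaml
(* Okapi BM25 over pre-stemmed documents (each document is a list of stems).
   k1, b and the "+1" IDF variant are assumptions, not taken from the bm25 package. *)

let k1 = 1.2
let b = 0.75

(* Number of occurrences of [term] in [doc]. *)
let count (term : string) (doc : string list) : int =
  List.length (List.filter (String.equal term) doc)

(* Inverse document frequency: rarer terms weigh more. *)
let idf ~(corpus : string list list) (term : string) : float =
  let n = float_of_int (List.length corpus) in
  let df = float_of_int (List.length (List.filter (List.mem term) corpus)) in
  log (((n -. df +. 0.5) /. (df +. 0.5)) +. 1.0)

(* BM25 score of [doc] for [query], relative to [corpus] (assumed non-empty). *)
let score ~(corpus : string list list) ~(query : string list)
    (doc : string list) : float =
  let total = List.fold_left (fun acc d -> acc + List.length d) 0 corpus in
  let avgdl = float_of_int total /. float_of_int (List.length corpus) in
  let len = float_of_int (List.length doc) in
  List.fold_left
    (fun acc term ->
      let f = float_of_int (count term doc) in
      (* Length normalisation: long documents are penalised through [b]. *)
      let norm = f +. (k1 *. (1.0 -. b +. (b *. len /. avgdl))) in
      acc +. (idf ~corpus term *. f *. (k1 +. 1.0) /. norm))
    0.0 query

let () =
  let corpus =
    [ [ "un"; "chat"; "noir" ]; [ "un"; "chien"; "blanc" ]; [ "chat"; "gris" ] ]
  in
  let query = [ "chat"; "noir" ] in
  List.iteri
    (fun i doc -> Printf.printf "doc%d: %f\n" i (score ~corpus ~query doc))
    corpus
```

In this toy corpus the first document, which contains both query stems, ranks highest; the third, which contains only "chat", comes next; and the second, which contains neither, scores exactly zero, as file1.txt does in the sample output above.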
