kakkeshyor/txt

'txt' is a search engine that implements the tf-idf algorithm and works on TREC data sets to demonstrate retrieval and evaluation in typical information retrieval experiments. It tokenizes the corpus and the queries into its own format and builds an inverted index from the corpus, which can then be searched with the tokenized queries. The search results can then be ranked and evaluated.
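
As a rough illustration of the idea (not the code used by 'txt'), the following Python sketch builds an inverted index and scores documents with tf-idf. It assumes whitespace tokenization and the common weighting tf * log(N / df); the names `build_index` and `search` are hypothetical, and the actual tokenizer and weighting in 'txt' may differ.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency} (a simple inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase, whitespace-separated tokens.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Score documents by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the corpus
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {"d1": "the cat sat", "d2": "the dog sat down", "d3": "cat and cat"}
index = build_index(docs)
results = search(index, len(docs), "cat")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Only the posting lists of the query terms are visited, which is the reason for building an inverted index rather than scanning every document per query.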

Tokenize the corpus and the queries:

    ./raw2t -x -n -c TRECQUERY <q.txt >q.t
    ./raw2t -x -n -c TREC <d.txt >d.t

Print readable tokenized files (if needed):

    ./t2mem <q.t >q.mem
    ./t2mem <d.t >d.mem

Build an inverted index from the tokenized corpus and search it (-s) using
the tokenized queries:

    ./ii -s q.t <d.t >res

Rank the search results, sorting by the first field and then by the third field in descending numeric order:

    sort -k1,1 -k3,3nr res >rank

Convert the ranked results to TREC run format:

    awk -f txt2trecrun.awk <rank >run
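
For readers unfamiliar with the target format, a TREC run file has one line per retrieved document with six fields: query ID, the literal `Q0`, document ID, rank, score, and a run tag. The Python sketch below shows such a conversion. It is not a port of `txt2trecrun.awk`: it assumes each ranked line has the form `qid docid score`, which is an inference from the sort keys used above rather than something the README states, and the function name `to_trec_run` and the run tag `txt` are hypothetical.

```python
from itertools import groupby

def to_trec_run(ranked_lines, run_tag="txt"):
    """Convert ranked 'qid docid score' lines into TREC run lines."""
    rows = [line.split() for line in ranked_lines if line.strip()]
    out = []
    # Assumption: input is already sorted by qid, then by descending score,
    # so the position within each query group is the rank.
    for qid, group in groupby(rows, key=lambda r: r[0]):
        for rank, (_, docid, score) in enumerate(group, start=1):
            out.append(f"{qid} Q0 {docid} {rank} {score} {run_tag}")
    return out

ranked = ["1 d3 2.5", "1 d1 1.0", "2 d2 0.7"]
run_lines = to_trec_run(ranked)
for line in run_lines:
    print(line)
```

Ranks here start at 1 within each query; some tools start at 0, so adjust `start` if the evaluation setup expects that.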
