kakkeshyor/txt
'txt' is a search engine that implements tf-idf ranking on TREC datasets to demonstrate retrieval and evaluation in typical Information Retrieval experiments. It tokenizes the corpus and the queries, converting them into its own format, and builds an inverted index from the corpus. The index is then searched with the tokenized queries, and the search results can be ranked and evaluated.
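For orientation, one common textbook form of tf-idf scoring is sketched below. This README does not state which variant 'txt' implements, so the exact weighting (raw term frequency, natural logarithm, no length normalization) is an assumption. Here $\mathrm{tf}_{t,d}$ is the number of times term $t$ occurs in document $d$, $\mathrm{df}_t$ is the number of documents containing $t$, and $N$ is the number of documents in the corpus.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t},
\qquad
\mathrm{score}(q, d) = \sum_{t \in q} w_{t,d}
```

Under this form, a term that appears in every document has $\log(N/\mathrm{df}_t) = 0$ and contributes nothing to the score, while rare terms are weighted more heavily.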
Tokenize the queries and the corpus:
./raw2t -x -n -c TRECQUERY <q.txt >q.t
./raw2t -x -n -c TREC <d.txt >d.t
Print readable tokenized files (if needed):
./t2mem <q.t >q.mem
./t2mem <d.t >d.mem
Build an inverted index from the tokenized corpus and search it (-s) using the tokenized queries:
./ii -s q.t <d.t >res
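As a rough illustration of what an inverted index is, the sketch below maps each term to the list of documents that contain it. It is not the algorithm or the file format that `ii` uses; the plain-text input layout (a document id followed by its terms) and the sample data are assumptions made for this example only.

```shell
# Hypothetical input: one document per line, document id first, then its terms.
# Real .t files use the tool's own tokenized format, which is not shown here.
postings=$(printf '%s\n' \
  'd1 cat sat mat' \
  'd2 cat ran' \
  'd3 dog sat' |
  awk '
    # Append the document id to the posting list of every term on the line.
    # (A term repeated within one document would be listed twice.)
    { for (i = 2; i <= NF; i++) post[$i] = post[$i] " " $1 }
    # Emit one line per term; awk iteration order is unspecified, so sort after.
    END { for (t in post) print t ":" post[t] }
  ' |
  sort)
printf '%s\n' "$postings"
```

Looking up a query term then amounts to reading one posting list instead of scanning every document.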
Rank the search results (grouped by the first column, then by the third column in descending numeric order):
sort -k1,1 -k3,3nr res >rank
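The effect of the two sort keys can be seen on a small sample. The column layout used here (query id, document id, score) is an assumption inferred from the sort keys, and the sample lines are made up for illustration.

```shell
# Hypothetical search output: query id, document id, score.
ranked=$(printf '%s\n' \
  'q2 d1 0.40' \
  'q1 d3 1.20' \
  'q1 d7 2.50' \
  'q2 d9 3.10' \
  'q1 d2 0.75' |
  sort -k1,1 -k3,3nr)   # group by column 1, then column 3 numeric, descending
printf '%s\n' "$ranked"
```

The lines for each query end up adjacent, with the highest score first within each group.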
Convert the result to TREC run format:
awk -f txt2trecrun.awk <rank >run
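The TREC run format accepted by trec_eval has six whitespace-separated columns: query id, the literal `Q0`, document id, rank, score, and a run tag. The sketch below shows what such a conversion could look like; it is not the content of `txt2trecrun.awk`. The input layout (query id, document id, score, already ranked) and the run tag `txt` are assumptions.

```shell
# Hypothetical ranked input: query id, document id, score, best first per query.
run=$(printf '%s\n' \
  'q1 d7 2.50' \
  'q1 d3 1.20' \
  'q2 d9 3.10' |
  awk '{
    if ($1 != prev) { rank = 0; prev = $1 }   # restart ranks for each new query
    rank++
    print $1, "Q0", $2, rank, $3, "txt"       # "txt" is a placeholder run tag
  }')
printf '%s\n' "$run"
```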