'txt' is a search engine that implements tf-idf ranking over TREC data sets to demonstrate retrieval and evaluation in typical Information Retrieval experiments. It tokenizes the corpus and the queries into its own format and builds an inverted index from the tokenized corpus, which is then searched with the tokenized queries. The search results can then be ranked and evaluated.
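
To make the idea concrete, here is a minimal Python sketch of tf-idf scoring over an inverted index. It is an illustration only and not the code in ii.c: the function names, the toy documents, and the weighting (raw term frequency times log(N/df)) are assumptions, and 'txt' may use a different tf-idf variant.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (a simple inverted index)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query_tokens):
    """Score documents by summing tf * idf over the query terms.

    Assumed weighting: idf = log(N / df). This is one common variant,
    not necessarily the one 'txt' implements.
    """
    scores = defaultdict(float)
    for term in query_tokens:
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the corpus
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical, already tokenized toy corpus.
docs = {
    "d1": ["inverted", "index", "search"],
    "d2": ["search", "engine", "search"],
    "d3": ["index", "compression"],
}
index = build_index(docs)
ranking = search(index, len(docs), ["search", "engine"])
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Only documents that contain at least one query term receive a score, which is why the index is organized by term rather than by document.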

Tokenize the queries and the corpus:

    ./raw2t -x -n -c TRECQUERY <q.txt >q.t
    ./raw2t -x -n -c TREC <d.txt >d.t

Print readable tokenized files (if needed):

    ./t2mem <q.t >q.mem
    ./t2mem <d.t >d.mem

Build an inverted index from the tokenized corpus and search it (-s)
with the tokenized queries:

    ./ii -s q.t <d.t >res

Rank the search results, grouping by the first field and sorting by the
third field in descending numeric order:

    sort -k1,1 -k3,3nr res >rank

Convert the result to TREC run format:

    awk -f txt2trecrun.awk <rank >run
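
For reference, a TREC run file has one line per retrieved document with six whitespace-separated fields: query id, the literal Q0, document number, rank, score, and a run tag. The Python sketch below shows what such a conversion could look like. It is not the contents of txt2trecrun.awk: it assumes that each line of 'rank' holds the query id, document id, and score in its first three fields, and the function name and the run tag "txt" are made up for the example.

```python
def to_trec_run(lines, tag="txt"):
    """Convert ranked 'qid docno score' lines into TREC run format lines.

    Assumes the input is already grouped by query and sorted by
    descending score, so the rank is just a per-query counter.
    """
    out = []
    rank_in_query = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        qid, docno, score = fields[0], fields[1], fields[2]
        rank = rank_in_query.get(qid, 0) + 1
        rank_in_query[qid] = rank
        out.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    return out


# Hypothetical ranked input, in the assumed three-column layout.
ranked = ["301 DOC-A 2.5", "301 DOC-B 1.0", "302 DOC-C 0.7"]
run = to_trec_run(ranked)
print("\n".join(run))
```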