IR-Project

Presenting

Take Parquet files from the bucket
Use a slightly modified version of the assignment 3 GCP code to build the indices
Create a dict of metadata for each document, e.g. the full title with no tokenizing or stopword removal
Transfer all data to an instance to run the retrieval engine
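
As a rough illustration of the metadata step above, the sketch below builds a dict from document id to raw title and serializes it so it can be copied to another machine. The function name `build_title_lookup`, the `(doc_id, title)` row shape, and the use of `pickle` are assumptions made for this example; the project's actual code may differ.

```python
import pickle


def build_title_lookup(rows):
    """Map each document id to its raw title (hypothetical helper)."""
    lookup = {}
    for doc_id, title in rows:
        # Keep the title exactly as given: no tokenizing, no stopword removal.
        lookup[doc_id] = title
    return lookup


# Toy rows standing in for (id, title) pairs read from the Parquet files.
rows = [(12, "Information retrieval"), (25, "The Art of Computer Programming")]
titles = build_title_lookup(rows)

# Serialize and restore, as one might do before moving the data to the instance.
blob = pickle.dumps(titles)
restored = pickle.loads(blob)
```

Keeping the untouched title means the engine can return readable results without going back to the source files.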

All search functions other than "search" are implemented according to the requirements.

Search: implemented using a combination of BM25 on the body text, a count of the number of words in the title, and a binary decision on anchor text.
The scores are weighted 2-4-1, a ratio reached by testing several times on different parts of the training set.
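
A minimal sketch of how such a weighted combination could look is given below. It assumes the 2-4-1 weights apply to body, title, and anchor in that order, that the title count means the number of distinct query words found in the title, and that the anchor decision is 1 when any query word appears in the anchor text. The name `combined_score` and its parameters are hypothetical.

```python
def combined_score(bm25_body, query_tokens, title_tokens, anchor_tokens,
                   weights=(2.0, 4.0, 1.0)):
    """Combine body, title, and anchor signals into one score (illustrative)."""
    w_body, w_title, w_anchor = weights  # assumed order: body, title, anchor
    # Count distinct query words that occur in the title.
    title_hits = sum(1 for tok in set(query_tokens) if tok in title_tokens)
    # Binary decision: does any query word occur in the anchor text?
    anchor_hit = 1.0 if any(tok in anchor_tokens for tok in query_tokens) else 0.0
    return w_body * bm25_body + w_title * title_hits + w_anchor * anchor_hit


score = combined_score(
    bm25_body=1.5,
    query_tokens=["python", "search"],
    title_tokens={"python", "tutorial"},
    anchor_tokens={"search"},
)
```

In this toy call the body contributes 2 * 1.5, the single title match contributes 4, and the anchor match contributes 1.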
The main code block calculates the BM25 scores for each document in a NumPy matrix; the report explains this calculation. See the graphic below:
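
The matrix-based BM25 calculation mentioned above might look roughly like the following in NumPy. This is a sketch under stated assumptions, not the project's code: the function name `bm25_matrix`, the defaults `k1=1.5` and `b=0.75`, the non-negative idf variant, and computing document frequencies from the toy matrix itself are all choices made for this example.

```python
import numpy as np


def bm25_matrix(tf, doc_len, k1=1.5, b=0.75):
    """Score every document at once (illustrative sketch).

    tf:      (n_docs, n_terms) term frequencies of the query terms
    doc_len: (n_docs,) document lengths
    returns: (n_docs,) BM25 scores
    """
    tf = np.asarray(tf, dtype=float)
    doc_len = np.asarray(doc_len, dtype=float)
    n_docs = tf.shape[0]
    # Document frequency per term, taken here from the toy matrix itself.
    df = np.count_nonzero(tf, axis=0)
    # Non-negative idf variant (assumed for this sketch).
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Per-document length normalization term.
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    # Broadcast the per-document norm across the term columns.
    weights = tf * (k1 + 1.0) / (tf + norm[:, None])
    return weights @ idf


tf = [[2, 0],   # doc 0 contains term 0 twice
      [0, 1],   # doc 1 contains term 1 once
      [0, 0]]   # doc 2 contains neither query term
scores = bm25_matrix(tf, doc_len=[10, 10, 10])
```

Working on the whole matrix avoids a Python loop over documents; a document with no query terms ends up with a score of zero.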
