- Davit Shavit https://github.com/davis0011
- Noam Kent https://github.com/Kentno
- Read the Parquet files from the bucket
- Use a slightly modified version of the assignment 3 GCP code to build the indices
- Create a dict of metadata for each document, e.g. the full title with no tokenization or stopword removal
- Transfer all data to an instance to run the retrieval engine
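
The indexing and metadata steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `tokenize`, `build_index`, the `STOPWORDS` set, and the in-memory `docs` list are all hypothetical stand-ins (the real pipeline reads Parquet files and uses the assignment 3 code).

```python
from collections import Counter, defaultdict

# Hypothetical stopword list and tokenizer; the project's own may differ.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, and drop stopwords."""
    tokens = (word.strip(".,;:!?()\"'").lower() for word in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]

def build_index(docs):
    """docs: iterable of (doc_id, title, body) tuples.

    Returns (index, metadata), where index maps term -> [(doc_id, tf), ...]
    and metadata maps doc_id -> {"title": raw title, "length": body token count}.
    """
    index = defaultdict(list)
    metadata = {}
    for doc_id, title, body in docs:
        tokens = tokenize(body)
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
        # Keep the full title exactly as written (no tokenizing or stopword removal).
        metadata[doc_id] = {"title": title, "length": len(tokens)}
    return dict(index), metadata

# Toy stand-in for rows that would normally come from the Parquet files.
docs = [
    (1, "The History of Python", "Python is a programming language."),
    (2, "Snakes", "The python is a large snake. Python snakes are not venomous."),
]
index, metadata = build_index(docs)
```

Storing the untouched title in the metadata dict lets the engine return readable titles without going back to the raw data.
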
All search functions other than "search" are implemented according to the requirements.
Search: implemented as a combination of three signals: BM25 on the body text, a count of the query words found in the title, and a binary decision on the anchor text.
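
As an illustration of the body-text signal, BM25 can be computed for every document at once with NumPy. The sketch below is not the project's implementation: the function name `bm25_scores`, the dense `tf` matrix layout, the smoothed IDF variant, and the defaults `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import numpy as np

def bm25_scores(tf, doc_len, query_terms, k1=1.5, b=0.75):
    """Score every document against a query in one vectorized pass.

    tf          : (n_terms, n_docs) array of term frequencies
    doc_len     : (n_docs,) array of document lengths in tokens
    query_terms : row indices into tf for the query's terms
    k1, b       : assumed BM25 parameters (common defaults, not from the project)
    """
    n_docs = tf.shape[1]
    tf_q = tf[query_terms].astype(float)           # (n_query_terms, n_docs)
    df = np.count_nonzero(tf_q, axis=1)            # document frequency per query term
    # Smoothed IDF; the 1.0 inside the log keeps every value non-negative.
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization, one value per document.
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    per_term = idf[:, None] * tf_q * (k1 + 1.0) / (tf_q + norm)
    return per_term.sum(axis=0)                    # (n_docs,) BM25 score per document

# Toy data: 2 terms, 3 documents.
tf = np.array([[2, 0, 1],
               [0, 3, 1]])
doc_len = np.array([4, 6, 2])
scores = bm25_scores(tf, doc_len, [0, 1])
```
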
The scores are weighted 2-4-1, a ratio reached by testing several times on different parts of the training set.
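
A minimal sketch of such a weighted combination is shown here. It assumes the 2-4-1 weights map to body, title, and anchor in that order, and that the title signal counts distinct query words; the helper names `title_match_count`, `anchor_hit`, and `combined_score` are hypothetical.

```python
# Assumed mapping of "2-4-1": body BM25 = 2, title matches = 4, anchor hit = 1.
WEIGHTS = {"body": 2.0, "title": 4.0, "anchor": 1.0}

def title_match_count(query_tokens, title_tokens):
    """Number of distinct query words that appear in the title."""
    return len(set(query_tokens) & set(title_tokens))

def anchor_hit(query_tokens, anchor_tokens):
    """Binary signal: 1 if any query word appears in the anchor text, else 0."""
    return int(bool(set(query_tokens) & set(anchor_tokens)))

def combined_score(body_bm25, query_tokens, title_tokens, anchor_tokens):
    """Weighted sum of the three signals for a single document."""
    return (WEIGHTS["body"] * body_bm25
            + WEIGHTS["title"] * title_match_count(query_tokens, title_tokens)
            + WEIGHTS["anchor"] * anchor_hit(query_tokens, anchor_tokens))

# One document with a body BM25 score of 1.2, one title match, and an anchor hit.
score = combined_score(1.2, ["python", "snake"], ["python", "history"], ["snake"])
```
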
The main code block calculates the BM25 scores for each document in a numpy matrix; the report explains this in more detail. See the graphic below: