davis0011/IR-Project

IR-Project

Presenting

Basic Preparation Overview

  •   Take the Parquet files from the bucket
  •   Use a slightly modified version of the assignment 3 GCP code to build the indices
  •   Create a dict of metadata on each document, e.g. the full title with no tokenizing or stopword removal
  •   Transfer all data to an instance to run the retrieval engine
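
The indexing and metadata steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's code: the real pipeline reads Parquet files from a bucket and uses the assignment 3 GCP code, neither of which is shown here. The names `STOPWORDS`, `tokenize`, and `build_index`, the tiny stopword list, and the sample documents are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; the project's actual list is not shown in this README.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Build a body index plus per-document metadata.

    docs: iterable of (doc_id, title, body) tuples.
    Returns (postings, doc_len, titles):
      postings: term -> list of (doc_id, term frequency in the body)
      doc_len:  doc_id -> number of body tokens after stopword removal
      titles:   doc_id -> full title, stored without tokenizing
    """
    postings = defaultdict(list)
    doc_len = {}
    titles = {}
    for doc_id, title, body in docs:
        tokens = tokenize(body)
        doc_len[doc_id] = len(tokens)
        titles[doc_id] = title  # kept verbatim for display
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    return dict(postings), doc_len, titles


# Made-up sample documents, for illustration only.
docs = [
    (1, "The History of Python", "Python is a programming language."),
    (2, "Snakes", "The python is a large snake. A python eats mice."),
]
postings, doc_len, titles = build_index(docs)
```

Keeping the untokenized title in a separate dict means the engine can return readable titles without touching the index at query time.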

Main Search Overview

All search functions other than "search" follow the assignment requirements.

  •   Search: combines a BM25 score on the body text, a count of the query words found in the title, and a binary match on the anchor text.

  •   The scores are weighted 2-4-1, a ratio reached by testing several times on different parts of the training set.

  •   The main code block calculates the BM25 scores for each document in a NumPy matrix; the report explains this calculation. See the graphic below:
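
As a rough illustration of the scoring described above, the sketch below computes BM25 over a term-by-document NumPy matrix and then blends the three signals. It makes several assumptions that the README does not confirm: the BM25 variant (idf with a `+ 1` inside the logarithm), the parameters `k1=1.5` and `b=0.75`, the mapping of the 2-4-1 weights to body, title, and anchor in that order, and the absence of any score normalization. The function names `bm25_matrix` and `combined_score` are hypothetical.

```python
import numpy as np


def bm25_matrix(tf, doc_len, k1=1.5, b=0.75):
    """Per-term BM25 scores for every candidate document.

    tf:      (n_terms, n_docs) query-term frequencies in each body.
    doc_len: (n_docs,) body lengths in tokens.
    Returns an (n_terms, n_docs) array; sum over axis 0 for the body score.
    k1 and b are assumed defaults, not values taken from the project.
    """
    tf = np.asarray(tf, dtype=float)
    doc_len = np.asarray(doc_len, dtype=float)
    n_docs = tf.shape[1]
    # Document frequency: in how many documents each query term occurs.
    df = np.count_nonzero(tf, axis=1)
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization, one value per document.
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    return idf[:, None] * tf * (k1 + 1.0) / (tf + norm[None, :])


def combined_score(tf, doc_len, title_hits, anchor_match, weights=(2.0, 4.0, 1.0)):
    """Weighted sum of body BM25, title word count, and binary anchor match.

    The (body, title, anchor) order of the weights is an assumption.
    """
    w_body, w_title, w_anchor = weights
    body = bm25_matrix(tf, doc_len).sum(axis=0)
    title = np.asarray(title_hits, dtype=float)
    anchor = np.asarray(anchor_match, dtype=float)
    return w_body * body + w_title * title + w_anchor * anchor


# Made-up example: 2 query terms, 3 candidate documents.
tf = [[2, 0, 1],
      [0, 0, 3]]
doc_len = [10, 20, 30]
title_hits = [1, 0, 2]      # query words found in each title
anchor_match = [1, 0, 0]    # 1 if the query matches the anchor text
scores = combined_score(tf, doc_len, title_hits, anchor_match)
ranking = np.argsort(-scores)  # document indices, best first
```

Holding the term frequencies in one matrix lets the whole BM25 calculation run as a few broadcast operations instead of a Python loop over documents.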
