Request for Improvements: improve search functionality! #6

@eladve

Some inspiration for the items below can come from web search (e.g. Google), as well as from Memex (https://getmemex.com/).

Track 1: Improve search logic:
Right now I have only implemented single-word search. To improve this, add:

  • Multiword search, exact-phrase search, etc. (à la Google, Memex, ...). This can be done by deploying an existing solution (either a Python library or a database), or it can be implemented from scratch. It shouldn't be too hard: e.g. to look for two words, take the hit list of each and intersect the lists.
  • Better prioritization/ranking of results. Remember that screenshots overlap, so there may be many hits that are very close in time. We need to prioritize so that only one of those is chosen for the "first page", maybe by clustering the timestamps (something like k-means on the timestamps). Search result ranking is a common problem, so there might be an open-source library that would be a good place to start.
  • Maybe add metadata (e.g. which application is in use) and allow searching on that as well.
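The first two items can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the inverted index maps each token to a set of screenshot timestamps in seconds, which may not match how the project actually stores it, and the names `build_index`, `search_all_words`, `collapse_nearby_hits` and the `max_gap` parameter are hypothetical. For the overlapping-screenshot problem it uses a simple gap-based grouping of sorted timestamps rather than k-means, since that needs no cluster count up front.

```python
from collections import defaultdict


def build_index(screenshots):
    """Build a toy inverted index (assumed layout: token -> set of timestamps).

    `screenshots` maps a timestamp in seconds to the list of OCR tokens
    found in that screenshot.
    """
    index = defaultdict(set)
    for ts, tokens in screenshots.items():
        for tok in tokens:
            index[tok.lower()].add(ts)
    return index


def search_all_words(index, query):
    """Return sorted timestamps of screenshots that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hit_sets = [index.get(w, set()) for w in words]
    # Start from the smallest hit list so the intersection shrinks quickly.
    hit_sets.sort(key=len)
    result = set(hit_sets[0])
    for hits in hit_sets[1:]:
        result &= hits
    return sorted(result)


def collapse_nearby_hits(timestamps, max_gap=30):
    """Keep one representative per burst of hits.

    Consecutive hits at most `max_gap` seconds apart are treated as one
    group, and only the earliest hit of each group is kept.
    """
    representatives = []
    last = None
    for ts in sorted(timestamps):
        if last is None or ts - last > max_gap:
            representatives.append(ts)
        last = ts
    return representatives
```

A query would then run as `collapse_nearby_hits(search_all_words(index, "hello world"))`, so the first page shows one hit per burst of near-identical screenshots. Exact-phrase search would additionally need token positions in the index, which this sketch does not store.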

Track 2: Better NLP/tokenization/etc:

  • Right now the text is not tokenized properly, and that hurts the quality of the inverted index. For example, words with dashes are not split, special characters are not removed properly, and in general tokens are not recognized in the way that would best support future searches. This should be fixed. It should be easy to get a 10x improvement by using NLP tools like NLTK (https://www.nltk.org/).
  • Can do even more NLP to enrich the index with metadata: named entity recognition, etc. Each recognized named entity, email address, and so on can simply be added to the inverted index, which makes it searchable later.
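As a baseline for comparison, here is a minimal sketch of the kind of cleanup described above, using only the standard-library `re` module instead of NLTK. The function names `tokenize` and `index_terms` are hypothetical, and the email pattern is a deliberately loose approximation rather than a full address validator. It is meant to show the idea of splitting on dashes and special characters and of adding recognized entities as extra index terms; a real NLP toolkit would handle many more cases.

```python
import re

# Runs of lowercase letters and digits; everything else acts as a separator.
# Note this only covers ASCII text, which is an assumption of this sketch.
TOKEN_RE = re.compile(r"[a-z0-9]+")

# Loose email pattern, good enough to pull addresses out of OCR text.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize(text):
    """Lowercase the text and split it on any non-alphanumeric character.

    Dashed words are split into their parts and punctuation is dropped.
    """
    return TOKEN_RE.findall(text.lower())


def index_terms(text):
    """Return plain tokens plus whole email addresses as extra terms.

    Adding the full address next to its pieces lets a later search match
    either "bob" or "bob@example.com".
    """
    terms = tokenize(text)
    terms.extend(match.lower() for match in EMAIL_RE.findall(text))
    return terms
```

Each term returned by `index_terms` would be added to the inverted index for the screenshot it came from. The same pattern extends to other recognizers (named entities, URLs, dates) by appending their matches to the term list.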

Track 3: GUI and interactivity:

  • Once the query is executed and properly ranked search results are returned, the user still has to get value out of them by scrolling through the hits, examining them, maybe refining the search, looking at the OCR results, and in general hunting for the hit they wanted. This needs to be improved via GUI and interactivity: allowing the user to scroll through the results, pick some, magnify them, scroll through the timeline, etc.
