Inspiration for the items below can come from web search engines (e.g. Google) and from Memex (https://getmemex.com/).
Track 1: Improve search logic:
Right now only single-word search is implemented. To improve this, add:
- Multi-word search, exact-phrase search, etc. (à la Google, Memex, ...). This can be done by deploying an existing solution (either a Python library or a database) or by implementing it from scratch, which shouldn't be too hard. E.g. to look for two words, take the hit list of each one and intersect the lists. Exact-phrase search additionally needs the position of each word within a screenshot's text.
- Better prioritization/ranking of results. Remember that screenshots overlap, so there may be many hits that are very close in time. These need to be prioritized so that only one of them is chosen for the "first page", maybe by clustering the timestamps (something like k-means on the timestamps). Search result ranking is a common problem, so there might be an open-source library that is a good place to start.
- Maybe add metadata (e.g. which application was in use) and allow searching on it as well.
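A minimal from-scratch sketch of the first two bullets, in case it helps to get started. Everything here is an assumption for illustration: the function names (`build_index`, `search_all`, `search_phrase`, `cluster_by_time`), the index layout (token -> screenshot id -> positions), and the 30-second `max_gap` default are hypothetical and not taken from the current code. The timestamp grouping is a simple gap-based split, swapped in for k-means because it needs no cluster count up front.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps a screenshot id to its token list.
    Returns a positional index: {token: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(doc_id, []).append(pos)
    return index


def search_all(index, words):
    """Multi-word (AND) search: intersect the hit lists of every query word."""
    hit_sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*hit_sets) if hit_sets else set()


def search_phrase(index, words):
    """Exact-phrase search: the words must sit at consecutive positions."""
    hits = set()
    for doc_id in search_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.add(doc_id)
    return hits


def cluster_by_time(timestamps, max_gap=30.0):
    """Group timestamps (in seconds); a gap larger than max_gap starts a new cluster.
    Note: gap-based grouping, not k-means."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters
```

For the "first page", one hit per cluster could then be shown (for example the earliest timestamp in each group), with the remaining hits available on demand.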
Track 2: Better NLP/tokenization/etc:
- Right now the text is not tokenized properly, and that hurts the quality of the inverted index. For example, words with dashes are not split, special characters are not removed properly, and in general tokens are not recognized in the way that best supports later search. This should be fixed. It should be easy to get a 10x improvement by using NLP tools like NLTK (https://www.nltk.org/).
- Can do even more NLP to enrich the index with metadata: named entity recognition, etc. Each named entity, email address, etc. that is recognized can simply be added to the inverted index, which makes it searchable later.
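As a rough illustration of the tokenization fixes described above, here is a sketch that uses only the standard-library `re` module as a stand-in; it is not NLTK, and a proper NLTK tokenizer would likely do better. The names `TOKEN_RE`, `EMAIL_RE`, and `tokenize` are hypothetical, and the email pattern is deliberately simplistic. It lowercases the text, splits on dashes and other special characters, and keeps recognized email addresses as extra whole tokens so they can go into the inverted index.

```python
import re

# Runs of letters/digits, optionally followed by an apostrophe suffix ("bob's").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")
# Deliberately simple email pattern; an assumption, not a full address grammar.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def tokenize(text):
    """Return lowercase word tokens followed by any email addresses found."""
    text = text.lower()
    emails = EMAIL_RE.findall(text)
    # Blank out the emails first so they are not also split into fragments.
    words = TOKEN_RE.findall(EMAIL_RE.sub(" ", text))
    return words + emails
```

With this sketch, a dashed word such as "Open-source" yields the two tokens "open" and "source", while an address such as "bob@example.com" survives as a single token.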
Track 3: GUI and interactivity:
- Once the query is executed and properly ranked search results are returned, the user still needs to get value from them: scrolling through the hits, examining them, maybe refining the search, looking at the OCR results, and in general finding the hit they wanted. This needs to be improved via the GUI and interactivity: scrolling through the results, picking some, magnifying them, scrolling through the timeline, etc.