Subject exclusion support #840
Codecov Report
Coverage Diff: main vs. #840

             main     #840      +/-
Coverage   99.64%   99.63%   -0.01%
Files          96       99       +3
Lines        7242     7350     +108
Hits         7216     7323     +107
Misses         26       27       +1
juhoinkinen
left a comment
I tested the exclusion support using the TFIDF, Omikuji and MLLM backends. It seems to work as expected for both training and suggestion; the excluded subjects are simply ignored. For TFIDF and Omikuji (associative methods), if a concept is excluded at training time, the model will not learn it at all. Even if I drop the exclude setting from the project configuration after training, the project will never suggest it. MLLM (lexical method) does not learn individual concepts, so if I train a project with a concept excluded and then drop the exclude setting, the model will be able to suggest it.



Fixes #735
This PR makes it possible to exclude specific subjects from a vocabulary by their URIs. It is controlled by a keyword parameter for the vocab setting, like this:

or for the ZBW / STW Thesaurus case (excluding the frequent false positives Theory and USA - note that | is used as the separator between URIs):

Refactoring
This turned into quite a refactoring exercise so the PR is quite large with lots of churn, even though the filtering/exclusion functionality itself isn't that much extra code.
I had to adjust the responsibilities of various classes (AnnifProject, AnnifRegistry, AnnifVocabulary, SubjectIndex) to make it possible to implement SubjectIndexFilter. I also abstracted SubjectIndex into an abstract base class and two concrete implementations, SubjectIndexFile (basically what was in the old SubjectIndex) and SubjectIndexFilter (new class that implements the exclusion).
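The split described above (an abstract base class, a concrete index, and a filtering wrapper) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class names `SubjectIndexBase`, `ListSubjectIndex` and `ExcludingSubjectIndex`, the `Subject` dataclass, the `active()` helper and the example URIs are all hypothetical stand-ins. It also assumes that excluded subjects keep their numeric IDs and are simply reported as `None`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Subject:
    # Hypothetical minimal subject record: just a URI and a label.
    uri: str
    label: str


class SubjectIndexBase(ABC):
    """Hypothetical interface shared by the concrete index and the filter."""

    @abstractmethod
    def __len__(self) -> int:
        ...

    @abstractmethod
    def __getitem__(self, subject_id: int) -> Optional[Subject]:
        ...

    def active(self) -> List[Tuple[int, Subject]]:
        """Return (id, subject) pairs for subjects that are not excluded."""
        pairs = []
        for subject_id in range(len(self)):
            subject = self[subject_id]
            if subject is not None:
                pairs.append((subject_id, subject))
        return pairs


class ListSubjectIndex(SubjectIndexBase):
    """Stand-in for a file-backed index; holds the subjects in a list."""

    def __init__(self, subjects: Iterable[Subject]) -> None:
        self._subjects = list(subjects)

    def __len__(self) -> int:
        return len(self._subjects)

    def __getitem__(self, subject_id: int) -> Optional[Subject]:
        return self._subjects[subject_id]


class ExcludingSubjectIndex(SubjectIndexBase):
    """Wraps another index and hides subjects whose URI is excluded."""

    def __init__(self, base: SubjectIndexBase, exclude: Iterable[str]) -> None:
        self._base = base
        self._exclude = frozenset(exclude)

    def __len__(self) -> int:
        # Same length as the wrapped index, so subject IDs stay stable.
        return len(self._base)

    def __getitem__(self, subject_id: int) -> Optional[Subject]:
        subject = self._base[subject_id]
        if subject is None or subject.uri in self._exclude:
            return None
        return subject


base = ListSubjectIndex(
    [
        Subject("http://example.org/theory", "Theory"),
        Subject("http://example.org/economics", "Economics"),
    ]
)
filtered = ExcludingSubjectIndex(base, exclude=["http://example.org/theory"])
```

Because the wrapper only overrides lookup, code that consumes the base-class interface would not need to know whether filtering is in effect.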
The old module structure related to the vocabulary and subject index functionality was quite messy, so I ended up separating vocabulary-specific functionality into annif.vocab (now a directory) and moved several classes into this new module from annif.corpus, where they had been located, mixed up with classes related to document corpora.
I also renamed some classes: SubjectCorpus -> VocabSource and SubjectFile* -> VocabFile*. I think these names better represent their function.
Quality assurance complaints
SonarCloud has 1 complaint and CodeClimate shows 8 complaints. These are all in code that was moved around in this PR; the issues themselves were not introduced by it.
Future work
This was just the first step in supporting exclusion. I would like to continue with more features after this PR:
exclude=* rule (followed by include rules) - this could probably be implemented using the fnmatch standard library module
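One way the wildcard idea might work is sketched below. This is purely illustrative and not part of this PR: the `is_excluded` function, the `(action, pattern)` rule format, the "last matching rule wins" ordering and the example URIs are all assumptions. Only `fnmatch.fnmatchcase` is taken from the standard library.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Tuple

Rule = Tuple[str, str]  # hypothetical format: ("exclude" | "include", pattern)


def is_excluded(uri: str, rules: Iterable[Rule]) -> bool:
    """Apply rules in order; the last rule whose pattern matches decides."""
    excluded = False
    for action, pattern in rules:
        # fnmatchcase is case-sensitive, which suits URIs.
        if fnmatchcase(uri, pattern):
            excluded = action == "exclude"
    return excluded


# Exclude everything, then re-include one URI namespace.
rules = [
    ("exclude", "*"),
    ("include", "http://example.org/keep/*"),
]
```

Note that fnmatch also treats `?` and `[...]` as pattern syntax, so URIs containing those characters would need care in a real implementation.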