
Conversation

@osma
Member

@osma osma commented Mar 26, 2025

Fixes #735

This PR makes it possible to exclude specific subjects from a vocabulary by their URIs. It is controlled by a keyword parameter for the vocab setting, like this:

vocab=yso(en,exclude=http://www.yso.fi/onto/yso/p12345)

or for the ZBW / STW Thesaurus case (excluding the frequent false positives Theory and USA - note that | is used as the separator between URIs):

vocab=stw(en,exclude=http://zbw.eu/stw/descriptor/19073-6|http://zbw.eu/stw/descriptor/17829-1)
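To make the syntax concrete, here is a minimal sketch of how a `vocab` value of this shape could be split into a vocabulary ID, a language and a list of excluded URIs. This is not the parser used in this PR; `VocabSpec`, `parse_vocab_spec` and the regular expression are hypothetical names introduced only for illustration, and the sketch assumes that URIs contain neither commas nor `|`.

```python
import re
from typing import List, NamedTuple, Optional


class VocabSpec(NamedTuple):
    """Hypothetical container for a parsed vocab setting (illustration only)."""

    vocab_id: str
    language: Optional[str]
    exclude: List[str]


# Assumed shape: "<vocab_id>" optionally followed by "(<comma-separated args>)"
_SPEC_RE = re.compile(r"^(?P<vocab_id>[\w-]+)(?:\((?P<args>.*)\))?$")


def parse_vocab_spec(spec: str) -> VocabSpec:
    """Split e.g. 'stw(en,exclude=URI1|URI2)' into its parts."""
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"cannot parse vocab specification: {spec!r}")

    language = None
    exclude: List[str] = []
    args = match.group("args") or ""
    for arg in (a.strip() for a in args.split(",")):
        if not arg:
            continue
        key, sep, value = arg.partition("=")
        if not sep:
            # Positional argument: treated here as the language code
            language = arg
        elif key == "exclude":
            # "|" separates multiple URIs, as in the STW example above
            exclude = [uri for uri in value.split("|") if uri]
        else:
            raise ValueError(f"unknown vocab parameter: {key!r}")
    return VocabSpec(match.group("vocab_id"), language, exclude)
```

With the STW example above, `parse_vocab_spec` would return the ID `stw`, the language `en` and a two-element exclusion list.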

Refactoring

This turned into a substantial refactoring exercise, so the PR is large with lots of churn, even though the filtering/exclusion functionality itself isn't much extra code.

I had to adjust the responsibilities of various classes (AnnifProject, AnnifRegistry, AnnifVocabulary, SubjectIndex) to make it possible to implement SubjectIndexFilter. I also abstracted SubjectIndex into an abstract base class and two concrete implementations, SubjectIndexFile (basically what was in the old SubjectIndex) and SubjectIndexFilter (new class that implements the exclusion).
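The split into an abstract base class plus a file-backed index and a filtering wrapper can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `ListSubjectIndex` stands in for SubjectIndexFile (it keeps subjects in memory instead of reading a file), `ExcludingSubjectIndex` stands in for SubjectIndexFilter, and the method set and the choice to return `None` for excluded subjects are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional, Set, Tuple

Subject = Tuple[str, str]  # (uri, label); simplified for this sketch


class SubjectIndex(ABC):
    """Abstract interface mapping integer subject IDs to subjects."""

    @abstractmethod
    def __len__(self) -> int: ...

    @abstractmethod
    def __getitem__(self, subject_id: int) -> Optional[Subject]: ...

    @abstractmethod
    def by_uri(self, uri: str) -> Optional[int]: ...

    def active(self) -> Iterator[Tuple[int, Subject]]:
        """Yield (subject_id, subject) for every subject that is not hidden."""
        for subject_id in range(len(self)):
            subject = self[subject_id]
            if subject is not None:
                yield subject_id, subject


class ListSubjectIndex(SubjectIndex):
    """Stand-in for the file-backed index; holds subjects in a list."""

    def __init__(self, subjects: List[Subject]) -> None:
        self._subjects = list(subjects)
        self._ids = {uri: i for i, (uri, _) in enumerate(self._subjects)}

    def __len__(self) -> int:
        return len(self._subjects)

    def __getitem__(self, subject_id: int) -> Optional[Subject]:
        return self._subjects[subject_id]

    def by_uri(self, uri: str) -> Optional[int]:
        return self._ids.get(uri)


class ExcludingSubjectIndex(SubjectIndex):
    """Stand-in for the filter: wraps another index and hides excluded URIs."""

    def __init__(self, wrapped: SubjectIndex, exclude: Set[str]) -> None:
        self._wrapped = wrapped
        self._exclude = set(exclude)

    def __len__(self) -> int:
        # Keep the same length so subject IDs stay aligned with the wrapped index
        return len(self._wrapped)

    def __getitem__(self, subject_id: int) -> Optional[Subject]:
        subject = self._wrapped[subject_id]
        if subject is None or subject[0] in self._exclude:
            return None
        return subject

    def by_uri(self, uri: str) -> Optional[int]:
        if uri in self._exclude:
            return None
        return self._wrapped.by_uri(uri)
```

Wrapping rather than copying means the excluded subjects keep their numeric IDs, so code that indexes by ID sees a stable layout and simply gets no subject for the excluded positions.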

The old module structure related to the vocabulary and subject index functionality was quite messy, so I separated the vocabulary-specific functionality into annif.vocab (now a directory) and moved several classes into this new module from annif.corpus, where they had been mixed in with classes related to document corpora.

I also renamed some classes: SubjectCorpus -> VocabSource and SubjectFile* -> VocabFile*. I think these names better represent their function.

Quality assurance complaints

SonarCloud has 1 complaint and CodeClimate shows 8 complaints. These are all in code that was moved around in this PR; the issues themselves were not introduced here.

Future work

This was just the first step in supporting exclusion. I would like to continue with more features after this PR:

@osma osma added this to the 1.4 milestone Mar 26, 2025
@osma osma self-assigned this Mar 26, 2025
@codecov

codecov bot commented Mar 26, 2025

Codecov Report

Attention: Patch coverage is 99.75845% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.63%. Comparing base (ee2f456) to head (4784c7b).
Report is 17 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| annif/project.py | 94.44% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #840      +/-   ##
==========================================
- Coverage   99.64%   99.63%   -0.01%     
==========================================
  Files          96       99       +3     
  Lines        7242     7350     +108     
==========================================
+ Hits         7216     7323     +107     
- Misses         26       27       +1     


@osma osma force-pushed the issue735-subject-filtering branch from e8dd91e to 8c4526a Compare March 26, 2025 10:29
@osma osma force-pushed the issue735-subject-filtering branch from 8c4526a to 5dfc00a Compare March 26, 2025 13:17
@osma osma force-pushed the issue735-subject-filtering branch from f9dfa61 to 592c093 Compare March 27, 2025 11:17
@osma osma changed the title WIP: Subject filtering Subject exclusion support Mar 27, 2025
@osma osma marked this pull request as ready for review March 27, 2025 13:47
@osma osma requested a review from juhoinkinen March 27, 2025 13:48
Member

@juhoinkinen juhoinkinen left a comment


Yes, I also think this makes the vocabulary and subject index functionality better. 👍

There were some tests here and here where variable renaming for the return value for annif.vocab.VocabFile* could be considered as those lines are now being edited (there is subjects and corpus).

@osma
Member Author

osma commented Mar 28, 2025

I tested the exclusion support using TFIDF, Omikuji and MLLM backends. It seems to work as expected for both training and suggestion; the excluded subjects are simply ignored.

For TFIDF and Omikuji (associative methods), if a concept is excluded at training time, the model will not learn it at all. Even if I drop the exclude setting from the project configuration after training, the project will never suggest it.

For MLLM (lexical method), the model does not learn individual concepts, so if I train a project with a concept excluded and then drop the exclude setting, the model will be able to suggest it.
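The difference between the two kinds of backend can be illustrated with a toy sketch. These classes are not the TFIDF, Omikuji or MLLM implementations; `ToyAssociativeModel`, `ToyLexicalModel` and the `ex:` URIs are hypothetical and only mimic the behaviour described above: an associative model stores per-concept associations during training, while a lexical model matches vocabulary labels at suggestion time.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Doc = Tuple[str, Set[str]]  # (text, subject URIs)


class ToyAssociativeModel:
    """Learns word -> subject associations from training documents."""

    def __init__(self) -> None:
        self.assoc: Dict[str, Counter] = defaultdict(Counter)

    def train(self, docs: Iterable[Doc], excluded: FrozenSet[str] = frozenset()) -> None:
        for text, subjects in docs:
            for subject in subjects:
                if subject in excluded:
                    continue  # excluded subjects leave no trace in the model
                for word in text.lower().split():
                    self.assoc[word][subject] += 1

    def suggest(self, text: str, excluded: FrozenSet[str] = frozenset()) -> Set[str]:
        scores: Counter = Counter()
        for word in text.lower().split():
            scores.update(self.assoc.get(word, {}))
        return {subject for subject in scores if subject not in excluded}


class ToyLexicalModel:
    """Matches vocabulary labels against the text at suggestion time."""

    def train(self, docs: Iterable[Doc], excluded: FrozenSet[str] = frozenset()) -> None:
        pass  # toy assumption: nothing concept-specific is stored

    def suggest(
        self,
        text: str,
        vocab: Dict[str, str],
        excluded: FrozenSet[str] = frozenset(),
    ) -> Set[str]:
        words = set(text.lower().split())
        return {
            uri
            for uri, label in vocab.items()
            if uri not in excluded and label.lower() in words
        }


vocab = {"ex:theory": "theory", "ex:usa": "usa"}
docs = [("economic theory in the usa", {"ex:theory", "ex:usa"})]

# Train both models with ex:theory excluded, then suggest WITHOUT the exclusion
associative = ToyAssociativeModel()
associative.train(docs, excluded=frozenset({"ex:theory"}))
lexical = ToyLexicalModel()
lexical.train(docs, excluded=frozenset({"ex:theory"}))

assoc_result = associative.suggest("theory of trade")
lexical_result = lexical.suggest("theory of trade", vocab)
```

In this toy setup `assoc_result` cannot contain `ex:theory` because no association to it was ever stored, whereas `lexical_result` does contain it once the exclusion is dropped.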

@osma osma merged commit 38f5e8a into main Mar 28, 2025
15 of 17 checks passed
@osma osma deleted the issue735-subject-filtering branch March 28, 2025 15:41

Development

Successfully merging this pull request may close these issues.

Dealing with overrepresented concepts / blacklisting

3 participants