This is a patch release that fixes bugs that arose after the Annif 1.4.0 release.

Bug fixes
#901/#904 Fix message for loading SKOS vocabs (credit: @CaptSolo)
#902/#903 Fix field size limitation in CSV corpus format (credit: @RietdorfC)
#908 Fix zip destination path in Hugging Face repo when using custom data directory location
#912/#913 Add vocab size and loaded status to project information returned by REST API

This release introduces three new corpus formats: a JSON-based full text corpus format (one file per document) and two short-text formats, one based on JSON Lines and another based on CSV. All the new corpus formats include support for document IDs as well as metadata: it is now possible to include structured information such as titles and abstracts for documents. This flexibility is intended to improve the handling of documents that require additional context beyond just the text itself; projects may be configured to operate only on specific metadata fields using the new select transform. All the new corpus formats can be used alongside existing formats.
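As a rough illustration of what a JSON Lines short-text corpus with document IDs and metadata could look like, the sketch below writes two records and reads them back. The field names (`document_id`, `text`, `metadata`, `subjects`) and the example URIs are assumptions made for illustration only; the Annif Wiki documents the actual schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: the field names are illustrative assumptions,
# not the confirmed Annif corpus schema.
records = [
    {
        "document_id": "doc-001",
        "text": "Automated subject indexing of library materials.",
        "metadata": {"title": "Subject indexing", "abstract": "A short overview."},
        "subjects": [{"uri": "http://example.org/concept/1"}],
    },
    {
        "document_id": "doc-002",
        "text": "Evaluating suggestions against a gold standard.",
        "metadata": {"title": "Evaluation"},
        "subjects": [{"uri": "http://example.org/concept/2"}],
    },
]


def write_jsonl(path, docs):
    """Write one JSON object per line (the JSON Lines convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


corpus_path = Path(tempfile.mkdtemp()) / "corpus.jsonl"
write_jsonl(corpus_path, records)
loaded = read_jsonl(corpus_path)
print(len(loaded), loaded[0]["metadata"]["title"])
```

Keeping one document per line means a corpus can be streamed record by record without loading the whole file into memory.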

It is now possible to exclude and include subjects from a vocabulary. Excluding individual concepts can be useful in cases where algorithms frequently produce incorrect subject suggestions. Using exclude and include rules, it is also possible to define more specialized projects that operate on only one type or class of concepts.

Several improvements have been made to the REST API, including exposing vocabulary information via the vocabs method and disabling the learn method by default (controlled by the allow_learn setting in the NN ensemble backend).

The annif index command can now be used on short-text corpus formats (TSV, CSV or JSON Lines) in addition to full text formats (TXT+TSV or JSON). In the case of short-text formats, output including the suggested subjects and their scores is produced in JSON Lines format.

The hyperopt command has been enhanced to better support parallel processing on multiple CPU cores, which can significantly reduce overall processing time.

This release also adds support for Python 3.13, ensuring compatibility with the latest Python version. Furthermore, the tfidf backend has been refactored to eliminate the dependency on gensim, which addresses compatibility issues and simplifies the codebase. Support for Python 3.9 has been dropped. Various maintenance updates and bug fixes are included, such as resolving warnings related to Click and upgrading many libraries to more recent versions.

Special thanks to the German National Library (DNB) EMa team (@c-poley, @RietdorfC, @san-uh) for their work on proposing, specifying and testing the new features in this release!

Supported Python versions:

3.10, 3.11, 3.12 and 3.13

Backward compatibility:

⚠️ tfidf projects trained with Annif 1.3 or older need to be retrained.
For other projects, the warnings by SciKit-learn are harmless.
⚠️ This is very likely the last Annif minor release to support the current fasttext backend, because the original fastText library is no longer maintained and there are compatibility issues with other libraries. We are looking for alternative implementations of fasttext.

Enhancements:
#875/#876 Add JSONL short text corpus format
#872/#868 Support metadata in fulltext corpus format / JSON fulltext corpus format
#886/#885 Support document_id in JSON(L) and CSV corpus formats & JSONL output
#889/#639/#877 Support for all corpus formats in annif index CLI command
#863/#140 Flexible fusion part 1: CSV short-text document corpus format
#864 Flexible fusion part 2: core functionality
#866 Flexible fusion part 3: CLI suggest option for additional metadata
#867 Flexible fusion part 4: REST API document metadata support
#844/#846 Support exclude/include rules for vocabulary concepts
#735/#840 Support subject exclusion / Dealing with overrepresented concepts / denylisting
#839/#837 Expose vocabulary information via REST API
#843 Disable /learn REST API method by default
#688/#873 Parallel hyperparameter optimization using multiple CPU cores

Maintenance:
#878 Remove gensim dependency in tfidf backend
#871 Update dependencies for 1.4 release
#890 Use NumPy 2 compatible fastText fork
#849 Drop Python 3.9 support
#850/#869 Support Python 3.13
#884 Upgrade to Poetry 2.0 / Resolve Poetry deprecation warnings
#848 Resolve DeprecationWarning: avoid use of datetime.utcfromtimestamp
#852/#891 Bump GitHub Actions versions

Fixes:
#882 Resolve UserWarning: The parameter --verbosity... for annif list-* CLI commands
#847 Add superclass constructor call to LMDBSequence, to prevent TensorFlow warning
#874 JSON corpus bugfix: avoid parsing subjects in annif index
#887 Fix slow annif train JSONL test & avoid slow jsonschema import

This is a patch release that fixes a bug that arose after the Annif 1.3.0 release.

Bug fixes
#838 Fix Swagger UI 500 error Failed to load API definition.

This release introduces a new EstNLTK analyzer, improves the performance of the MLLM backend and fixes minor bugs.

The key enhancement of this release is a new analyzer for lemmatization using EstNLTK, which supports the Estonian language. This analyzer needs to be installed separately; see the Optional features and dependencies page in the Wiki. Note that the indirect dependencies of EstNLTK are quite large, requiring around 500 MB of libraries.

Another improvement is the optimization of the ambiguity feature calculation in the MLLM algorithm. Previously, the calculation could be slow, especially when dealing with a large number of matches when using a large vocabulary such as GND. This optimization addresses the quadratic nature of the ambiguity calculation, and is expected to greatly reduce the processing time of some documents.

This release also includes maintenance updates and bug fixes. The file permissions issue, where Annif did not adhere to the umask setting for data files, has been resolved, thus easing Annif use in multiuser environments.
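To make the umask behaviour concrete, here is a minimal sketch of how the process umask determines the permissions of a newly created file on a POSIX system. It is a generic illustration, not Annif's actual file-saving code; the helper name `create_with_umask` and the file name `model.bin` are hypothetical.

```python
import os
import stat
import tempfile


def create_with_umask(path, umask):
    """Create an empty file under a temporary umask and return its permission bits."""
    previous = os.umask(umask)  # os.umask returns the previous mask
    try:
        # Request rw for everyone; the kernel clears the bits set in the umask.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        os.close(fd)
    finally:
        os.umask(previous)  # always restore the original mask
    return stat.S_IMODE(os.stat(path).st_mode)


workdir = tempfile.mkdtemp()
# With umask 027, group write and all "other" permissions are masked out.
mode = create_with_umask(os.path.join(workdir, "model.bin"), 0o027)
print(oct(mode))
```

In a multiuser setup, a group-friendly umask such as 002 or 027 then lets other members of the same group read the data files.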

Supported Python versions:

3.9, 3.10, 3.11, and 3.12

Backward compatibility:

Projects trained with Annif v1.2 continue to work.

Enhancements
#818/#831 Add a new EstNLTK analyzer
#822/#825/#834 Optimize MLLM ambiguity calculation to resolve slow processing of specific documents. Thanks to @RietdorfC (DNB) for reporting the issue and testing the optimized code.
#820 Smarter initialization of optional analyzers

Maintenance
#833 Update dependencies for v1.3 release
#821/#830 Bump the github-actions versions

Bug Fixes
#828 Fix Docker image builds with Poetry 2.0
#832/#829 Ensure file permissions respect the umask setting

This is a patch release that fixes a bug that arose after the Annif 1.2.0 release.

Bug fixes
#823 Resolve 413 Client Error: Request Entity Too Large for url errors that arose for requests to the /suggest method when the request body exceeded 500 KB

This release introduces language detection capabilities in the REST API and CLI, improves 🤗 Hugging Face Hub integration, and also includes the usual maintenance work and minor bug fixes.

The new REST API endpoint /v1/detect-language expects POST requests that contain a JSON object with the text whose language is to be analyzed and a list of candidate languages. Similarly, the CLI has a new command annif detect-language. Annif projects are typically language specific, so a text of a given language needs to be processed with a project intended for that language; the language detection feature can help in this. For details see this Wiki page. The language detection is performed with the Simplemma library by @adbar et al.
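A request to the new endpoint might be built along these lines. This is a sketch under stated assumptions: the JSON keys `text` and `languages` and the base URL `http://localhost:5000` are assumptions, so check the API description served by your Annif instance for the exact schema. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumed base URL of a locally running Annif instance (hypothetical).
BASE_URL = "http://localhost:5000"


def build_detect_language_request(text, candidates, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/detect-language.

    The JSON keys "text" and "languages" are assumptions for illustration.
    """
    payload = json.dumps({"text": text, "languages": candidates}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/detect-language",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_detect_language_request(
    "Tämä on suomenkielinen lause.", ["fi", "sv", "en"]
)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The detected language could then be used to pick the project to call, since projects are typically language specific.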

The annif download command has a new --trust-repo option, which needs to be used if the repository to download from has not been used previously (that is, if the repository does not appear in the local Hugging Face Hub cache). This option is introduced to raise awareness of the risks of downloading projects from the internet; projects should only be downloaded from trusted sources. For more information see the Hugging Face Hub documentation.

This release also includes automation of downloading the NLTK datapackage used for tokenization to simplify Annif installation. Maintenance tasks include upgrading dependencies, including a new version of Simplemma that allows better control over memory usage. The bug fixes include restoring the --host option of the annif run command.

Python 3.12 is now fully supported (previously NN-ensemble and STWFSA backends were not supported on Python 3.12).

Supported Python versions:

3.9, 3.10, 3.11 and 3.12

Backward compatibility:

NN ensemble projects trained with Annif v1.1 or older need to be retrained.
For other projects, the warnings by SciKit-learn are harmless.

Enhancements

#659/#799/#800/#801/#802 Language detection in REST API and CLI
#779 Python 3.12 support
#790/#793 Automatically add metadata to Hugging Face Hub repos when uploading projects
#809 Make field widths variable in the projects list of the Hugging Face Hub Model Card
#803 Automate NLTK datapackage punkt_tab download
#807 Add --trust-repo option to download CLI command

Maintenance

#724 Upgrade Simplemma & limit its memory usage
#796 Update dependencies for 1.2 release
#797/#811 Bump the github-actions versions
#805 Upgrade Docker baseimage to Python 3.12

Bug fixes

#788 Add --host option to annif run (credit: @dwinston)
#792 Fix limit parameter not passed to requests by HTTP backend
#808 Fix missing Hugging Face Hub token from preupload_lfs_files() parameters

This release introduces CLI commands to share projects via Hugging Face Hub, takes care of various maintenance tasks and fixes minor bugs.

The 🤗 Hugging Face Hub intends to facilitate the sharing of AI models and datasets, and now Annif CLI includes upload and download commands, which can be used to push and pull a set of selected projects and vocabularies to and from a Hugging Face Hub repository. In this release these commands are regarded as experimental; they may change in the future. See this Wiki page for more information about the commands. See also this Hugging Face Hub collection which contains the projects served at Finto AI.

The Connexion dependency is upgraded to Connexion 3. From now on, when running Annif with Gunicorn, it is required to use Uvicorn workers; the workers can be set using the option --worker-class uvicorn.workers.UvicornWorker, see the Connexion 3 documentation for more details. However, Docker image users do not have to add this option because an environment variable in the Docker image sets the worker class. Two changes due to the upgrade to Connexion 3 relate to the REST API:

the header Access-Control-Allow-Origin: * is now included in the response only if there's an Origin header in the request, whereas previously it was sent even if the Origin header was not present in the request,
the URL /v1/projects/ used to give a 404 response, but now it redirects to the correct URL /v1/projects.
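The Gunicorn worker requirement mentioned above can also be expressed in a Gunicorn configuration file instead of on the command line. The following is a minimal sketch: `worker_class` is a standard Gunicorn setting, while the `bind` and `workers` values are illustrative assumptions to adapt to your deployment.

```python
# gunicorn.conf.py
# Gunicorn reads its settings from module-level names in this file.

# Required when serving Annif with Connexion 3; equivalent to
# --worker-class uvicorn.workers.UvicornWorker on the command line.
worker_class = "uvicorn.workers.UvicornWorker"

# Illustrative assumptions; adjust to your environment.
bind = "127.0.0.1:8000"
workers = 2
```

The file can be passed to Gunicorn with its -c option.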

Support for Python 3.8 is removed. Python 3.12 is supported except for NN-ensemble and STWFSA backends.

It is now possible to select the projects that Annif loads on startup using the environment variable ANNIF_PROJECTS_INIT, which can be useful in container environments as this allows distributing resource demand across multiple Annif processes.

Supported Python versions

3.9, 3.10 and 3.11 are fully supported
3.12 is supported except NN-ensemble and STWFSA backends

Backward compatibility

NN ensemble projects trained with Annif v1.0 or older need to be retrained; for other projects the warnings by SciKit-learn are harmless
When using Annif with Gunicorn HTTP server the worker class needs to be set to Uvicorn with the option --worker-class uvicorn.workers.UvicornWorker

Enhancements
#762/#760 Implement annif upload and annif download commands for Hugging Face Hub integration
#774/#733 Allow loading selected projects using environment variable
#736 Optimization: load a vocabulary only once even if used in different languages
#745 Show Annif version in WebUI
#751 Create SECURITY.md

Maintenance
#702/#689/#698 Upgrade to Connexion3
#780 Add partial Python 3.12 support
#770 Drop Python 3.8 support
#771/#786 Update dependencies for v1.1 release
#739 Harden GitHub Actions
#781 Make Dependabot group GitHub Actions updates into one PR
#740-#744/#750/#757/#758/#763-#766/#783 Upgrade GitHub Actions

Bug fixes
#784/#785 Add informational error message for failed loading of nn-ensemble model
#732 Fix: Add missing completion command to commands list in RTD
#773 Fix blocked http-request for version number on https site
#778 Fix project data files detection
#752 Fix tests error due to pinned Schemathesis version 3.19.* / Docker rebuild
#759 Fix installation on Python 3.8 due to missing Tensorflow-io wheel
#767 Fix tests and Docker rebuild due to defunct Schemathesis and pytest dependencies resolution
#768 Fix ReadTheDocs builds by upgrading docs build dependencies

This is a patch release that fixes bugs that arose after the Annif 1.0.1 release.

Bugs fixed:
#759 Fix installation on Python 3.8 due to missing Tensorflow-io wheel
#767 Fix tests and Docker rebuild due to defunct Schemathesis and pytest dependencies resolution

This is a patch release that fixes a bug that arose after the Annif 1.0 release.

The bug affected only running unit tests, but the side-effect was that it also prevented rebuilding the Docker image of version 1.0.

Bugs fixed:
#747/#752 Tests error due to pinned schemathesis version 3.19.* / Docker rebuild fails

We are excited to introduce Annif version 1.0!

Advancing the version number to the 1.x series means that Annif is considered ready for more general, production use. The upcoming releases in the series (patches 1.0.x and minor feature releases 1.x.x) will be backward compatible, following the semantic versioning principle. See a Wiki page describing the aspects of the compatibility.

The changes in this release include enhancements to the command-line interface as well as many bug fixes and maintenance updates. The CLI commands, options and most parameters can now be tab-completed when the support is enabled: see instructions in README.md. Also the CLI startup time has been optimized, and the output of many commands has been refined.

Python 3.11 is now mostly supported; the Omikuji backend cannot yet be used on Python 3.11 because the Omikuji library does not support it at the moment.

From now on the Docker image of the latest release in the quay.io repository is going to be rebuilt from time to time in order to apply security updates to the image. The rebuilds will not change Annif itself. Version tags (<major>.<minor>[.<patch>]) can be used to reference the latest build of the version. To allow more strict pinning to a particular build, the images will also be tagged with the build date as a suffix: <major>.<minor>.<patch>-<YYYYMMDD>.
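The tagging scheme described above can be sketched as a small formatter and parser. The function names and the example build date are hypothetical; only the tag patterns `<major>.<minor>[.<patch>]` and `<major>.<minor>.<patch>-<YYYYMMDD>` come from the text.

```python
import re
from datetime import date

# <major>.<minor>[.<patch>] with an optional -<YYYYMMDD> suffix,
# which is only allowed when the patch number is present.
TAG_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+)(?:-(\d{8}))?)?$")


def build_tag(major, minor, patch, build_date):
    """Format a build-pinned tag such as 1.0.2-20231107."""
    return f"{major}.{minor}.{patch}-{build_date:%Y%m%d}"


def parse_tag(tag):
    """Split a tag into its parts; patch and build_date may be None."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a recognised image tag: {tag!r}")
    major, minor, patch, stamp = match.groups()
    return {
        "major": int(major),
        "minor": int(minor),
        "patch": int(patch) if patch is not None else None,
        "build_date": stamp,  # kept as a YYYYMMDD string, or None
    }


# Hypothetical build date, for illustration only.
pinned = build_tag(1, 0, 2, date(2023, 11, 7))
print(pinned)
print(parse_tag(pinned))
print(parse_tag("1.0"))
```

Pinning to a dated tag gives reproducible deployments, while the plain version tag follows the latest security rebuild of that version.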

Supported Python versions:

3.8, 3.9 and 3.10 are fully supported
3.11 is supported except Omikuji backend

Backward compatibility:

MLLM, STWFSA and NN ensemble projects trained with Annif v0.61 or older need to be retrained; for other projects the warnings by SciKit-learn are harmless
Using STWFSA backend now requires installing an optional dependency

New features:
#684/#693 Support for CLI command completions
#703/#727 Python 3.11 support

Improvements:
#696 Optimize CLI startup time
#686/#694 Improve outputs of project inspection CLI commands
#704 Show scores in outputs of suggest, eval and index with only 4 decimals

Maintenance:
#690/#708 Use Python type hints
#699/#700 Make stwfsapy an optional dependency (credit: @cbartz)
#315/#712/#714 Add CI/CD job for testing Docker image
#707/#711 Ensure system packages are up-to-date in Docker image
#715 Add CI/CD workflow for rebuilding Docker image
#706/#725 Test CLI startup time with CI/CD job
#723 Update ReadTheDocs documentation
#726/#697/#532 Update and pin dependencies v1.0
#730 Switch to Keras v3 save format for nn_ensemble
#731 Upgrade Docker baseimage to Debian Bookworm

Bug fixes:
#705 Fix crashing index command when targeted directory contains subject files
#717 Fix Python version in GitHub Actions CI/CD pipeline
#718 Fix missing limit parameter in STWFSA backend
#722 Fix train state and modification time for unfinished project training
#720/#721 Suppress TensorFlow info messages to debug level
#695 Fix displaying of modification time for null value in Web UI project information
#701 Remove duplicated fasttext entry in optional dependencies list in Dockerfile
#728 Avoid PytestUnknownMarkWarning due to "slow" marker
#729 Avoid scikit-learn UserWarning for vectorizer parameter token_pattern

Other:
#616 Discussion on semantic versioning for Annif releases beyond 1.0

Releases: NatLibFi/Annif

Annif 1.4.1
Annif 1.4
Annif 1.3.1
Annif 1.3
Annif 1.2.1
Annif 1.2
Annif 1.1
Annif 1.0.2
Annif 1.0.1
Annif 1.0