
Adding a sections endpoint to the COSMOS service. No changes to the core COSMOS workflow.

Adding initial watermark removal

Includes some of the build updates, CPU fixes, and a reasonable Docker image base.

The previous base image was deprecated. Switching to nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 as base.

New:

Inclusion of table extraction (via --extract-tables option on ingest_documents script)
HTCosmos - run COSMOS pipeline in a high-throughput mode on an HTCondor cluster

Table context enrichment during ingestion. Enabling (via the --use-table-context-enrichment option on the ingest CLI) will match detected tables to mentions within the body text, adding a context_from_text field to the output parquet.
The retrieval API has been updated to search either:
- local_content field (default) - the text content of the table and its associated caption, if any
- full_content field - local_content plus context_from_text
- Any of the three fields separately (content, caption_content, context_from_text)
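The relationship between these fields can be sketched as follows. This is only an illustration, not the COSMOS retrieval code: the field names come from the notes above, while the `TableRecord` class, the `search` helper, the newline join order, and the substring match (a stand-in for whatever search backend the API actually uses) are all assumptions made for the example.

```python
from dataclasses import dataclass


def _join(*parts: str) -> str:
    """Join the non-empty parts with newlines (join order is an assumption)."""
    return "\n".join(p for p in parts if p)


@dataclass
class TableRecord:
    # Hypothetical container; field names follow the release notes.
    content: str
    caption_content: str = ""
    context_from_text: str = ""

    @property
    def local_content(self) -> str:
        # Table text plus its caption, if any.
        return _join(self.content, self.caption_content)

    @property
    def full_content(self) -> str:
        # local_content plus the context matched from the body text.
        return _join(self.local_content, self.context_from_text)


def search(records, query, field="local_content"):
    """Naive case-insensitive substring search over one field (illustrative only)."""
    q = query.lower()
    return [r for r in records if q in getattr(r, field).lower()]


records = [
    TableRecord(
        content="Sample A 3.2",
        caption_content="Table 1: Yields",
        context_from_text="As shown in Table 1, the reactor output varies.",
    )
]

# "reactor" appears only in the body-text context, so the default field misses it.
default_hits = search(records, "reactor")
full_hits = search(records, "reactor", field="full_content")
```

Under these assumptions, searching the default field keeps results tied to the table and its caption, while `full_content` trades some precision for recall by also matching the surrounding prose.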
Text normalization. Enabling (via the --use-text-normalization option on the ingest CLI) will apply basic Unicode normalization to regularize ligatures and repair mojibake in the text layer.
ASKE-ID lookup within the retrieval API.

New weights including a newer set of annotations
Added a few necessary files for training detection + postprocessing.
API key requirement added (though currently disabled)
Document level lookups and filters
Filter by dataset_id
Store and filter on object size
Concatenate the contents and header_content fields into one full_contents field and use that for retrieval

Modular pipeline with new workflow definitions, cli, unicode (#122)
Initial entity linking using SciSpacy (#135)
- Entity recognition + linking to UMLS entities
Initial semantic context for tables (#137)
(ongoing) documentation to match

Releases: UW-COSMOS/Cosmos