Releases: UW-COSMOS/Cosmos
v0.8.1
v.0.8.0
v0.7.1
v0.7.0
Change base image
The previous base image was deprecated. Switching to nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04 as base.
v0.6.1 - Minor table extraction fix
- Fixed a bug where empty parquet files stopped all table extraction processing.
Table extraction, HTCosmos
New:
- Inclusion of table extraction (via the --extract-tables option on the ingest_documents script)
- HTCosmos: run the COSMOS pipeline in a high-throughput mode on an HTCondor cluster
Table context enrichment, text normalization, and fixes
- Table context enrichment during ingestion. Enabling it (via the --use-table-context-enrichment option on the ingest CLI) will match detected tables to mentions within the body text, adding a context_from_text field to the output parquet.
- The retrieval API has been updated to search either:
  - local_content field (default): the text content of the table and its associated caption, if any
  - full_content field: local_content plus context_from_text
  - Any of the three fields separately (content, caption_content, context_from_text)
- Text normalization. Enabling it (via the --use-text-normalization option on the ingest CLI) will do basic Unicode normalization to regularize ligature usage and fix mojibake issues in the text layer.
- ASKE-ID lookup within the retrieval API.
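As a rough illustration of what ligature and mojibake cleanup can look like, here is a minimal standard-library sketch. It is not the COSMOS implementation, which these notes do not describe. The function names `fix_mojibake` and `normalize_text` are hypothetical, and the mojibake step assumes the common case of UTF-8 text that was wrongly decoded as Windows-1252.

```python
import unicodedata


def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was mistakenly decoded as Windows-1252.

    Assumption: the whole string is either affected or not. If the round
    trip fails, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalize_text(text: str) -> str:
    """Repair mojibake first, then fold ligatures with NFKC normalization."""
    # NFKC maps compatibility characters such as U+FB01 (the "fi" ligature)
    # to their plain equivalents.
    return unicodedata.normalize("NFKC", fix_mojibake(text))


if __name__ == "__main__":
    print(normalize_text("\ufb01nding"))        # ligature folded to plain letters
    print(normalize_text("caf\u00c3\u00a9"))    # mojibake repaired
```

The mojibake repair runs before NFKC because NFKC can rewrite some of the characters that the round trip depends on. A string that mixes ligatures and mojibake is only partially repaired by this whole-string heuristic.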
v0.4.0 - New weights; retrieval API updates
- New weights including a newer set of annotations
- Added a few necessary files for training detection + postprocessing.
- API key requirement added (though currently disabled)
- Document level lookups and filters
- Filter by dataset_id
- Store and filter on object size
- Concatenate the contents and header_content fields into one full_contents field and use that for retrieval
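The field concatenation in the last item could be sketched roughly as follows. This is an illustration only: the helper name `build_full_contents`, the dict-based record, the header-first ordering, and the newline separator are all assumptions, not details taken from the COSMOS code.

```python
from typing import Any, Dict


def build_full_contents(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with a combined full_contents field.

    Assumptions: header text comes first, parts are joined by a newline,
    and missing or blank fields are skipped.
    """
    parts = [record.get("header_content"), record.get("contents")]
    full = "\n".join(p.strip() for p in parts if p and p.strip())
    return {**record, "full_contents": full}


if __name__ == "__main__":
    obj = {"contents": "Body text.", "header_content": "Section 2"}
    print(build_full_contents(obj)["full_contents"])
```

Indexing a single combined field means a retrieval query only has to match against one text field per object.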