Conversation

@kba
Member

@kba kba commented Nov 17, 2023

There are use cases where it is beneficial to operate on only a subset of the file groups of an OCR-D workspace.

This PR introduces an inclusion/exclusion mechanism that filters which fileGrps are considered by OcrdMets.find_files.

Example for cloning

ocrd -l DEBUG workspace clone -Q PRESENTATION 'https://content.staatsbibliothek-berlin.de/dc/PPN680203753.mets.xml' --download -g '//PHYS_001.'

This downloads all files for the pages in the range PHYS_0010..PHYS_0019, except those in the PRESENTATION file group.
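To illustrate why the pattern after the `//` prefix selects exactly that page range, here is a minimal sketch. It assumes the pattern is applied as a regular expression against the whole page ID (full match); the page IDs are generated for illustration and are not taken from the actual METS.

```python
import re

# Pattern taken from the -g argument above, without the '//' regex prefix.
# Assumption: it must match the complete page ID.
pattern = re.compile(r"PHYS_001.")

# Hypothetical page IDs PHYS_0001..PHYS_0030, for illustration only.
page_ids = [f"PHYS_{n:04d}" for n in range(1, 31)]

# Keep only the page IDs that the pattern matches in full.
selected = [page_id for page_id in page_ids if pattern.fullmatch(page_id)]

print(selected[0], selected[-1], len(selected))
```

The trailing `.` matches any single character, so the ten IDs PHYS_0010 to PHYS_0019 are selected.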

Example: workspace bagger

E.g. to only include the DEFAULT and TESS fileGrps:

ocrd zip bag -q TESS -q DEFAULT

Or to include all fileGrps except PRESENTATION (which has the goobi URLs in our collections):

ocrd zip bag -Q PRESENTATION

Programmatically

Programmatically, the relevant kwargs are include_fileGrp and exclude_fileGrp.
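The intended filter semantics can be sketched in plain Python. This is only an illustration under stated assumptions: `filter_file_groups` is a hypothetical helper, not part of the OCR-D API, and it assumes that an empty include list means "all fileGrps" and that exclusions are applied on top. Only the kwarg names `include_fileGrp` and `exclude_fileGrp` come from this PR.

```python
from typing import Iterable, List, Optional


def filter_file_groups(
    file_groups: Iterable[str],
    include_fileGrp: Optional[List[str]] = None,
    exclude_fileGrp: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: return the fileGrps that pass the filter.

    Assumptions (not confirmed by the PR text):
    - no include list means every fileGrp is a candidate
    - a fileGrp on the exclude list is always dropped
    """
    selected = []
    for grp in file_groups:
        if include_fileGrp and grp not in include_fileGrp:
            continue  # not on the include list
        if exclude_fileGrp and grp in exclude_fileGrp:
            continue  # explicitly excluded
        selected.append(grp)
    return selected


groups = ["DEFAULT", "TESS", "PRESENTATION"]

# Mirrors `ocrd zip bag -q TESS -q DEFAULT`
only_included = filter_file_groups(groups, include_fileGrp=["TESS", "DEFAULT"])

# Mirrors `ocrd zip bag -Q PRESENTATION`
all_but_excluded = filter_file_groups(groups, exclude_fileGrp=["PRESENTATION"])
```

With the example groups, both calls leave DEFAULT and TESS and drop PRESENTATION.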

Related

fixes #356
related #582
fixes #506
fixes #383

@kba kba requested a review from MehmedGIT November 17, 2023 15:09
@kba kba linked an issue Nov 17, 2023 that may be closed by this pull request
@kba kba changed the title from "Bagger filegrp filter" to "Generic support for fileGrp whitelist/blacklist" Nov 20, 2023
@kba kba force-pushed the bagger-filegrp-filter branch from d81543f to 2c1faec Compare November 20, 2023 13:37
Contributor

@MehmedGIT MehmedGIT left a comment

Looks good. I have not tested this using the CLI, but the tests already cover that.

Badly copy/pasted, hooray for code reviews

Co-authored-by: Mehmed Mustafa <[email protected]>
@kba kba merged commit b32f776 into master Nov 23, 2023
@kba kba deleted the bagger-filegrp-filter branch November 23, 2023 12:18
@kba kba mentioned this pull request Nov 23, 2023
@bertsky
Collaborator

bertsky commented Dec 8, 2023

@kba IIUC this does not fix #506, because it merely restricts what is downloaded/zipped, not what is actually kept in the METS (as originally intended). The former also has its use-cases, so this might still be useful, but let's look at the original use-cases again:

This could be useful for:

  • sharing debug data
  • creating test data from real-life workflows
  • publishing / exporting final results
  • support for similar filtering in workspace bagging

You would still see the full refs in the METS, despite the filter. If these are URLs, then they would still be fully downloaded (at runtime) by our processors. Otherwise (i.e. if they are merely local paths), the missing refs will be useless, because there is no way of reconstructing their file content. (So why keep them in the METS anyway?)
