@bertsky
Collaborator

@bertsky bertsky commented Dec 6, 2025

Use cases for ocrd-command include pretty much every tool that can process PAGE files, e.g.:

ocrd-command -P command "page-fix-coordinates @INFILE - | sed -e 's/ id=\"/ id=\"id/' -e 's/regionRef=\"/regionRef=\"id/' | transkribus-to-prima -V - @OUTFILE" -I GT -O GT-USABLE

(Building on tools from https://github.com/kba/transkribus-to-prima. The sed command just ensures that segment identifiers are valid XML IDs, which is not always the case in Transkribus.)
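To see what the sed step does in isolation, here is a minimal sketch. The function name `prefix_ids` and the two sample lines are made up for illustration and are not taken from real Transkribus output. It prepends `id` to the values of `id` and `regionRef` attributes, since an XML ID may not start with a digit.

```shell
#!/bin/sh
# Hypothetical helper wrapping the two sed expressions from the pipeline above.
# Without the g flag, sed rewrites only the first match on each line.
prefix_ids() {
    sed -e 's/ id="/ id="id/' -e 's/regionRef="/regionRef="id/'
}

# Made-up sample lines (assumption: one attribute of each kind per line).
prefix_ids <<'EOF'
<TextRegion id="1234">
<RegionRefIndexed index="0" regionRef="1234"/>
EOF
```

Both attribute values come out as `id1234`, so the reading order still points at the renamed region. The `index` attribute is untouched because the first expression requires a space directly before `id=`.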

ocrd-command -P command "page-lines2orientation @INFILE > @OUTFILE" ...
ocrd-command -P command "page-header2unordered @INFILE > @OUTFILE" ...

(Building on tools from https://github.com/bertsky/workflow-configuration. The first adds @orientation to pages by measuring the average slope of the annotated lines. The second slices up the ReadingOrder into UnorderedGroups at every @header region it encounters.)

ocrd-command -P command "java -jar /usr/local/share/PageConverter.jar -source-xml @INFILE -convert-to LATEST -target-xml @OUTFILE"

(Building on https://github.com/PRImA-Research-Lab/prima-page-converter, which can convert between PAGE namespace versions.)

@bertsky bertsky changed the title add builtin processor ocrd-command add builtin processors ocrd-command and ocrd-merge Dec 6, 2025
@bertsky
Collaborator Author

bertsky commented Dec 6, 2025

Use cases for the ocrd-merge processor, e.g.:

  • simply renaming segment IDs (to make them valid or more readable) by running on a single input fileGrp:

      ocrd-merge -I OCR-D-OLD-IDs -O OCR-D-NEW-IDs
    
  • trivially overlaying GT and prediction (showing both reading orders in a top unordered group)

  • recombining pages that have been split up by their Border into fileGrps for left and right, e.g.

      ocrd-anybaseocr-crop -P rulerAreaMax 0 -P marginLeft 0.2 -P marginRight 0.4 -I ORIG -O CROP-L
      ocrd-anybaseocr-crop -P rulerAreaMax 0 -P marginLeft 0.6 -P marginRight 0.8 -I ORIG -O CROP-R
      ocrd-paddleocr-segment -I CROP-L -O SEG-L ...
      ocrd-paddleocr-segment -I CROP-R -O SEG-R ...
      ocrd-tesserocr-recognize -I SEG-L -O OCR-L ...
      ocrd-tesserocr-recognize -I SEG-R -O OCR-R ...
      ocrd-merge -I OCR-L,OCR-R -O OCR
    
  • recombining pages that have been split up into fileGrps to reduce image size (say 2x2 patches of huge tables or newspapers)
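
The left/right recipe above repeats the same three steps per side, so it can be written as a loop. The following is only a dry-run sketch: the function name `print_split_workflow` is made up, the commands are printed rather than executed, and the trailing `...` stands for the processor parameters elided above.

```shell
#!/bin/sh
# Dry run: print (not execute) the commands of the left/right recipe.
# Margin values and fileGrp names are the ones quoted in the recipe above.
print_split_workflow() {
    for side in "L 0.2 0.4" "R 0.6 0.8"; do
        # Split the triple: $1 = fileGrp suffix, $2 = marginLeft, $3 = marginRight
        set -- $side
        echo "ocrd-anybaseocr-crop -P rulerAreaMax 0 -P marginLeft $2 -P marginRight $3 -I ORIG -O CROP-$1"
        echo "ocrd-paddleocr-segment -I CROP-$1 -O SEG-$1 ..."
        echo "ocrd-tesserocr-recognize -I SEG-$1 -O OCR-$1 ..."
    done
    # A single merge step recombines both sides into one output fileGrp.
    echo "ocrd-merge -I OCR-L,OCR-R -O OCR"
}

print_split_workflow
```

The commands are grouped per side here rather than per step, which does not change the result because the two sides are independent until the final ocrd-merge call.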

Member

@kba kba left a comment

LGTM! For out-of-the-box UX and testing, would it make sense to also bundle (some? all?) of the page processing scripts you developed for ocrd-command?

@bertsky
Collaborator Author

bertsky commented Dec 9, 2025

For out-of-the-box UX and testing, would it make sense to also bundle (some? all?) of the page processing scripts you developed for ocrd-command?

Yes. I'll add

  • a respective addition to the readme
  • a few preset jsons for the builtin processors
  • a few tests for the new builtin processors

@bertsky
Collaborator Author

bertsky commented Dec 10, 2025

Is this sufficient documentation in your opinion, @kba?

(Perhaps the actual workflow recipes should go into the workflow guide – there is only so much you can do here without mentioning/depending on other tools and actual problems...)

@kba
Member

kba commented Dec 10, 2025

Is this sufficient documentation in your opinion, @kba?

(Perhaps the actual workflow recipes should go into the workflow guide – there is only so much you can do here without mentioning/depending on other tools and actual problems...)

Yes, excellent, many thanks. I think the list of presets is a good starting point: they are documented and bundled. Of course, the workflow guide sorely needs an update and would benefit from this, but that should not be a prerequisite for this PR.

Merging and releasing now.

@kba kba merged commit b8ea223 into OCR-D:master Dec 10, 2025
16 checks passed
@kba kba deleted the add-builtin-processors branch December 10, 2025 15:13