@cneud cneud commented May 23, 2022

This PR replaces and supersedes 154 and 155, since the changes and discussion there became fragmented and difficult to review.

It integrates all additions from both PRs into a new mets.md with this structure:

  1. Metadata
    1.1 Unique ID for the document processed
    1.2 Always use URL or relative filenames
    1.3 Recording processing information in METS
  2. Images
    2.1 If in PAGE then in METS
    2.2 Pixel density of images must be explicit and high enough
    2.3 No multi-page images
    2.4 Images and coordinates
  3. File group mets:fileGrp
    3.1 File Group @USE syntax
    3.2 File Group @USE="FULLDOWNLOAD_..."
  4. File mets:file
    4.1 File ID syntax
    4.2 @MIMETYPE syntax
  5. Grouping files by page mets:structMap
    5.1 Grouping files by page
    5.2 OCR-D mets:structMap
  6. Ranges of pages mets:structLink

I hope I have not missed anything and that this will allow us to soon integrate these changes.

@kba kba left a comment

LGTM. This changes the fragment URL to point to headings though, so I'll add (invisible) anchors for the previous URLs.

they changed the METS. This information is mainly for human consumption to get
an overview of the software agents involved in the METS file's creation. More
detailed or machine-actionable provenance information is outside the scope of
the processor.
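
To make the "overview of the software agents" in this excerpt more concrete, here is a minimal sketch of listing the `mets:agent` entries of a METS header with Python's standard library. The sample document, the agent name `example-processor v0.0.1` and the role `binarization` are made up for illustration, and `list_agents` is a hypothetical helper, not part of any OCR-D API.

```python
import xml.etree.ElementTree as ET

# Namespace of the METS schema.
NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical METS header for illustration only: the agent name and
# role below are invented, not taken from a real OCR-D workspace.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:metsHdr>
    <mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE"
                ROLE="OTHER" OTHERROLE="binarization">
      <mets:name>example-processor v0.0.1</mets:name>
    </mets:agent>
  </mets:metsHdr>
</mets:mets>
"""


def list_agents(mets_xml):
    """Return one dict per mets:agent found in the METS header."""
    root = ET.fromstring(mets_xml)
    agents = []
    for agent in root.findall("mets:metsHdr/mets:agent", NS):
        name = agent.findtext("mets:name", default="", namespaces=NS)
        agents.append({
            "name": name.strip(),
            # Prefer the free-text OTHER* attributes when they are set.
            "type": agent.get("OTHERTYPE") or agent.get("TYPE"),
            "role": agent.get("OTHERROLE") or agent.get("ROLE"),
        })
    return agents


for entry in list_agents(SAMPLE_METS):
    print(entry["name"], entry["type"], entry["role"])
```

Such a listing is only a human-readable overview; it does not recover the parameters a processor was run with.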
Collaborator

...outside the scope of the processor, or just the spec?

The wording about "machine-actionable" should be revisited along with #108 BTW.

Member

> ...outside the scope of the processor, or just the spec?

This goes way back to 2018 and then meant "the module projects do not have to implement this".

These days I tend to run processors or ocrd process with | tee <name-of-processor>.log to retain the logs. This is not machine-actionable, but it is more complete than either the mets:agent we add to METS or the pc:processingSteps we add to PAGE-XML. The latter comes pretty close to being machine-actionable, in the sense that it records the complete set of parameters.

We could add at least the processingStep mechanism to the specifications.
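
For anyone who wants the log-retention habit described above in script form, here is a minimal Python sketch of the same idea. `run_and_tee` is a hypothetical helper, not part of OCR-D. Unlike a plain `| tee`, which only sees stdout, this sketch also merges stderr into the log, which is an assumption about what one would want to keep.

```python
import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, echoing its output to the console and to log_path.

    cmd is an argument list; the exit code of the process is returned.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr, like `2>&1 | tee`
            text=True,
        )
        # Copy each output line to both destinations as it arrives.
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()
```

It would be called as `run_and_tee(cmd, "<name-of-processor>.log")`, where `cmd` is the processor's command line as a list of strings.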

Member Author

> We could add at least the processingStep mechanism to the specifications.

+1

@tboenig tboenig left a comment

See the typo with pdf -> tei;
otherwise I think it's good 👍

kba and others added 2 commits August 16, 2022 12:37
Co-authored-by: Robert Sachunsky <[email protected]>
@kba kba merged commit dc2341d into OCR-D:master Aug 16, 2022