make_file_id: no page_id number extraction #744
In make_file_id, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash). But before falling back to a purely numerical ID, additionally check whether the ID already contains the pageId: in that case, only append the output fileGrp to that ID.
(no numerical pageId extraction any more)
To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.
kba left a comment:
Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.
To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.
IIUC this seems unnecessary. Do you have an example of an ID where this might be pertinent?
"""https://github.com/OCR-D/core/pull/605"""
mets = OcrdMets.empty_mets()
f = mets.add_file('1:!GRP', ID='FOO_0001', pageId='phys0001')
f = mets.add_file('2:!GRP', ID='FOO_0001', pageId='phys0001')
We should probably disallow fileGrp names that start with a number: the resulting file ID could be an invalid xsd:ID, since an xsd:ID must not start with a digit.
Yes, if #746 kicks in, such tests should all fail...
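To illustrate the xsd:ID concern above, here is a minimal sketch of a validity check. It is not the check from #746 or from OCR-D core: the function name `looks_like_xsd_id` is hypothetical, and the pattern is a simplified ASCII-only approximation of the NCName rule (start with a letter or underscore, then letters, digits, `.`, `-` or `_`, no colon).

```python
import re

# Simplified, ASCII-only approximation of the NCName production that
# xsd:ID is based on. The real rule also admits many non-ASCII letters.
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def looks_like_xsd_id(candidate):
    """Return True if `candidate` passes the simplified xsd:ID check.

    Hypothetical helper for illustration only.
    """
    return _ASCII_NCNAME.fullmatch(candidate) is not None


# An ID derived from a fileGrp like '2:!GRP' fails on the leading digit
# (and on ':' and '!'), while a conventional OCR-D style ID passes.
print(looks_like_xsd_id("2:!GRP_0001"))
print(looks_like_xsd_id("OCR-D-BIN_0001"))
```

Under this approximation, the fileGrp names used in the test above ('1:!GRP', '2:!GRP') would yield IDs that fail the check, which matches the expectation that such tests should fail once fileGrp names are validated.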
Co-authored-by: Konstantin Baierer <[email protected]>
I recently saw it in Konzilsprotokolle GT, e.g.
In make_file_id, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash). But before falling back to a purely numerical ID, additionally check whether the input file's ID already contains the pageId: in that case, only append the output fileGrp to that ID (because it is sufficiently unique already).
(Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works. On the other hand, often the file IDs from non-OCRD data contain the pageId directly, in which case it's better to stick to that convention.)
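The decision order described above can be sketched roughly as follows. This is a standalone illustration, not the code in OCR-D core: the function `sketch_make_file_id`, its parameters, the `_` separator, the suffix position of the output fileGrp, and the four-digit zero padding are all assumptions made for the example. `sibling_ids` stands in for the IDs of the files in the input fileGrp, which the numerical fallback counts through.

```python
def sketch_make_file_id(file_id, page_id, input_grp, output_grp, sibling_ids):
    """Derive an output file ID from an input file ID (hypothetical sketch)."""
    # Case 1: the input ID contains the input fileGrp, so swap in the
    # output fileGrp and keep the rest of the ID unchanged.
    if input_grp in file_id:
        return file_id.replace(input_grp, output_grp)
    # Case 2: the input ID already contains the pageId, so it is unique
    # enough; only append the output fileGrp (separator is an assumption).
    if page_id and page_id in file_id:
        return f"{file_id}_{output_grp}"
    # Case 3: purely numerical fallback, using the 1-based position of
    # the input file among its siblings (padding width is an assumption).
    if file_id in sibling_ids:
        n = sibling_ids.index(file_id) + 1
    else:
        n = len(sibling_ids) + 1
    return f"{output_grp}_{n:04d}"


# The ID from the test above contains neither the fileGrp nor the pageId,
# so it ends up in the numerical fallback.
print(sketch_make_file_id("FOO_0001", "phys0001", "1:!GRP", "2:!GRP", ["FOO_0001"]))
# An ID that embeds the pageId keeps its convention.
print(sketch_make_file_id("FOO_phys0001", "phys0001", "OCR-D-IMG", "OCR-D-BIN", []))
```

Note that no number is extracted from the pageId in any of the three cases, which is the point of this change.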