Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@bertsky
Copy link
Collaborator

@bertsky bertsky commented Nov 12, 2021

In make_file_id, if the input file's ID does not contain the input fileGrp, then do not attempt to extract the numerical part of the pageId (which might still clash).

But before fallback to purely numerical ID, additionally check if the input file's ID does already contain the pageId: in that case,
only append the output fileGrp to that ID (because it is sufficiently unique already).

(Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works. On the other hand, often the file IDs from non-OCRD data contain the pageId directly, in which case it's better to stick to that convention.)

In make_file_id, if the input file's ID does not
contain the input fileGrp, then do not attempt 
to extract the numerical part of the pageId
(which might still clash); but before fallback to
purely numerical ID, additionally check if the ID
does already contain the pageId: in that case,
only append the output fileGrp to that ID.
@bertsky bertsky force-pushed the make-file-id-no-num-page-id branch from 7ba0454 to f96d3fd Compare November 12, 2021 18:27
(no numerical pageId extraction any more)
@bertsky
Copy link
Collaborator Author

bertsky commented Nov 24, 2021

Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.

To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.

Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Often file IDs have two numbers, one of which will clash. In that case only the numerical fallback works.

To keep that option alive, it would probably also work to just strip any non-numerical content. Let me know if this is preferable (with lower priority, right before fallback), and I'll add that change.

IIUC this seems unnecessary. Do you have an example of an ID where this might be pertinent?

"""https://github.com/OCR-D/core/pull/605"""
mets = OcrdMets.empty_mets()
f = mets.add_file('1:!GRP', ID='FOO_0001', pageId='phys0001')
f = mets.add_file('2:!GRP', ID='FOO_0001', pageId='phys0001')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably disallow filegroups starting with a number because the resulting ID might lead to an invalid xsd:ID because they mustn't start with a number.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if #746 kicks in, such tests should all fail...

@bertsky
Copy link
Collaborator Author

bertsky commented Dec 6, 2021

Do you have an example of an ID where this might be pertinent?

I recently saw it in Konzilsprotokolle GT, e.g. 1796-97_00000024_img, and Boernerianus, e.g. img100-10.

@kba kba mentioned this pull request Dec 7, 2021
@kba kba merged commit fe99986 into master Dec 8, 2021
@kba kba deleted the make-file-id-no-num-page-id branch December 8, 2021 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants