Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@kba
Copy link
Member

@kba kba commented Aug 6, 2024

As laid out in #563, the clobber_mets behavior was different from what the docstring said. Instead of raising and Exception if the METS file already existed and clobber_mets being False, the METS file was silently skipped.

The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.

@bertsky
Copy link
Collaborator

bertsky commented Aug 6, 2024

The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.

I tend to agree, but the discussion in #425 initially outlined a different scenario:

I do not understand the use-cases for clobber_mets=False, and I know of no example of it
The original idea was that you pass a METS URL which is only downloaded the first time it is to be used. Subsequent processors should not overwrite the METS every time they process a workspace.

However, you then followed up with the following explanation, which already speaks to the current line of thought:

The METS might not be downloaded yet. That's why we introduced the idea of a workspace which was initially just a crutch to make it easier for processors and we were careful not to use it as a first-class concept e.g. in our specs. Instead of handling the intricacies of downloading full/partial representations of the work, they now just had to act on a directory and files. Since then however, workspaces have become the standard way to interact with METS/Images/PAGE and most if not all the facilities to handle non-local data outside ocrd workspace has become obsolete.

So IIUC the initial scenario of always using workspace_from_url to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url function now entirely falls into the former category, and clobber_mets=False (still the default) without a failure/exception does not make sense anymore.

Copy link
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kba
Copy link
Member Author

kba commented Aug 7, 2024

So IIUC the initial scenario of always using workspace_from_url to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url function now entirely falls into the former category, and clobber_mets=False (still the default) without a failure/exception does not make sense anymore

Yeah, it is a bit messy due to poor initial planning and yes, the "clone once then work on the local workspace" approach is the default now, the "on demand" approach is largely obsolete. But I'd keep clobber_mets=False to avoid introducing a new source of mysterious behavior. I cannot really imagine a scenario where overwriting the METS during clone would be preferable to removing the workspace manually or cloning into a different directory but the flag/kwarg is there if people do need that.

@kba kba merged commit 1c8b9db into master Aug 7, 2024
@kba kba deleted the fix-563 branch August 7, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants