fix name vs subdir, add type checking for resource candidates #1315

bertsky · 2025-03-12T11:25:01Z

No description provided.


bertsky · 2025-03-12T11:34:55Z

I don't know how @kba originally conceived of this, but I was surprised to find that list_all_resources would normally descend into directories twice, yet return just the basename as the resource name. This caused Calamari models (which are directories that have subdirectories) to be split up. IMO, the only place that needs a recursive descent is the module directory. But then that subpath is part of the model name, because it is resolved against the module directory. I had to refactor a bit to properly discern the base path (the directory of the resource location) from the subpath (the resource name).

The rest is mostly the filtering of false positives via MIME types and suffixes.
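To illustrate the kind of filtering meant here, the following is a minimal sketch, not the code in ocrd_utils. The allow-list, the ignored suffixes, the file names and the helper name `is_candidate` are all assumptions made up for the example; only `mimetypes.guess_type` is a real standard-library call.

```python
import mimetypes
from pathlib import Path

# Hypothetical allow-list; a real processor would declare its own content types.
ALLOWED_MIMETYPES = {"application/json", "text/plain"}
# Hypothetical suffixes that should never be offered as resources.
IGNORED_SUFFIXES = {".py", ".pyc", ".md"}


def is_candidate(path, allowed=ALLOWED_MIMETYPES, ignored=IGNORED_SUFFIXES):
    """Return True if `path` looks like a resource rather than a false positive."""
    path = Path(path)
    if path.suffix in ignored:
        return False
    mimetype, _encoding = mimetypes.guess_type(path.name)
    # Unknown types are kept, so that model formats without a
    # registered MIME type are not filtered out by accident.
    return mimetype is None or mimetype in allowed


names = ["params.json", "README.md", "setup.py",
         "model.traineddata", "notes.txt", "logo.png"]
candidates = [name for name in names if is_candidate(name)]
print(candidates)
```

Keeping entries with an unknown MIME type is a deliberate choice in this sketch: rejecting them would drop most model files, which rarely have a registered type.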

Note: we still have to discuss where the log.info message for unregistered resources is more appropriate:

* in handle_resource when there is no size key – bertsky@0635b0f
* in add_to_user_database when it was established that this entry is new – 8b66e1b

bertsky · 2025-03-12T11:36:32Z

For the CI failure I believe we just need to update repo/assets from master...

MehmedGIT

note: we still have to discuss where the log.info message for unregistered resources is more appropriate:
* in handle_resource when there is no size key – bertsky@0635b0f
* in add_to_user_database when it was established that this entry is new – 8b66e1b

From my understanding, the database will become more consistent and avoid duplicate entries? If that is the case, I would prefer the logging to happen inside add_to_user_database.
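The second option can be sketched as follows. This is a simplified illustration under assumptions, not the ResourceManager code: the function `register_resource`, the dict-based database layout, the logger name and the executable name `ocrd-example` are all hypothetical.

```python
import logging

log = logging.getLogger("resmgr_sketch")  # hypothetical logger name


def register_resource(database, executable, res_name, url=None):
    """Add an entry for `res_name`; log and return True only if it is new."""
    entries = database.setdefault(executable, [])
    if any(entry["name"] == res_name for entry in entries):
        # Already registered: no duplicate entry and no log message.
        return False
    entries.append({"name": res_name, "url": url})
    log.info("%s resource '%s' is new, adding it to the user database",
             executable, res_name)
    return True


database = {}
first = register_resource(database, "ocrd-example", "model.bin")
second = register_resource(database, "ocrd-example", "model.bin")
```

Logging at the point where the entry is known to be new ties the message to the actual state change, so repeated calls for the same resource stay silent.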

MehmedGIT · 2025-03-12T13:22:25Z

src/ocrd/resource_manager.py

        return ret

-    def add_to_user_database(self, executable, res_filename, url=None, resource_type='file'):
+    def add_to_user_database(self, executable, res_name, res_filename, url=None, resource_type='file'):


Why do we need the additional parameter res_name if we can already get it from res_filename? If it is for flexibility reasons, I would suggest having res_name with a default value of Path(res_filename).name unless there is a reason not to do that.

That's a big conceptual change I tried to explain in the comment above.

Because the name itself can be more than a file name: it can have a directory part. The base for resolving names is always some location directory, but (at least for the module location) the search from there might be recursive. (So for deeper files, using just the basename as the resource name would fail to resolve.)

For example, in the Tesseract case, the pretrained script models are usually referenced as e.g. script/Latin.traineddata. But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)

Another example is the new ocrd_pagetopdf's data/AletheiaSans.ttf, but that could just be a silent default again.

All other module resources I can find are in the top level. So perhaps we should reconsider the recursive dive in list_all_resources at the module location – if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...
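The distinction between location directory and resource name can be shown with a small self-contained sketch. It is not the resolution code in core: the helper `resolve` is hypothetical and the directory layout is created in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path


def resolve(location, name):
    """Resolve a resource name against a location directory; None if missing."""
    candidate = Path(location) / name
    return candidate if candidate.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    location = Path(tmp)
    nested = location / "script" / "Latin.traineddata"
    nested.parent.mkdir()
    nested.touch()

    # The name that resolves is the path relative to the location ...
    full_name = nested.relative_to(location).as_posix()
    # ... whereas the bare basename loses the directory part.
    base_name = nested.name

    found_by_full_name = resolve(location, full_name) is not None
    found_by_base_name = resolve(location, base_name) is not None
```

With a flat layout the two names coincide, which is why flattening the locations would allow going back to plain strings.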

But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)

Not sure how complicated that would be to change.

if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...

Not necessary to switch back, I guess; nested resources should still be handled properly by core if there was no spec guiding how to store the resources.

But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)

Not sure how complicated that would be to change.

We moved away from installing under tessdata/script/ a while ago. Some users might have to update their installations, but it would not be hard to enforce (code-wise).

And the configs download was just a convenience for the standalone CLI – we could just make that part of the installation procedure, without a resmgr entry.

if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...

Not necessary to switch back, I guess; nested resources should still be handled properly by core if there was no spec guiding how to store the resources.

What is astonishing is that the current spec states pretty much the converse of what was implemented:

* module resources must be directly under the top-level module directory
* all other resources can be arbitrary relative paths

So if we want to stick with that, we would need to keep my location-name tuple structure, but change the iterators in list_all_resources to support recursion. (But it might be hard to get the level right: take Calamari for example again – you must not descend into fraktur_19th_century/?.ckpt/. So the actual maxlevel seems to then depend on the processor, which ocrd_utils does not know anything about.)
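The depth problem can be made concrete with a sketch of a depth-limited listing. This is an assumption-laden illustration, not list_all_resources: the function `list_names`, its `max_depth` parameter and the file `weights.bin` are hypothetical, and the model directory is created in a temporary location.

```python
import tempfile
from pathlib import Path


def list_names(location, max_depth=1):
    """Return relative names of entries under `location`, down to `max_depth` levels."""
    location = Path(location)
    names = []

    def visit(directory, depth):
        for entry in sorted(directory.iterdir()):
            names.append(entry.relative_to(location).as_posix())
            # Only descend while the depth limit has not been reached.
            if entry.is_dir() and depth < max_depth:
                visit(entry, depth + 1)

    visit(location, 1)
    return names


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "fraktur_19th_century"
    (model / "0.ckpt").mkdir(parents=True)
    (model / "0.ckpt" / "weights.bin").touch()  # hypothetical file name

    flat = list_names(tmp, max_depth=1)
    deep = list_names(tmp, max_depth=2)
```

At depth 1 the model directory is reported as a single resource. At depth 2 its checkpoint subdirectory shows up as a separate candidate, which is the kind of split described for Calamari models, and the right limit cannot be chosen without knowing the processor.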

I am inclined to rather change the spec: flat top-level everywhere.

I am inclined to rather change the spec: flat top-level everywhere.

Agree!

So in accordance with OCR-D/spec#263, I changed the return value of list_all_resources back to a list of str and disabled recursive moduled discovery – all locations but CWD are flat. I also added some basic tests for list_all_resources, which I found were completely lacking.

src/ocrd_utils/os.py

bertsky · 2025-03-12T22:18:54Z

From my understanding, the database will become more consistent and avoid duplicate entries? If that is the case, I would prefer the logging to happen inside add_to_user_database.

I kept it there and removed it outside.

PR is ready from my side. We still have the other issues, this just tries to get some sense into the path resolution rules.

kba · 2025-03-28T14:57:51Z

Merging this into #1309 branch which I'll reopen with the revert reverted. Again, sorry for the inconvenience.

Robert Sachunsky added 17 commits March 11, 2025 17:14

ocrd_utils.guess_media_type: fix typo in exception

a32043a

ocrd_utils.list_all_resources: add more anti-patterns

d2fdd68

ocrd_utils.get_processor_resource_types: split content-type patterns by comma

47abfdc

ocrd_utils.os: type hints, fix cwd kwarg default

aa4175a

ocrd_utils.list_all_resources: do not descend directories again

4c5dc28

ocrd_validators.ocrd_tool.schema.yml: update from spec

5242fa3

list_all_resources: yet more anti-patterns

4eaa6f3

resmgr _download_resource | _copy_resource: fix arg order typo

5b833da

resmgr add_to_user_database: only log.info if actually unregistered

8b66e1b

resmgr list_available: fix retval ambiguity (dict_items→list)

8232d6d

list_all_resources: add location dir to retval as resource names can be nested paths

8a9777f

list_all_resources: match file resources by content-type, if available

7c8840c

match file resources by content-type: delegate to ocrd_utils from Processor.list_all_resources and ResourceManager.list_installed

a855488

Merge remote-tracking branch 'origin/1294-impl-rm-server' into resmgr-type-checking

ab93982

resmgr add_to_user_database needs to discern name and path, too

deeb505

resmgr add_to_user_database: forgot ensure db key exists

9ff2013

resmgr handle_resource: if no name was passed, get it from path_in_archive if necessary

540c9d1

bertsky requested a review from MehmedGIT March 12, 2025 11:25

Robert Sachunsky added 2 commits March 12, 2025 12:25

list_installed needs str not Path

d44c762

resmgr add repr()

1c98e4e

bertsky mentioned this pull request Mar 12, 2025

Implementation of the resource manager server (issue #1294) #1309

Merged

MehmedGIT approved these changes Mar 12, 2025

View reviewed changes

Robert Sachunsky added 2 commits March 12, 2025 23:10

list_all_resources: only names at all non-CWD location (no recursion even in moduled)

ed7c60b

test_os: add cases for list_all_resources

7e5d508

Merge branch '1294-impl-rm-server' into resmgr-type-checking

2e9dab5

kba mentioned this pull request Mar 27, 2025

Revert "Merge remote-tracking branch 'bertsky/resmgr-type-checking'" #1317

Merged

resmgr download: revert default location

f765a96

(otherwise "first among configured" logic does not work)

bertsky added 2 commits March 27, 2025 19:14

resmgr download: improve help str for fallback attrs

0e09546

Update CHANGELOG.md

c9d8b41

kba mentioned this pull request Mar 28, 2025

Continuation of #1309: Implementation of the resource manager server (issue #1294) #1319

Merged

kba merged commit b22ae54 into OCR-D:1294-impl-rm-server Apr 1, 2025
1 check passed
