@bertsky
Collaborator

@bertsky bertsky commented Mar 12, 2025

No description provided.

@bertsky bertsky requested a review from MehmedGIT March 12, 2025 11:25
@bertsky
Collaborator Author

bertsky commented Mar 12, 2025

I don't know how @kba originally conceived of this, but I was surprised to find that list_all_resources would normally dive into directories twice, but then give us just the basename as the resource name. As a result, Calamari models (which are directories that have subdirectories) were split up. IMO the only place that needs a recursive dive is the module directory. But then that subpath is part of the model name, because it is resolved against the module directory. I had to refactor a bit to properly distinguish the base path (the directory of the resource location) from the subpath (the resource name).
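
The split-up effect can be reproduced in miniature. The following is a minimal sketch, not the actual list_all_resources code: both helper functions, the directory layout and the file names are hypothetical and only illustrate the difference between reporting basenames after a dive and treating each top-level entry as one resource.

```python
import tempfile
from pathlib import Path


def names_by_basename(location: Path) -> list[str]:
    """Dive one level into subdirectories, but report only basenames (hypothetical)."""
    names = []
    for entry in location.iterdir():
        if entry.is_dir():
            names.extend(child.name for child in entry.iterdir())
        else:
            names.append(entry.name)
    return sorted(names)


def names_flat(location: Path) -> list[str]:
    """Treat every top-level entry, file or directory, as one resource (hypothetical)."""
    return sorted(entry.name for entry in location.iterdir())


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        location = Path(tmp)
        # a directory-shaped model (names are made up for illustration)
        model = location / "some_model"
        model.mkdir()
        (model / "0.ckpt.json").write_text("{}")
        (model / "1.ckpt.json").write_text("{}")
        # a plain single-file resource
        (location / "default.json").write_text("{}")
        return names_by_basename(location), names_flat(location)


split_up, flat = demo()
print(split_up)  # the model directory no longer appears as one resource
print(flat)      # the model directory is listed under its own name
```

In the first listing the directory name is lost, so the individual checkpoint files cannot be resolved against the location any more.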

The rest is mostly the filtering of false positives via MIME types and suffixes.
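
As a rough illustration of filtering by MIME type and suffix, here is a minimal sketch using the standard library's mimetypes module. The allow-lists and the function name are assumptions made for this example; which types and suffixes the PR actually checks is not stated here.

```python
import mimetypes
from pathlib import Path

# Hypothetical allow-lists; a real processor would declare its own.
ALLOWED_MIMETYPES = {"application/json"}
ALLOWED_SUFFIXES = {".traineddata"}


def looks_like_resource(path: Path) -> bool:
    """Accept a candidate if either its guessed MIME type or its suffix is allowed."""
    mimetype, _encoding = mimetypes.guess_type(path.name)
    return mimetype in ALLOWED_MIMETYPES or path.suffix in ALLOWED_SUFFIXES


candidates = [Path("eng.traineddata"), Path("params.json"), Path("README.txt")]
kept = [p.name for p in candidates if looks_like_resource(p)]
print(kept)
```

Checking the suffix as well as the MIME type matters for model formats that the MIME registry does not know about.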

note: we still have to discuss where the log.info message for unregistered resources is more appropriate:

  • in handle_resource when there is no size key – bertsky@0635b0f
  • in add_to_user_database when it was established that this entry is new – 8b66e1b

@bertsky
Collaborator Author

bertsky commented Mar 12, 2025

For the CI failure I believe we just need to update repo/assets from master...

Contributor

@MehmedGIT MehmedGIT left a comment

> note: we still have to discuss where the log.info message for unregistered resources is more appropriate:
> * in handle_resource when there is no size key – bertsky@0635b0f
> * in add_to_user_database when it was established that this entry is new – 8b66e1b

From my understanding, the database will become more consistent and avoid duplicate entries? If that is the case, I would prefer the logging to happen inside add_to_user_database.
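
A minimal sketch of that preference, assuming a plain dict as a stand-in for the user database: the function name, the dict layout, the processor name and the log message are all hypothetical. The point is only that the message is emitted at the one place where it is known that the entry is new.

```python
import logging

log = logging.getLogger("resource_sketch")


def add_to_database(database: dict, executable: str, res_name: str) -> bool:
    """Register res_name for executable; log and return True only for new entries."""
    entries = database.setdefault(executable, [])
    if res_name in entries:
        # already registered: nothing to add, nothing to log
        return False
    log.info("%s: resource '%s' is not registered yet, adding it", executable, res_name)
    entries.append(res_name)
    return True


db = {}
first = add_to_database(db, "some-processor", "model.bin")
second = add_to_database(db, "some-processor", "model.bin")
print(first, second, db)
```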

return ret

- def add_to_user_database(self, executable, res_filename, url=None, resource_type='file'):
+ def add_to_user_database(self, executable, res_name, res_filename, url=None, resource_type='file'):
Contributor

Why do we need the additional parameter res_name if we can already get it from res_filename? If it is for flexibility reasons, I would suggest having res_name with a default value of Path(res_filename).name unless there is a reason not to do that.

Collaborator Author

That's a big conceptual change I tried to explain in the comment above.

Because the name itself can be more than a file name: it can have a directory part. The base for resolving names is always some location directory, but (at least for the module location) from there it might be recursive. (So for deeper files, using just the basename as resource name would fail to resolve.)
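
To make the resolution failure concrete, here is a minimal sketch. The resolve helper is hypothetical and far simpler than the real lookup; it only joins a name onto a location directory and checks that the result exists.

```python
import tempfile
from pathlib import Path


def resolve(location: Path, res_name: str):
    """Resolve a resource name against a location directory (hypothetical helper)."""
    candidate = location / res_name
    return candidate if candidate.exists() else None


def demo() -> tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        location = Path(tmp)
        nested = location / "script" / "Latin.traineddata"
        nested.parent.mkdir()
        nested.write_bytes(b"")
        # the name with its directory part resolves ...
        by_subpath = resolve(location, "script/Latin.traineddata")
        # ... but the bare basename does not
        by_basename = resolve(location, nested.name)
        return by_subpath is not None, by_basename is not None


found_by_subpath, found_by_basename = demo()
print(found_by_subpath, found_by_basename)
```

This is why a default of Path(res_filename).name would only be safe if every location were flat.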

Collaborator Author

For example, in the Tesseract case, the pretrained script models are usually referenced as e.g. script/Latin.traineddata. But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)

Another example is the new ocrd_pagetopdf's data/AletheiaSans.ttf, but that could just be a silent default again.

All other module resources I can find are in the top level. So perhaps we should reconsider the recursive dive in list_all_resources at the module location – if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...

Contributor

> But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)

Not sure how complicated that would be to change.

> if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...

Not necessary to switch back, I guess; nested resources should still be handled properly by core if there was no spec guiding how to store the resources.

Collaborator Author

> > But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)
>
> Not sure how complicated that would be to change.

We moved away from installing under tessdata/script/ a while ago. Some users might have to update their installations, but it would not be hard to enforce (code-wise).

And the configs download was just a convenience for the standalone CLI – we could just make that part of the installation procedure, without a resmgr entry.

> > if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...
>
> Not necessary to switch back, I guess; nested resources should still be handled properly by core if there was no spec guiding how to store the resources.

What is astonishing is that the current spec states pretty much the converse of what was implemented:

  • module resources must be directly under the top-level module directory
  • all other resources can be arbitrary relative paths

So if we want to stick with that, we would need to keep my location-name tuple structure, but change the iterators in list_all_resources to support recursion. (But it might be hard to get the level right: take Calamari for example again – you must not descend into fraktur_19th_century/?.ckpt/. So the actual maxlevel then seems to depend on the processor, which ocrd_utils does not know anything about.)
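
The depth problem can be sketched with a depth-limited walk. This is a hypothetical helper, not the list_all_resources iterator, and the directory layout below is made up: one directory-shaped model next to one nested single-file model.

```python
import tempfile
from pathlib import Path


def list_resources(location: Path, maxlevel: int) -> list[str]:
    """List names relative to location, descending at most maxlevel levels.

    Directories found at the deepest allowed level are reported as one resource.
    """
    names = []

    def visit(directory: Path, level: int) -> None:
        for entry in sorted(directory.iterdir()):
            if entry.is_dir() and level < maxlevel:
                visit(entry, level + 1)
            else:
                names.append(entry.relative_to(location).as_posix())

    visit(location, 1)
    return names


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        location = Path(tmp)
        # directory-shaped model: must be reported as "model_a"
        (location / "model_a" / "0.ckpt").mkdir(parents=True)
        (location / "model_a" / "0.ckpt" / "weights").write_bytes(b"")
        # nested file model: must be reported as "script/Latin.traineddata"
        (location / "script").mkdir()
        (location / "script" / "Latin.traineddata").write_bytes(b"")
        return list_resources(location, 1), list_resources(location, 2)


level1, level2 = demo()
print(level1)  # right for model_a, wrong for script
print(level2)  # right for script, wrong for model_a
```

No single maxlevel yields both correct names, which is the reason the level would have to come from the processor.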

I am inclined to rather change the spec: flat top-level everywhere.

Contributor

> I am inclined to rather change the spec: flat top-level everywhere.

Agree!

Collaborator Author

So in accordance with OCR-D/spec#263 I changed the return value of list_all_resources back to a list of str, and disabled recursive module discovery – all locations but CWD are flat. I also added some basic tests for list_all_resources, which I found completely lacking.
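
A basic test of that behaviour could look roughly like the following. This is a sketch under assumptions: list_names is a hypothetical stand-in for the listing logic, and the recursive flag models the assumption that only the working directory may contain nested relative paths. It is not one of the tests added in this PR.

```python
import tempfile
from pathlib import Path


def list_names(location: Path, recursive: bool = False) -> list[str]:
    """Return resource names as plain strings, flat unless recursive is set."""
    if recursive:
        paths = (p for p in location.rglob("*") if p.is_file())
    else:
        paths = location.iterdir()
    return sorted(p.relative_to(location).as_posix() for p in paths)


def test_list_names() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        location = Path(tmp)
        (location / "sub").mkdir()
        (location / "sub" / "inner.bin").write_bytes(b"")
        (location / "top.bin").write_bytes(b"")
        # flat location: the directory itself is the resource
        assert list_names(location) == ["sub", "top.bin"]
        # CWD-like location: nested relative paths are allowed
        assert list_names(location, recursive=True) == ["sub/inner.bin", "top.bin"]


test_list_names()
```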

@bertsky
Collaborator Author

bertsky commented Mar 12, 2025

> From my understanding, the database will become more consistent and avoid duplicate entries? If that is the case, I would prefer the logging to happen inside add_to_user_database.

I kept it there and removed it outside.

PR is ready from my side. We still have the other issues; this just tries to bring some sense into the path resolution rules.

(otherwise "first among configured" logic does not work)
@kba
Member

kba commented Mar 28, 2025

Merging this into #1309 branch which I'll reopen with the revert reverted. Again, sorry for the inconvenience.

@kba kba merged commit b22ae54 into OCR-D:1294-impl-rm-server Apr 1, 2025
1 check passed
3 participants