fix name vs subdir, add type checking for resource candidates #1315
…cessor.list_all_resources and ResourceManager.list_installed
…chive if necessary
I don't know how @kba originally conceived of this, but I was surprised to find that The rest is mostly the filtering of false positives via MIME types and suffixes. note: we still have to discuss where the log.info message for unregistered resources is more appropriate:
For the CI failure I believe we just need to update repo/assets from master...
MehmedGIT
left a comment
note: we still have to discuss where the log.info message for unregistered resources is more appropriate:
* in handle_resource when there is no size key – bertsky@0635b0f
* in add_to_user_database when it was established that this entry is new – 8b66e1b
From my understanding, the database will become more consistent and avoid duplicate entries? If that is the case, I would prefer the logging to happen inside add_to_user_database.
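To make the second option concrete, here is a minimal sketch of logging only once an entry is known to be new. It assumes a plain dict as a stand-in for the user database; the logger name, the message text, the dict layout and the simplified signature are all hypothetical and not taken from resource_manager.py.

```python
import logging

log = logging.getLogger("resmgr_sketch")  # hypothetical logger name


def add_to_user_database(database, executable, res_name, url=None):
    """Register res_name for executable unless it is already present.

    `database` is a plain dict standing in for the real user database
    (an assumption of this sketch). Returns True if a new entry was added.
    """
    entries = database.setdefault(executable, [])
    if any(entry["name"] == res_name for entry in entries):
        # Already registered: no duplicate entry and no log message.
        return False
    # Only at this point is it established that the entry is new.
    log.info("%s resource '%s' is not registered yet, adding it", executable, res_name)
    entries.append({"name": res_name, "url": url})
    return True


db = {}
first = add_to_user_database(db, "ocrd-dummy", "model.bin")
second = add_to_user_database(db, "ocrd-dummy", "model.bin")
```

With the check and the log call in the same place, the message is emitted at most once per resource, which is the consistency argument made above.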
src/ocrd/resource_manager.py
Outdated
    return ret

-    def add_to_user_database(self, executable, res_filename, url=None, resource_type='file'):
+    def add_to_user_database(self, executable, res_name, res_filename, url=None, resource_type='file'):
Why do we need the additional parameter res_name if we can already get it from res_filename? If it is for flexibility reasons, I would suggest having res_name with a default value of Path(res_filename).name unless there is a reason not to do that.
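A rough sketch of what that suggestion could look like, written as a free function rather than the real method. The returned dict is invented for illustration, and note that a defaulted res_name has to come after res_filename in the signature, which differs from the parameter order in the diff.

```python
from pathlib import Path


def add_to_user_database(executable, res_filename, res_name=None, url=None, resource_type='file'):
    """Hypothetical variant: res_name falls back to the basename of res_filename."""
    if res_name is None:
        # Suggested default: derive the name from the file path.
        res_name = Path(res_filename).name
    # The entry layout below is an assumption made for this sketch only.
    return {
        "executable": executable,
        "name": res_name,
        "path": str(res_filename),
        "url": url,
        "type": resource_type,
    }


defaulted = add_to_user_database("ocrd-dummy", "/some/location/model.bin")
explicit = add_to_user_database("ocrd-dummy", "/some/location/sub/model.bin", res_name="sub/model.bin")
```

Callers that only have a file path get the basename, while callers that know a longer name can still pass it explicitly.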
That's a big conceptual change I tried to explain in the comment above.
Because the name itself can be more than a bare file name: it can have a directory part. The base for resolving names is always some location directory, but (at least for the module location) the lookup from there might be recursive. (So for deeper files, using just the basename as the resource name would fail to resolve.)
For example, in the Tesseract case, the pretrained script models are usually referenced as e.g. script/Latin.traineddata. But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)
Another example is the new ocrd_pagetopdf's data/AletheiaSans.ttf, but that could just be a silent default again.
All other module resources I can find are in the top level. So perhaps we should reconsider the recursive dive in list_all_resources at the module location – if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...
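To illustrate the difference between the two paradigms with the Tesseract example: a small sketch using pure paths only, so nothing touches the file system. The location directory and the resolve helper are hypothetical; only script/Latin.traineddata comes from the discussion above.

```python
from pathlib import PurePosixPath

# Hypothetical location directory that names are resolved against.
location = PurePosixPath("/usr/share/tessdata")
res_path = location / "script" / "Latin.traineddata"

# The res_filename.name paradigm keeps only the basename...
basename = res_path.name
# ...while a name relative to the location keeps the directory part.
relative_name = res_path.relative_to(location).as_posix()


def resolve(location, name):
    """Hypothetical resolver: join the location directory and the resource name."""
    return location / name


resolved_relative = resolve(location, relative_name)
resolved_basename = resolve(location, basename)
```

Here resolved_relative points back at the original file, whereas resolved_basename points at a top-level path that does not correspond to it. Flattening the installations would make both names coincide.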
But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)
Not sure how complicated that would be to change.
if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...
Not necessary to switch back, I guess; Nested resources should still be handled properly by core if there was no spec guiding how to store the resources.
But we could of course flatten this in our tessdata installations. (And we could also try to get rid of the configs/ resource...)
Not sure how complicated that would be to change.
We moved away from installing under tessdata/script/ a while ago. Some users might have to update their installations, but it would not be hard to enforce (code-wise).
And the configs download was just a convenience for the standalone CLI – we could just make that part of the installation procedure, without a resmgr entry.
if we make that flat as well, we can switch back to the res_filename.name paradigm (and simple strings instead of location-name tuples)...
Not necessary to switch back, I guess; Nested resources should still be handled properly by core if there was no spec guiding how to store the resources.
What is astonishing is that the current spec states pretty much the converse of what was implemented:
- module resources must be directly under the top-level module directory
- all other resources can be arbitrary relative paths
So if we want to stick with that, we would need to keep my location-name tuple structure, but change the iterators in list_all_resources to support recursion. (But it might be hard to get the level right: take Calamari for example again – you must not descend into fraktur_19th_century/?.ckpt/. So the actual maxlevel seems to then depend on the processor, which ocrd_utils does not know anything about.)
I am inclined to rather change the spec: flat top-level everywhere.
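A minimal sketch of why the depth is hard to get right, assuming a throwaway directory with a Calamari-like model folder. The helper names and the checkpoint file names are invented for illustration and are not the iterators in list_all_resources.

```python
import tempfile
from pathlib import Path


def list_flat(location):
    """Flat listing: every top-level entry (file or directory) is one resource."""
    return sorted(p.name for p in Path(location).iterdir())


def list_recursive(location):
    """Unbounded recursion: every file at any depth becomes a resource name."""
    root = Path(location)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A directory-type resource whose inner files are not resources themselves.
    model = root / "fraktur_19th_century"
    model.mkdir()
    (model / "0.ckpt.json").touch()  # hypothetical checkpoint file names
    (model / "1.ckpt.json").touch()
    (root / "Fraktur.traineddata").touch()

    flat = list_flat(root)
    recursive = list_recursive(root)
```

The flat listing reports the model directory as a single resource, while the recursive one descends into it and reports its internals. Picking the right cut-off would need per-processor knowledge, which is the argument for a flat top level everywhere.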
I am inclined to rather change the spec: flat top-level everywhere.
Agree!
So in accordance with OCR-D/spec#263 I changed the return value of list_all_resources back to a list of str and disabled recursive module discovery – all locations but CWD are flat. I also added some basic tests for list_all_resources, which were completely lacking.
I kept it there and removed it outside. PR is ready from my side. We still have the other issues, this just tries to get some sense into the path resolution rules.
(otherwise "first among configured" logic does not work)
Merging this into #1309 branch which I'll reopen with the revert reverted. Again, sorry for the inconvenience.