Implementation of the resource manager server (issue #1294) #1309
kba
left a comment
So far so good, though I still need to understand how exactly this works
|
I have difficulty deciding how the I will use the |
This is required in order to fix: Not supported URL scheme http+docker
|
@MehmedGIT indeed, this calls for more aggressive refactoring. IMO the high-level Perhaps like this:
def download(
    logger: Optional[logging.Logger],
    progress_cb: Callable[[int], Any],
    resmgr: OcrdResourceManager,
    any_url: Optional[str],
    no_dynamic: bool,
    resource_type: str,
    path_in_archive: str,
    allow_uninstalled: bool,
    overwrite: bool,
    location: str,
    executable: str,
    name: str
):
    # current implementation of cli.resmgr.download
(or perhaps as kwargs with their own defaults, perhaps returning the fpath and usable resource name as a tuple instead of just logging it) The ResourceManagerServer could then call this to its own liking. |
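A minimal sketch of the shape proposed above, assuming keyword arguments with defaults and a `(fpath, name)` return value. `FakeResourceManager` and its `download` method are hypothetical stand-ins for `OcrdResourceManager`, not its actual API, and only a subset of the parameters is shown:

```python
import logging
from pathlib import Path
from typing import Any, Callable, Optional, Tuple


class FakeResourceManager:
    """Hypothetical stand-in for OcrdResourceManager (not the real API)."""

    def download(self, executable: str, url: str, name: str, overwrite: bool = False) -> Path:
        # Pretend the resource was fetched and report where it would live.
        return Path("/tmp/ocrd-resources") / executable / name


def download(
    resmgr: FakeResourceManager,
    executable: str,
    name: str,
    any_url: Optional[str] = None,
    overwrite: bool = False,
    logger: Optional[logging.Logger] = None,
    progress_cb: Optional[Callable[[int], Any]] = None,
) -> Tuple[Path, str]:
    """Download one resource and return (fpath, usable resource name)."""
    log = logger or logging.getLogger("resmgr.download")
    if not any_url:
        # Raise instead of exiting, so that a caller (CLI or server) decides what to do.
        raise ValueError(f"No URL known for resource '{name}' of '{executable}'")
    fpath = resmgr.download(executable, any_url, name, overwrite=overwrite)
    if progress_cb:
        progress_cb(100)  # placeholder: signal completion to the caller
    log.info("Installed resource %s under %s", name, fpath)
    return fpath, name
```

With this shape, the CLI could keep logging the result, while a server endpoint could put the returned tuple into its response.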
|
@bertsky, agree. For now, I will mimic the |
|
Indeed, the new/refactored Perhaps it's a good time to remove the entire loathed user database facility from resmgr, too? (cf. #1251) |
HTTPException instead of ValueError; we do not want to crash the resource manager server on a wrong request, right?
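A minimal sketch of that idea, assuming the server is built on FastAPI (an assumption, not confirmed in this thread). In a real endpoint the raised exception would be `fastapi.HTTPException(status_code=..., detail=...)`; a small stand-in class is defined here only so the snippet runs without third-party packages. `KNOWN`, `lookup_resource` and `handle_request` are hypothetical names:

```python
class HTTPException(Exception):
    """Stand-in mirroring fastapi.HTTPException(status_code, detail)."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


KNOWN = {("ocrd-dummy", "model.bin")}  # hypothetical registry of resources


def lookup_resource(executable: str, name: str) -> str:
    """Library-level code: signals bad input with ValueError."""
    if (executable, name) not in KNOWN:
        raise ValueError(f"Unknown resource '{name}' for '{executable}'")
    return f"{executable}/{name}"


def handle_request(executable: str, name: str) -> dict:
    """Endpoint-level code: turn a bad request into an HTTP 400 error."""
    try:
        return {"resource": lookup_resource(executable, name)}
    except ValueError as error:
        raise HTTPException(status_code=400, detail=str(error)) from error
```

The library code can keep raising `ValueError` for the CLI, and only the endpoint layer translates it into an HTTP error.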
I would suggest reconsidering the wildcard ( |
For your interim implementation, yes. I was again referring to the proposed refactored
Agreed, this should be blocked in the RMS. Perhaps the PS discovery can re-implement a similar behavior for convenience in the future, but I doubt there is much use for this. (It's more likely we will script this in some form externally, maybe along with the workflow repository.) |
|
If @kba has no objections (since he was going to create a PR to fix the resmgr), I will do some basic refactoring without breaking the CLI behaviour.
I think for listing available/installed resources the wildcard is not an issue for now (but it may become one after the database is refactored away). |
|
bb0b0cd should be a fix for the missing entries for the
The issue was that not all entries found were saved properly in the database and to the resources.yml by invoking |
-    resdict['path'] = str(res_filename)
+    # resdict['path'] = str(res_filename)
Why not store the path into the returned list of dicts anymore? (So far, it seems to be used only in print_resources, but why not?)
@bertsky, since the self.save_user_list() method writes to the resources.yml, the path ended up there as a key, which made the OcrdResourceListValidator fail on subsequent loads of the YAML file. Unfortunately, simply saving before assigning the new path key does not help either. I did not want to make a deep copy of the entire database just for the extra path key in the output, nor did I want to modify the ocrd_tool.schema.yml by adding an extra path field. I will not get rid of the path completely; I just need to figure out how to achieve the same behaviour optimally. Maybe that will become clear after I refactor the database itself.
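One possible way to keep the `path` in the output without touching the persisted entries is a shallow per-entry copy rather than a deep copy of the whole database. A minimal sketch with made-up data; `with_path` is a hypothetical helper, not part of the resource manager:

```python
from pathlib import Path
from typing import Any, Dict


def with_path(resdict: Dict[str, Any], res_filename: Path) -> Dict[str, Any]:
    """Return a shallow copy of one entry with an extra 'path' key for display."""
    return {**resdict, "path": str(res_filename)}


# The stored entry stays schema-conformant; only the listed copy carries the path.
database_entry = {"name": "model.bin", "url": "https://example.org/model.bin"}
listed = with_path(database_entry, Path("/tmp/ocrd-resources/ocrd-dummy/model.bin"))
```

Since only the top-level dict is copied, nested values are still shared with the database entry, which is fine as long as the listing code does not modify them.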
Oh, I see. Yes, I also would not like to see path in the schema (as this syntax is meant to be shared between ocrd-tool.json developer descriptions and resource.yml installations).
But in this function the final save_user_list() seems redundant, because every add_to_user_database() will already invoke that (for each executable, before adding path).
I don't remember why we originally decided to save the file for every database update. It should be independent IMO.
Also, currently we do the saving twice for every processor, because add_to_user_database() also invokes list_available(), which now finally invokes save_user_list() as well.
But in this function the final save_user_list() seems redundant, because every add_to_user_database() will already invoke that (for each executable, before adding path).
True, it seems redundant, but that was part of the fix for the missing resources of #1251. Still not sure why. Also, add_to_user_database() is not called for resources found at module level.
I don't remember why we originally decided to save the file for every database update. It should be independent IMO.
Agree. I will optimize that when I get there. There are unnecessary saves to and loads from the yaml file for each discovered resource.
Also, currently we do the saving twice for every processor, because add_to_user_database() also invokes list_available(), which now finally invokes save_user_list() as well.
Right. I have also spotted that in the logs. I think a simpler search method is needed instead of relying on the list_available() method, which also has other side effects.
I also do not like the database deduplication method. Preventing duplication should perform better than adding and then trying to remove duplications afterwards.
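A minimal sketch of checking before insertion, assuming entries are identified by their `name` within one executable's list (the real identity criterion may differ). `add_resource` is a hypothetical helper:

```python
from typing import Any, Dict, List


def add_resource(entries: List[Dict[str, Any]], resdict: Dict[str, Any]) -> bool:
    """Append resdict unless an entry with the same name exists; report whether it was added."""
    if any(entry["name"] == resdict["name"] for entry in entries):
        return False
    entries.append(resdict)
    return True


entries: List[Dict[str, Any]] = []
add_resource(entries, {"name": "model.bin", "url": "https://example.org/a"})
# The second call is rejected, so no cleanup pass is needed afterwards.
add_resource(entries, {"name": "model.bin", "url": "https://example.org/b"})
```

The membership check here is a linear scan; a dict keyed by name would make it constant-time, at the cost of giving up the list structure.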
offtopic, but just a few lines above (I cannot suggest outside of the narrow diff hunk context):
instead of...
elif str(res_filename.parent) == moduledir:
...please write...
elif str(res_filename.parent).startswith(moduledir):
(because there are many module-provided data files that are in subdirectories, which now end up as cwd resources)
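For illustration, the difference between the two conditions, plus a path-aware variant (an alternative, not what is suggested in this review) that avoids matching sibling directories which merely share a name prefix. All paths are made up:

```python
from pathlib import PurePosixPath

moduledir = "/usr/share/ocrd_module"
res_filename = PurePosixPath("/usr/share/ocrd_module/models/model.bin")
sibling = PurePosixPath("/usr/share/ocrd_module_extra/model.bin")

# Exact comparison misses files in subdirectories of the module directory.
exact_match = str(res_filename.parent) == moduledir

# Prefix comparison finds them, but also matches a sibling directory with the same prefix.
prefix_match = str(res_filename.parent).startswith(moduledir)
prefix_match_sibling = str(sibling.parent).startswith(moduledir)

# Path-aware check: is moduledir one of the ancestors of the file?
ancestor_match = PurePosixPath(moduledir) in res_filename.parents
ancestor_match_sibling = PurePosixPath(moduledir) in sibling.parents
```

Whether the sibling case can occur in practice depends on how module directories are laid out.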
But in this function the final save_user_list() seems redundant, because every add_to_user_database() will already invoke that (for each executable, before adding path).
True, it seems redundant, but that was part of the fix for the missing resources of #1251.
Yes, that did help – but does not search for and add executables to the database in all cases. So I would still consider this behaviour an open matter.
Still not sure why. Also, add_to_user_database() is not called for resources found at module level.
True. The more I think about it, the less I understand what might have been the original idea behind the user database.
Clearly, @kba wanted to save the time of searching executables and resources repeatedly, hence the shortcuts. But not being able to add executables to the database seems broken, as does not having module resources show up as registered/installed resources.
I don't remember why we originally decided to save the file for every database update. It should be independent IMO.
Agree. I will optimize that when I get there. There are unnecessary saves to and loads from the yaml file for each discovered resource.
Yes, once we have a clearer idea about the database lifetime, it should be easy to reduce file I/O.
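A minimal sketch of decoupling database updates from saving, so that a discovery run writes the file once at the end. `UserDatabase` is a hypothetical class, and `save()` only counts writes here instead of touching a YAML file:

```python
from typing import Any, Dict, List


class UserDatabase:
    """Hypothetical in-memory database that is saved only on request."""

    def __init__(self) -> None:
        self.entries: Dict[str, List[Dict[str, Any]]] = {}
        self.dirty = False
        self.save_count = 0

    def add(self, executable: str, resdict: Dict[str, Any]) -> None:
        # Only mutate memory and remember that something changed.
        self.entries.setdefault(executable, []).append(resdict)
        self.dirty = True

    def save(self) -> None:
        # Stand-in for writing the YAML file; skipped when nothing changed.
        if not self.dirty:
            return
        self.save_count += 1
        self.dirty = False


db = UserDatabase()
for name in ("a.bin", "b.bin", "c.bin"):
    db.add("ocrd-dummy", {"name": name})
db.save()
db.save()  # no-op: nothing changed since the last save
```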
Also, currently we do the saving twice for every processor, because add_to_user_database() also invokes list_available(), which now finally invokes save_user_list() as well.
Right. I have also spotted that in the logs. I think a simpler search method is needed instead of relying on the list_available() method, which also has other side effects.
...like short-cutting via ocrd-all-tool.json if available?
I also do not like the database deduplication method. Preventing duplication should perform better than adding and then trying to remove duplications afterwards.
As discussed in the chat, the list (instead of dict) structure (and hence deduplication) might have been meant to allow keeping additional user-defined versions of the same resource name.
The original idea behind writing to the database YAML was to a) speed up lookups, b) automatically add new resources when doing a lookup, and c) make it extensible, so users could add their own local resources.
But the resource manager currently does none of those properly, and it is very easy to create invalid data, add fixed paths that should be dynamic (module resources), or end up with clashing names.
Also, the way some of the functionality is handled in utils functions, other functionality in the manager class, and yet other functionality in the CLI is messy.
I'm open to (radical) refactoring.
|
Note: this PR so far does not fix the described problems entirely (not even with #1315). Still, new database entries (i.e. new executables) only get added when
But not:
So |
|
Noted and appended your comment as a reference to the top. Thanks! |
|
https://github.com/OCR-D/core/releases/tag/v3.2.0 is also bogus – it does not include #1315 |
I did not react because I can also create another PR for the network client, request forwarding over the Processing Server, etc. However, I am also fine if this PR is reverted and I continue here. I agree about #1315. |
|
Yeah, I messed up, this was not intentional. I'll try to fix it :( |
I did not realize the problem was that #1315 was based on #1309.
I'll revert the merge and release a hotfix without it. We'll need to rename the branch and open a new PR though; GitHub does not allow continuing work on a merged branch AFAIK. I should (and from now on will) create release branches so I spot these oversights beforehand. |
@kba Whatever is faster and easier on your end. I am fine with a new branch and PR to continue working on the RM Server. |
Oh, I see. Sorry, did not notice.
That's to avoid having an incomplete version of the ResourceManagerServer in master and release, right? If so, I'm for that.
Ok, if that's required, so be it. We should be careful not to lose sight of our discussion, insofar as it still matters, e.g.
|
A PR draft for addressing #1294. Still in progress.
What has been added so far:
- ocrd network resmgr-server --address host:port for triggering the Resource Manager Server (RMS) in the background
- list-available functionality for the RMS
- list-installed functionality for the RMS