Conversation

@MehmedGIT
Contributor

@MehmedGIT MehmedGIT commented Feb 18, 2025

A PR draft for addressing #1294. Still in progress.

What has been added so far:

  • Support ocrd network resmgr-server --address host:port for triggering Resource Manager Server (RMS) in the background
  • For each host mentioned in the PS config file, the deployer will deploy a resource manager server on port 45555 of that host
  • Basic list-available functionality for the RMS
  • Basic list-installed functionality for the RMS
  • A partial fix for #1251, "resmgr list-installed only knows about 3 processors with preconfigured resources" (check the detailed comment from @bertsky here)
  • Refactoring of core resource manager
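For illustration, the host:port value given to --address could be split as in the following minimal sketch. parse_address and DEFAULT_RMS_PORT are hypothetical names, not part of ocrd_network; the default of 45555 is simply the port mentioned in the list above.

```python
from typing import Tuple

DEFAULT_RMS_PORT = 45555  # port named in this PR description


def parse_address(address: str, default_port: int = DEFAULT_RMS_PORT) -> Tuple[str, int]:
    """Split 'host:port' into (host, port); hypothetical helper."""
    host, sep, port = address.rpartition(":")
    if not sep:
        # no colon at all: treat the whole string as the host
        return address, default_port
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)
```

A value without a colon falls back to the default port, while a malformed value such as ":80" raises ValueError.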

Member

@kba kba left a comment

So far so good, though I still need to understand how exactly this works

@MehmedGIT
Contributor Author

I have difficulty deciding how the download functionality should work. Responsibilities are split between OcrdResourceManager.download and ocrd.cli.resmgr.download; the former method is used inside the latter. If I reuse the former, what should the endpoint itself return? Currently it returns just a file path, but paths do not matter to the end user of the endpoint. If I use the latter, however, there are many side effects that we may not want in the download endpoint, for example the progress bar and the resource database. Moreover, the logger output is useful for the CLI, but should it also appear in the response?

I will use the ocrd.cli.resmgr.download method to simulate the CLI experience. But that is not optimal for the ocrd_network.

@bertsky, @kba

@bertsky
Collaborator

bertsky commented Mar 4, 2025

@MehmedGIT indeed, this calls for more aggressive refactoring. IMO the high-level ocrd.cli.resmgr.download (checking executable is installed, checking resource is available, discerning local paths from URLs, resolving target location, providing resulting name/path) should become independent of the CLI, i.e. part of ocrd.resource_manager (but not necessarily OcrdResourceManager), and get passed the logger and progress_cb as kwargs.

Perhaps like this:

import logging
from typing import Any, Callable, Optional

from ocrd.resource_manager import OcrdResourceManager

def download(
    logger: Optional[logging.Logger],
    progress_cb: Callable[[int], Any],
    resmgr: OcrdResourceManager,
    any_url: Optional[str],
    no_dynamic: bool,
    resource_type: str,
    path_in_archive: str,
    allow_uninstalled: bool,
    overwrite: bool,
    location: str,
    executable: str,
    name: str
):
    # current implementation of cli.resmgr.download
    ...

(or perhaps as kwargs with their own defaults, perhaps returning the fpath and usable resource name as tuple instead of just logging it)

The ResourceManagerServer could then call this to its own liking.
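To make the kwargs variant more concrete, here is a rough sketch of the shape such a function could take. It is not the actual implementation: it does not fetch anything, the base_dir parameter and the directory layout are hypothetical, and only a few of the parameters listed above are kept. What it illustrates is an injected logger and progress_cb, a (path, name) return value, and an exception in place of an exit.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Optional, Tuple


def download(
    executable: str,
    name: str,
    location: str = "data",
    overwrite: bool = False,
    logger: Optional[logging.Logger] = None,
    progress_cb: Optional[Callable[[int], Any]] = None,
    base_dir: Path = Path("/tmp/ocrd-resources"),  # hypothetical target root
) -> Tuple[Path, str]:
    """Resolve where a resource would go; return (path, usable resource name)."""
    log = logger or logging.getLogger("ocrd.resource_manager.download")
    fpath = base_dir / location / executable / name
    if fpath.exists() and not overwrite:
        # errors become exceptions so that each caller decides how to react
        raise ValueError(f"{fpath} already exists, pass overwrite=True to replace it")
    log.info("Would download %s for %s to %s", name, executable, fpath)
    if progress_cb:
        progress_cb(0)  # a real implementation would report the bytes fetched
    return fpath, name
```

The CLI could pass its progress bar update as progress_cb, whereas a server could pass nothing and build its response from the returned tuple.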

@MehmedGIT
Contributor Author

MehmedGIT commented Mar 4, 2025

@bertsky, agreed. For now, I will mimic the ocrd.cli.resmgr.download method and append the logger messages to a list, which will be returned. Instead of calling sys.exit(1), I will raise HTTPException. I will also get rid of the progress bar but keep the database until I decide how to refactor it away properly.
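One way to collect the logger messages into a list is a small logging handler, as sketched below. ListHandler, run_with_captured_log and the logger name are hypothetical and only illustrate the idea; the interim implementation may do this differently.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collect formatted log records in a list instead of printing them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


def run_with_captured_log(func, *args, **kwargs):
    """Call func(logger, ...) and return (result, log messages)."""
    logger = logging.getLogger("rms.download.sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        result = func(logger, *args, **kwargs)
    finally:
        # always detach, so repeated requests do not accumulate handlers
        logger.removeHandler(handler)
    return result, handler.messages
```

The returned list can then be placed in the endpoint response next to the actual result.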

@bertsky
Collaborator

bertsky commented Mar 4, 2025

Indeed, the new/refactored download would also need a general exception (just ValueError?) instead of exit.

Perhaps it's a good time to remove the entire loathed user database facility from resmgr, too? (cf. #1251)

@MehmedGIT
Contributor Author

MehmedGIT commented Mar 4, 2025

Indeed, the new/refactored download would also need a general exception (just ValueError?) instead of exit.

HTTPException instead of ValueError; we do not want to crash the resource manager server on a wrong request, right?

Perhaps it's a good time to remove the entire loathed user database facility from resmgr, too? (cf. #1251)

I would suggest reconsidering the wildcard ('*') options as well and requiring the processor and model name. Otherwise, the return of a response may take more than 30 minutes and effectively block the server for other requests if a wildcard is used for a processor name. At least, it does with the current implementation.

@bertsky
Collaborator

bertsky commented Mar 4, 2025

Indeed, the new/refactored download would also need a general exception (just ValueError?) instead of exit.

HTTPException instead of ValueError; we do not want to crash the resource manager server on a wrong request, right?

For your interim implementation, yes. I was again referring to the proposed refactored ocrd.resource_manager.download function (which would need to accommodate both use cases, so the ResourceManagerServer would re-raise into HTTPException, while the CLI would just catch and exit).
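The division of labour described here could look roughly like the following sketch. HTTPError is a stand-in for the web framework's HTTP exception so that the example stays self-contained, and download is only a placeholder that validates its input; all names are hypothetical.

```python
import sys


class HTTPError(Exception):
    """Stand-in for a web framework's HTTP exception (hypothetical)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def download(executable: str, name: str) -> str:
    """Placeholder for the shared function; it only validates its input."""
    if not executable or not name:
        raise ValueError("both executable and name are required")
    return f"{executable}/{name}"


def cli_download(executable: str, name: str) -> None:
    """CLI wrapper: report the problem and exit with a non-zero status."""
    try:
        print(download(executable, name))
    except ValueError as err:
        print(f"error: {err}", file=sys.stderr)
        sys.exit(1)


def server_download(executable: str, name: str) -> dict:
    """Server wrapper: turn the same error into an HTTP 400 response."""
    try:
        return {"resource": download(executable, name)}
    except ValueError as err:
        raise HTTPError(status_code=400, detail=str(err)) from err
```

The shared function stays free of both sys.exit and HTTP concerns; each front end translates the ValueError in its own way.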

Perhaps it's a good time to remove the entire loathed user database facility from resmgr, too? (cf. #1251)

I would suggest reconsidering the wildcard ('*') options as well and requiring the processor and model name. Otherwise, the return of a response may take more than 30 minutes and effectively block the server for other requests if a wildcard is used for a processor name. At least, it does with the current implementation.

Agreed, this should be blocked in the RMS.

Perhaps the PS discovery can re-implement a similar behavior for convenience in the future, but I doubt there is much use for this. (It's more likely we will script this in some form externally, maybe along with the workflow repository.)
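Blocking wildcards in the RMS could be as simple as the check sketched below. require_explicit is a hypothetical validator and the set of characters treated as glob syntax is an assumption.

```python
GLOB_CHARS = set("*?[]")  # assumption: characters treated as glob syntax


def require_explicit(value: str, field: str) -> str:
    """Reject empty values and glob patterns; hypothetical request validator."""
    if not value:
        raise ValueError(f"{field} is required")
    if GLOB_CHARS & set(value):
        raise ValueError(f"{field} must not contain wildcards: {value!r}")
    return value
```

The download endpoint would call it for both the processor and the resource name before any work starts, so a request with '*' fails immediately instead of blocking the server.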

@MehmedGIT
Contributor Author

If @kba has no objections (since he was going to create a PR to fix the resmgr), I will do some basic refactoring without breaking the CLI behaviour.

Perhaps the PS discovery can re-implement a similar behavior for convenience in the future

I think for listing available/installed resources the wildcard is not an issue for now (but it may become one after refactoring the database away).

@MehmedGIT
Contributor Author

MehmedGIT commented Mar 6, 2025

bb0b0cd should be a fix for the missing entries for the '*' glob, i.e., #1251. I manually set the XDG_DATA_HOME and XDG_CONFIG_HOME in my environment to point to the same path.

ocrd resmgr download '*' should now yield the expected behaviour, provided that ocrd resmgr list-available and potentially ocrd resmgr list-installed are run first to update the resources.yml.

The issue was that not all entries found were saved properly in the database and to the resources.yml by invoking self.save_user_list(). Although that method is called inside self.add_to_user_database(), the latter was called inside an if branch, and not every time a new resource was found! Triggering self.save_user_list() at the end of list_installed() and list_available() does fix it. It may not be optimal to call it every time; however, I prefer doing that and fixing the resource manager over proceeding with a broken manager.
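The pattern behind the bug can be modelled with a toy class. This is not the real OcrdResourceManager, and the actual condition in the code base differs; the sketch only shows why a save that is reachable from a single branch loses entries, and how one unconditional save after the scan avoids that.

```python
from typing import Dict, List


class ToyResourceDb:
    """Toy stand-in for the user database; not the real OcrdResourceManager."""

    def __init__(self) -> None:
        self.memory: Dict[str, List[str]] = {}   # in-memory entries
        self.on_disk: Dict[str, List[str]] = {}  # what a YAML file would hold
        self.saves = 0

    def save(self) -> None:
        self.on_disk = {k: list(v) for k, v in self.memory.items()}
        self.saves += 1

    def scan_buggy(self, found: Dict[str, List[str]]) -> None:
        for executable, names in found.items():
            if executable not in self.memory:
                self.memory[executable] = list(names)
                self.save()  # only reached for new executables
            else:
                self.memory[executable].extend(names)  # never persisted

    def scan_fixed(self, found: Dict[str, List[str]]) -> None:
        for executable, names in found.items():
            self.memory.setdefault(executable, []).extend(names)
        self.save()  # one unconditional save after the scan
```

With scan_buggy, a second resource for an already known executable exists in memory but never reaches the file; with scan_fixed, the file always reflects the in-memory state after a scan.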

Comment on lines -194 to +195
- resdict['path'] = str(res_filename)
+ # resdict['path'] = str(res_filename)
Collaborator

Why not store the path into the returned list of dicts anymore? (So far, it seems to be used only in print_resources, but why not?)

Contributor Author

@MehmedGIT MehmedGIT Mar 11, 2025

@bertsky, since the self.save_user_list() method writes to the resources.yml, the path ended up there as a key, which made the OcrdResourceListValidator fail on subsequent loads of the YAML file. Unfortunately, simply saving before assigning the new path key does not help either. I did not want to make a deep copy of the entire database just for the extra path key in the output, nor did I want to modify the ocrd_tool.schema.yml by adding an extra path field. I will not get rid of the path completely; I just need to figure out how to achieve the same behaviour optimally. Maybe that will become clear after I refactor the database itself.
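One possible middle ground, sketched under assumptions: copy only the entries that are being returned and attach path to those copies, so that the stored entries (and hence the YAML file) never see the extra key. with_paths is a hypothetical helper, and the entries are assumed to be flat dicts with a name key.

```python
from typing import Dict, List


def with_paths(entries: List[Dict[str, str]], paths: Dict[str, str]) -> List[Dict[str, str]]:
    """Return copies of the entries with a 'path' key; the originals stay untouched."""
    result = []
    for entry in entries:
        annotated = dict(entry)  # shallow copy of one entry, not of the whole database
        if entry["name"] in paths:
            annotated["path"] = paths[entry["name"]]
        result.append(annotated)
    return result
```

This costs one shallow copy per listed resource rather than a deep copy of the whole database.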

Collaborator

Oh, I see. Yes, I also would not like to see path in the schema (as this syntax is meant to be shared between ocrd-tool.json developer descriptions and resource.yml installations).

But in this function the final save_user_list() seems redundant, because every add_to_user_database() will already invoke that (for each executable, before adding path).

I don't remember why we originally decided to save the file for every database update. It should be independent IMO.

Also, currently we do the saving twice for every processor, because add_to_user_database() also invokes list_available(), which now finally invokes save_user_list() as well.

Contributor Author

@MehmedGIT MehmedGIT Mar 11, 2025

But in this function the final save_user_list() seems redundant, because every add_to_user_database() will already invoke that (for each executable, before adding path).

True, it seems redundant, but that was part of the fix for the missing resources of #1251. Still not sure why. Also, add_to_user_database() is not called for resources found at module level.

I don't remember why we originally decided to save the file for every database update. It should be independent IMO.

Agree. I will optimize that when I get there. There are unnecessary saves to and loads from the yaml file for each discovered resource.

Also, currently we do the saving twice for every processor, because add_to_user_database() also invokes list_available(), which now finally invokes save_user_list() as well.

Right. I have also spotted that in the logs. I think a simpler search method is needed instead of relying on the list_available() method, which also has other side effects.

I also do not like the database deduplication method. Preventing duplication should perform better than adding and then trying to remove duplications afterwards.

Collaborator

offtopic, but just a few lines above (I cannot suggest outside of the narrow diff hunk context):

instead of...

elif str(res_filename.parent) == moduledir:

...please write...

elif str(res_filename.parent).startswith(moduledir):

(because there are many module-provided data files that are in subdirectories, which now end up as cwd resources)
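The difference between the two conditions can be seen in a small sketch with hypothetical directories. It also shows that a plain string prefix test matches sibling directories sharing the prefix; whether that can occur in practice here is not clear from this thread, so the pathlib-based is_inside helper is only an optional, hypothetical alternative.

```python
from pathlib import PurePosixPath


def is_inside(path: str, directory: str) -> bool:
    """True if path equals directory or lies anywhere below it."""
    p, d = PurePosixPath(path), PurePosixPath(directory)
    return p == d or d in p.parents


moduledir = "/opt/ocrd_example"                # hypothetical module directory
nested = "/opt/ocrd_example/models/sub"        # data file directory below it
sibling = "/opt/ocrd_example_extra/models"     # different directory, same prefix

exact_match = nested == moduledir              # the original condition misses subdirectories
prefix_match = nested.startswith(moduledir)    # the suggested condition finds them
prefix_sibling = sibling.startswith(moduledir) # but also matches the sibling
strict_sibling = is_inside(sibling, moduledir) # component-wise test does not
```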

Collaborator

But in this function the final save_user_list() seems redundant, because every add_to_user_database() will already invoke that (for each executable, before adding path).

True, it seems redundant, but that was part of the fix for the missing resources of #1251.

Yes, that did help – but does not search for and add executables to the database in all cases. So I would still consider this behaviour an open matter.

Still not sure why. Also, add_to_user_database() is not called for resources found at module level.

True. The more I think about it, the less I understand what might have been the original idea behind the user database.

Clearly, @kba wanted to save the time of searching executables and resources repeatedly, hence the shortcuts. But not being able to add executables to the database seems broken, and not having module resources show up as registered/installed resources, too.

I don't remember why we originally decided to save the file for every database update. It should be independent IMO.

Agree. I will optimize that when I get there. There are unnecessary saves to and loads from the yaml file for each discovered resource.

Yes, once we have a clearer idea about the database lifetime, it should be easy to reduce file I/O.

Also, currently we do the saving twice for every processor, because add_to_user_database() also invokes list_available(), which now finally invokes save_user_list() as well.

Right. I have also spotted that in the logs. I think a simpler search method is needed instead of relying on the list_available() method, which also has other side effects.

...like short-cutting via ocrd-all-tool.json if available?

I also do not like the database deduplication method. Preventing duplication should perform better than adding and then trying to remove duplications afterwards.

As discussed in the chat, the list (instead of dict) structure (and hence deduplication) might have been meant to allow keeping additional user-defined versions of the same resource name.
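If the list structure is kept so that several versions of the same resource name can coexist, duplicates could still be prevented at insertion time, as in this sketch. Treating (name, url) as the identity of an entry is purely an assumption, and add_resource is a hypothetical helper.

```python
from typing import Dict, List


def add_resource(resources: List[Dict[str, str]], new: Dict[str, str]) -> bool:
    """Append new unless an entry with the same name and url exists; return True if added."""
    key = (new["name"], new.get("url"))
    for existing in resources:
        if (existing["name"], existing.get("url")) == key:
            return False  # identical entry already present, nothing to deduplicate later
    resources.append(new)
    return True
```

A user-defined variant with the same name but a different url is still accepted, while an exact repeat is dropped before it enters the list.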

Member

The original idea behind writing to the database YAML was to a) speed up lookups b) automatically add new resources when doing a lookup and c) make it extensible, so users could add their own local resources.

But the resource manager currently does none of those properly, and it is very easy to create invalid data, add fixed paths that should be dynamic (module resources), or end up with clashing names.

Also the way some of the functionality is handled in utils functions, other functionality in the manager class and yet other functionality in the CLI, is messy.

I'm open to (radical) refactoring.

@bertsky
Collaborator

bertsky commented Mar 12, 2025

note: this PR so far does not fix the described problems entirely (not even with #1315).

Still, new database entries (i.e. new executables) only get added when

  1. calling list_available (or ocrd resmgr list-available) with a glob pattern or executable name, and dynamic=True
  2. calling list_installed (or ocrd resmgr list-installed) with an executable name
  3. calling list_installed (or ocrd resmgr list-installed) without an executable name, if it happens to already have a directory ocrd-... under /usr/local/share/ocrd-resources or $XDG_DATA_HOME/ocrd-resources
  4. calling handle_resource (or ocrd resmgr download without -D) with an executable name

But not:

  • list_available without an executable or with dynamic=False, because it then short-circuits:
    def list_available(
        self, executable: str = None, dynamic: bool = True, name: str = None, database: Dict = None, url: str = None
    ):
        """
        List models available for download by processor
        """
        if not database:
            database = self.database
        if not executable:
            return database.items()
        if dynamic:
            self._search_executables(executable)
            self.save_user_list()
        found = False
        ret = []
        for k in database:
  • ocrd resmgr download "*" (for the same reason)
  • list_installed without an executable (for the same reason)

So ocrd resmgr download (for specific executables) now adds entries, and ocrd resmgr list-available (for the default ocrd-*) searches the PATH and adds respective entries. And for already added entries or explicit executables, in list_installed the module directory now gets searched. But we are still not doing a search in all expected circumstances. And we are nowhere utilising the list of executables in ocrd-all-tool.json if present (bypassing a PATH search).
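A discovery step that prefers ocrd-all-tool.json and only falls back to a PATH search might look like the sketch below. find_executables is hypothetical, and it assumes that the top-level keys of the JSON file are the executable names; it does not reflect the current code.

```python
import json
import os
from pathlib import Path
from typing import List, Optional


def find_executables(all_tool_json: Optional[Path] = None, path_env: Optional[str] = None) -> List[str]:
    """List ocrd-* executables from a tool JSON file if it exists, else from PATH."""
    if all_tool_json is not None and all_tool_json.is_file():
        # assumption: the top-level keys of the JSON file are executable names
        with all_tool_json.open(encoding="utf-8") as f:
            return sorted(json.load(f))
    search_path = path_env if path_env is not None else os.environ.get("PATH", "")
    found = set()
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        for entry in os.listdir(directory):
            full = os.path.join(directory, entry)
            if entry.startswith("ocrd-") and os.path.isfile(full) and os.access(full, os.X_OK):
                found.add(entry)
    return sorted(found)
```

Calling it once, independently of list_available and list_installed, would make the search happen in all circumstances rather than only in the branches listed above.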

@MehmedGIT
Contributor Author

Noted and appended your comment as a reference to the top. Thanks!

@kba kba merged commit 2e9dab5 into master Mar 25, 2025
6 of 22 checks passed
@bertsky
Collaborator

bertsky commented Mar 25, 2025

@kba what? wait! This was still a draft. And shouldn't you have merged #1315 into this first?

@bertsky
Collaborator

bertsky commented Mar 25, 2025

https://github.com/OCR-D/core/releases/tag/v3.2.0 is also bogus – it does not include #1315

@MehmedGIT
Contributor Author

MehmedGIT commented Mar 25, 2025

@kba what? wait! This was still a draft. And shouldn't you have merged #1315 into this first?

I did not react because I can also create another PR for the network client, request forwarding over the Processing Server, etc. However, I am also fine if this PR is reverted and I continue here. I agree about #1315.

@kba
Member

kba commented Mar 27, 2025

Yeah, I messed up, this was not intentional. I'll try to fix it :(

@kba
Member

kba commented Mar 27, 2025

@kba what? wait! This was still a draft. And shouldn't you have merged #1315 into this first?

I did not realize that #1315 being based on #1309 was the problem.

https://github.com/OCR-D/core/releases/tag/v3.2.0 is also bogus – it does not include #1315

master (and the release) does include #1315, that PR is only open because it is based on #1309 and #1315 has not been merged into #1309.

I'll revert the merge and release a hotfix without it.

We'll need to rename the branch and open a new PR though, GitHub does not allow continuing working on a merged branch AFAIK.

I should (and from now on will) create release branches so that I spot these oversights beforehand.

@MehmedGIT
Contributor Author

We'll need to rename the branch and open a new PR though, GitHub does not allow continuing working on a merged branch AFAIK.

@kba Whatever is faster and easier on your end. I am fine with a new branch and PR to continue working on the RM Server.

@bertsky
Collaborator

bertsky commented Mar 27, 2025

master (and the release) does include #1315, that PR is only open because it is based on #1309 and #1315 has not been merged into #1309.

Oh, I see. Sorry, did not notice.

I'll revert the merge and release a hotfix without it.

That's to avoid having an incomplete version of the ResourceManagerServer in master and release, right?

If so, I'm for that.

We'll need to rename the branch and open a new PR though, GitHub does not allow continuing working on a merged branch AFAIK.

Ok, if that's required, so be it. We should be careful not to lose sight of our discussion, insofar as it still matters, e.g.
