Processor builder outside the run_processor + cached version #972
Conversation
A few comments from my side:
So you are caching multiple processor instances, giving them time to parse parameters and load models into memory. I suppose this is all meant for the single-worker model carved out by Triet? I am asking because in multi-threading/processing it will be problematic to "steal" vital runtime objects like Process.workspace and Process.input_file_grp from the existing/cached instances. But in any case, IMO you still need some function for cache cleaning. (And for GPU-enabled processors, disallowing caching or setting the cache size at runtime in order to prevent GPU RAM OOM might be necessary, too.)
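To make the concern concrete, here is a minimal sketch of what such an instance cache with explicit cleaning and a size bound could look like. All names (`ProcessorCache`, the stand-in `Processor` class) are illustrative, not the actual OCR-D core API:

```python
import json
from collections import OrderedDict

class Processor:
    """Stand-in for ocrd.Processor: expensive to construct (parses params, loads models)."""
    def __init__(self, parameter=None):
        self.parameter = parameter or {}

class ProcessorCache:
    """LRU cache of processor instances keyed by (class, frozen parameters)."""
    def __init__(self, max_size=4):
        self.max_size = max_size  # bound the cache to limit (GPU) memory use
        self._cache = OrderedDict()

    def get(self, processor_class, parameter=None):
        # JSON-serialize parameters so unhashable dicts become a usable cache key
        key = (processor_class, json.dumps(parameter, sort_keys=True))
        if key in self._cache:
            self._cache.move_to_end(key)      # LRU bookkeeping on a hit
            return self._cache[key]
        instance = processor_class(parameter=parameter)
        self._cache[key] = instance
        if len(self._cache) > self.max_size:  # evict least recently used
            self._cache.popitem(last=False)
        return instance

    def clear(self):
        """Cache cleaning, e.g. to free GPU RAM between jobs."""
        self._cache.clear()

cache = ProcessorCache(max_size=2)
a = cache.get(Processor, {"model": "default"})
b = cache.get(Processor, {"model": "default"})
assert a is b  # same parameters -> same cached instance
```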
Furthermore, comparing to my own refactoring and networking from 2yrs ago, a couple of differences stick out. (Omitting the networking side:)
- your PR offers multiple instances each by their own parameters instead of just one (good!)
- my PR also exposes `run_api` for what in your case would be `run_processor(..., cached_processor=True)`; I recommend offering the same interface (i.e. `get_processor` plus `run_api`)
- my PR catches runtime exceptions and allows addressing them differently via `run_api`
- my PR encapsulates the CWD changes necessary during processing (storing the old CWD prior to the run, and restoring it afterwards, even in the case of failure); I believe your current draft will fail resolving input files when caching is allowed
- my PR also addresses the `task_sequence.ProcessorTask` API: it tries to find the class and instance from the executable name (this part is naturally messy), then uses the actual Processor instance for parameter validation (even avoiding `--dump-json` callouts); not sure you want this, but then clearly the Processor class itself should expose parameter validation (currently only in the constructor); also, IMO it is worthwhile on your side to already think about how to backport the `task_sequence.run_tasks` functionality (i.e. `ocrd process`) for processing servers (as a minimalistic/fallback/prototype use-case)
Hi @bertsky. The idea of this PR is to refactor the processor instantiation. Currently, it's not possible to get a processor instance outside of `run_processor`.
Importing the Processor fails no matter from where it is imported.
kba left a comment:
Except for the caching, which we will need to revisit once the ocrd_network package is refactored, this looks like a relatively small change.
I would propose to create a draft PR for the ocrd_network stuff soon and transfer the points raised by @bertsky to the discussion of that, so we can continue discussing with the context of the full changeset.
ocrd/ocrd/processor/helpers.py (outdated)

    workspace,
    processor = get_processor(
    # TODO: Warning: processorClass of type Object gets auto casted to Type[Processor]
How do you mean? Is the class "downgraded" to an `ocrd.Processor`?
The comment is not relevant anymore if no typing is provided for the processorClass. I should have removed it. I was not sure if the auto-casting may potentially be problematic here since my IDE highlighted it as a potential problem.
Regarding the draft PR for the ocrd_network package: that will come sometime soon. We are first trying to implement a working prototype before opening it up for discussion.
Sure, no problem, I just want to make sure we don't lose the discussion thread in a closed PR.
ah, sry, should have commented in #884 first. (I thought I could home in from the "outside" ;-) ok, so here you merely wanted to provide an uncached version out of that, too.
But you do have
Ok, I think I understand now. So I can already see most of my above points solved in #884:
But I am still missing:
If I directly reuse the
You said this in your previous comment, didn't you? Exactly: to avoid scenarios currently unknown to us, I did not just reuse the already available
Yeah, I am usually trying to select names that potentially won't confuse others. For the same reason I like
I am still not sure whether you completely get the idea behind the PR. It's not supposed to provide a full working extension for One thing I now see I have missed is to place a documentation comment that says not to set this flag. This will prevent others from getting confused and thinking it's safe to set. And that's probably why you are a bit confused. Another potential reason for your confusion is probably the fact that #884 was dropped and won't be merged into core. It's there only for reference for me and Jonas while the
I can see that. What I asked for is some additional mechanism which can then be used via the Web API for resetting the cache. (Also, perhaps
I am not confused (any more). I have already stated that these latter points are about the direction #884 is going. And since we are already discussing them here, and AFAICS there is no other PR to discuss
Agree, I have changed it.
We are almost there with the PR for the Processing Server. The current draft PR: #974
Providing endpoints for that is not a problem, but isn't this something to be handled internally rather than through the Web API? I am still not sure how should the I have also found an interesting article regarding garbage collection with caching, which may cause problems in the future if not properly taken care of. #972 should now be ready to be merged.
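The garbage-collection caveat can be demonstrated in a few lines: `functools.lru_cache` holds strong references, so cached instances (and any models they hold) are never freed until the cache is cleared explicitly. This is only an illustration of the pitfall, not code from the PR:

```python
import gc
import weakref
from functools import lru_cache

class Model:
    """Stand-in for a large in-memory model."""
    pass

@lru_cache(maxsize=8)
def load_model(name):
    return Model()

ref = weakref.ref(load_model("big"))
gc.collect()
assert ref() is not None  # still alive: the cache keeps a strong reference

load_model.cache_clear()
gc.collect()
assert ref() is None      # only now can the instance be collected
```

So any long-lived cache of processor instances needs a deliberate clearing policy, or the memory (including GPU RAM) stays allocated for the lifetime of the worker.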
Co-authored-by: Robert Sachunsky <[email protected]>
I stand corrected. For management of processor instances, no Web API or web anything is necessary. If it's the Processing Worker (in the queue model) or the Processing Server (in the integrated model), that component can just control its internal cache references via Python API.
Ideally, it enforces some (preconfigured) max-workers size, and updates the cache whenever instances crash. For GPU workers, the memory consumption must also be watched – but by adding the envvar, you already gave the admin some control.
Agreed.
CI failure is only for macOS, merging.
I have created a separate method to get the desired processor class. Both cached and non-cached types are supported. By default, the non-cached processor is returned to preserve the current state. It's possible to try the cached version by setting the additional flag parameter to True.
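A hedged sketch of the interface described above (the names `get_processor` and `instance_caching` approximate the PR's wording and are not guaranteed to match core exactly): by default a fresh processor is constructed, preserving current behaviour, and the flag opts in to the cached variant.

```python
import json
from functools import lru_cache

class Processor:
    """Stand-in for ocrd.Processor."""
    def __init__(self, parameter=None):
        self.parameter = json.loads(parameter) if parameter else {}

@lru_cache(maxsize=32)
def _get_cached_processor(processor_class, parameter_json):
    # lru_cache requires hashable arguments, hence the JSON string key
    return processor_class(parameter=parameter_json)

def get_processor(processor_class, parameter=None, instance_caching=False):
    parameter_json = json.dumps(parameter or {}, sort_keys=True)
    if instance_caching:
        return _get_cached_processor(processor_class, parameter_json)
    return processor_class(parameter=parameter_json)  # default: non-cached

fresh1 = get_processor(Processor, {"x": 1})
fresh2 = get_processor(Processor, {"x": 1})
assert fresh1 is not fresh2  # default flag preserves the current state

cached1 = get_processor(Processor, {"x": 1}, instance_caching=True)
cached2 = get_processor(Processor, {"x": 1}, instance_caching=True)
assert cached1 is cached2    # opt-in caching returns the same instance
```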