
Conversation

@MehmedGIT (Contributor)

I have created a separate method to get an instance of the desired processor class. Both cached and non-cached instantiation are supported. By default, the non-cached processor is returned, preserving the current behavior. The cached version can be tried by setting the additional flag parameter to True.
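
A minimal sketch of the pattern described above, using the kwarg name instance_caching that the flag eventually received later in this thread; the signature and cache size are assumptions, not the PR's actual code:

    from functools import lru_cache

    def get_processor(processor_class, parameter, workspace=None, instance_caching=False):
        # By default, build a fresh instance (preserving current behavior);
        # with instance_caching=True, reuse a cached instance instead.
        if instance_caching:
            return get_cached_processor(processor_class, frozenset(parameter.items()))
        return processor_class(workspace, parameter=parameter)

    @lru_cache(maxsize=32)
    def get_cached_processor(processor_class, frozen_parameter):
        # lru_cache needs hashable arguments, hence the frozen parameter dict
        # (this assumes all parameter values are hashable)
        return processor_class(None, parameter=dict(frozen_parameter))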

@bertsky (Collaborator) left a comment

A few comments from my side:

So you are caching multiple processor instances, giving them time to parse parameters and load models into memory. I suppose this is all meant for the single-worker model carved out by Triet? I am asking because with multi-threading/multi-processing it will be problematic to "steal" vital runtime objects like Processor.workspace and Processor.input_file_grp from the existing/cached instances. But in any case, IMO you still need some function for cache cleaning. (And for GPU-enabled processors, disallowing caching or setting the cache size at runtime to prevent GPU RAM OOM might be necessary, too.)

Furthermore, compared to my own refactoring and networking effort from two years ago, a couple of differences stick out (omitting the networking side):

  • your PR offers multiple instances, each with their own parameters, instead of just one (good!)
  • my PR also exposes run_api for what in your case would be run_processor(...cached_processor=True); I recommend offering the same interface (i.e. get_processor plus run_api)
  • my PR catches runtime exceptions and allows addressing them differently via run_api
  • my PR encapsulates the CWD changes necessary during processing (storing the old CWD prior to the run and restoring it afterwards, even in the case of failure; see the sketch after this list); I believe your current draft will fail to resolve input files when caching is allowed
  • my PR also addresses the task_sequence.ProcessorTask API: it tries to find the class and instance from the executable name (this part is naturally messy), then uses the actual Processor instance for parameter validation (even avoiding --dump-json callouts); not sure you want this, but then clearly the Processor class itself should expose parameter validation (currently only available in the constructor); also, IMO it is worthwhile on your side to already think about how to backport the task_sequence.run_tasks functionality (i.e. ocrd process) for processing servers (as a minimalistic/fallback/prototype use case)
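
A rough sketch of the CWD encapsulation referred to in the fourth bullet; run_api is the name from the referenced PR, while the body here is only an assumption of how such a wrapper can look:

    import os

    def run_api(processor, workspace):
        # Store the old CWD prior to the run and restore it afterwards,
        # even in the case of failure, so that input-file resolution in
        # one (cached) run cannot leak into the next.
        old_cwd = os.getcwd()
        os.chdir(workspace.directory)
        try:
            processor.workspace = workspace
            processor.process()
        finally:
            os.chdir(old_cwd)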

@MehmedGIT (Contributor, Author)

Hi @bertsky. The idea of this PR is to refactor the run_processor method and hide the instantiation of an OCR-D processor behind a separate method. I thought it would be bad to just use the cached version from Triet's PR (#884), so I decided to make it flexible: keep the default as it is now, but still provide a way to use the cached version.

Currently, it's not possible to get a processor instance outside of run_processor, and we need such a method in the network package we are implementing. The ocrd_network package realizes the architecture proposed by Triet. Inside the network package we have our ProcessingWorker agent, which will indeed have its own wrappers for run_cli and run_processor, e.g. run_cli_from_worker and run_processor_from_worker, to address the points you raised.
We are trying to implement this entire networking functionality while altering the existing, working code base as little as possible.

Importing the Processor fails no matter where it is imported from.
@kba (Member) left a comment

Except for the caching, which we will need to revisit once the ocrd_network package is refactored, this looks like a relatively small change.

I would propose creating a draft PR for the ocrd_network stuff soon and transferring the points raised by @bertsky to that discussion, so we can continue discussing with the context of the full changeset.

Review thread on the diff:

    workspace,
    processor = get_processor(
    # TODO: Warning: processorClass of type Object gets auto casted to Type[Processor]

@kba (Member)

How do you mean? Is the class "downgraded" to an ocrd.Processor?

@MehmedGIT (Contributor, Author)

The comment is no longer relevant if no type annotation is provided for processorClass; I should have removed it. I was not sure whether the auto-casting might be problematic here, since my IDE highlighted it as a potential problem.

@MehmedGIT (Contributor, Author)

Regarding the draft PR for the ocrd_network package: that will come sometime soon. We are first trying to implement a working prototype before opening it up for discussion.

@kba (Member) commented Jan 17, 2023

Regarding the draft PR for the ocrd_network package: that will come sometime soon. We are first trying to implement a working prototype before opening it up for discussion.

Sure, no problem, I just want to make sure we don't lose the discussion thread in a closed PR.

@bertsky (Collaborator) commented Jan 21, 2023

The idea of this PR is to refactor the run_processor method and hide the instantiation of an OCR-D processor behind a separate method. I thought it would be bad to just use the cached version from Triet's PR (#884), so I decided to make it flexible: keep the default as it is now, but still provide a way to use the cached version.

ah, sorry, I should have commented in #884 first. (I thought I could home in from the "outside" ;-)

ok, so here you merely wanted to provide an uncached version out of that, too.

Currently, it's not possible to get a processor instance outside of run_processor and we need such a method in the network package we are implementing.

But you do have get_processor already!

The ocrd_network package realizes the architecture proposed by Triet. Inside the network package we have our ProcessingWorker agent, which will indeed have its own wrappers for run_cli and run_processor, e.g. run_cli_from_worker and run_processor_from_worker, to address the points you raised.

Ok, I think I understand now. I like run_*_from_worker better than run_*_from_api BTW. (To me, API would by default stand for Python API, not Web API.)

So, I can already see most of my above points solved in #884:

  • CWD switching
  • job validation prior to actual runtime
  • exception handling (logging + conditionally re-exposing)

But I am still missing:

  • resetting CWD in case of failure (in my implementation, resetting CWD was part of the inner finally clause; but in your implementation, the CWD handling is in run_processor while the exception handling happens in the caller)
  • some function for (processor instance) cache cleaning (and, for GPU-enabled Processor heirs, disallowing such caching or setting the cache size at runtime in order to prevent GPU RAM OOM)

@MehmedGIT (Contributor, Author)

But you do have get_processor already!

If I directly reuse the get_processor from #884, it may end up causing problems because of the caching (some of which you already covered).

But in any case, IMO you still need some function for cache cleaning. (And for GPU-enabled processors, disallowing caching or setting the cache size at runtime to prevent GPU RAM OOM might be necessary, too.)

You said this in your previous comment, didn't you? Exactly: to avoid scenarios currently unknown to us, I did not just reuse the already available get_processor directly. Instead, I renamed that method to get_cached_processor and wrapped it in a method get_processor which, based on a flag, returns the cached or non-cached processor, and then reused that method inside run_processor.

Ok, I think I understand now. I like run_*_from_worker better than run_*_from_api BTW. (To me, API would by default stand for Python API, not Web API.)

Yeah, I usually try to select names that won't potentially confuse others. For the same reason I like processing worker more than processing server when referring to an agent that is in fact just an OCR-D processor, which is not exposed to the user directly but indirectly through a processing broker (aka processing server in the new terminology).

But I am still missing:

I am still not sure whether you completely get the idea behind the PR. It's not supposed to provide a fully working extension of run_processor that covers all the requirements you have mentioned. The main question is: does this PR break the core in any way (assuming, of course, that the cached_processor flag of run_processor is never set to True)? If so, let me know what is problematic or missing.

One thing I now see I have missed is placing a documentation comment that says not to set this flag. That would prevent others from getting confused and thinking it's safe to set. And that is probably why you are a bit confused. Another potential reason for your confusion is probably the fact that #884 was dropped and won't be merged into core. It's there only as a reference for Jonas and me while the ocrd_network package is being implemented.

@bertsky (Collaborator) commented Jan 23, 2023

If I directly reuse the get_processor from #884, it may end up causing problems because of the caching (some of which you already covered).
... to avoid scenarios currently unknown to us, I did not just reuse the already available get_processor directly. Instead, I renamed that method to get_cached_processor and wrapped it in a method get_processor which, based on a flag, returns the cached or non-cached processor, and then reused that method inside run_processor.

I can see that. What I asked for is some additional mechanism which can then be used via Web API for resetting the cache (using get_cached_processor.cache_clear()) and changing the size (using get_cached_processor.cache_info().maxsize = ...).
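
A side note on stock functools.lru_cache: cache_clear() works as described, but cache_info() returns a read-only namedtuple, so assigning to its maxsize raises an AttributeError; resizing requires re-wrapping the undecorated function. A sketch, assuming get_cached_processor is the lru_cache-decorated factory from the sketch above:

    import functools

    # Clearing the instance cache works out of the box:
    get_cached_processor.cache_clear()

    # Resizing does not; one workaround is to re-wrap with a new maxsize:
    def resize_processor_cache(maxsize):
        global get_cached_processor
        get_cached_processor = functools.lru_cache(maxsize=maxsize)(
            get_cached_processor.__wrapped__)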

(Also, perhaps instance_caching would be a better kwarg than cached_processor?)

But I am still missing: ...

I am still not sure whether you completely get the idea behind the PR. It's not supposed to provide a fully working extension of run_processor that covers all the requirements you have mentioned. The main question is: does this PR break the core in any way (assuming, of course, that the cached_processor flag of run_processor is never set to True)? If so, let me know what is problematic or missing.

One thing I now see I have missed is placing a documentation comment that says not to set this flag. That would prevent others from getting confused and thinking it's safe to set. And that is probably why you are a bit confused. Another potential reason for your confusion is probably the fact that #884 was dropped and won't be merged into core. It's there only as a reference for Jonas and me while the ocrd_network package is being implemented.

I am not confused (any more). I have already stated that these latter points are about the direction #884 is going. And since we are already discussing them here, and AFAICS there is no other PR to discuss dev-processing-broker and ocrd-webapi-implementation yet, why not?

@kba marked this pull request as ready for review on February 13, 2023 14:22
@MehmedGIT (Contributor, Author) commented Feb 14, 2023

(Also, perhaps instance_caching would be a better kwarg than cached_processor?)

Agreed, I have changed it.

And since we are already discussing them here, and AFAICS there is no other PR to discuss dev-processing-broker and ocrd-webapi-implementation yet, why not?

We are almost there with the PR for the Processing Server. The current draft PR: #974

I can see that. What I asked for is some additional mechanism which can then be used via Web API for resetting the cache (using get_cached_processor.cache_clear()) and changing the size (using get_cached_processor.cache_info().maxsize = ...).

Providing endpoints for that is not a problem, but isn't this something to be handled internally rather than through the Web API? I am still not sure how .cache_clear() or changing .cache_info().maxsize = ... should be handled by an already running processing worker. There may be other complications arising from modifying the cache from the outside. We should further discuss this in #974 once that PR is ready.

I have also found this: an interesting article regarding garbage collection with caching, which may cause problems in the future if not properly taken care of.
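
For context: functools.lru_cache keeps strong references to its cached values, so cached processor instances (and any models they hold) are not garbage-collected until they are evicted or the cache is cleared. A self-contained demonstration (Model is just a stand-in):

    import functools
    import gc
    import weakref

    class Model:  # stand-in for an expensive, memory-hungry model
        pass

    @functools.lru_cache(maxsize=8)
    def load_model(name):
        return Model()

    ref = weakref.ref(load_model('frk'))
    gc.collect()
    assert ref() is not None  # still alive: the cache holds a strong reference
    load_model.cache_clear()
    gc.collect()
    assert ref() is None      # collectable only after clearing the cache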

#972 should now be ready to be merged.

@bertsky (Collaborator) commented Feb 14, 2023

What I asked for is some additional mechanism which can then be used via Web API for resetting the cache (using get_cached_processor.cache_clear()) and changing the size (using get_cached_processor.cache_info().maxsize = ...).

Providing endpoints for that is not a problem, but isn't this something to be handled internally rather than through the Web API?

I stand corrected. For management of processor instances, no Web API or web anything is necessary. If it's the Processing Worker (in the queue model) or the Processing Server (in the integrated model), that component can just control its internal cache references via Python API.

I am still not sure how .cache_clear() or changing .cache_info().maxsize = ... should be handled by an already running processing worker.

Ideally, it enforces some (preconfigured) max-workers size, and updates the cache whenever instances crash. For GPU workers, the memory consumption must also be watched – but by adding the envvar, you already gave the admin some control.
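
This is roughly how such an envvar can bound the cache; a sketch under the assumption that the variable is called OCRD_MAX_PROCESSOR_CACHE (the actual name used in the PR may differ):

    import os
    from functools import lru_cache

    # Cache size is read once from the environment at import time,
    # giving the admin control over how many instances may be cached:
    MAX_PROCESSOR_CACHE = int(os.environ.get('OCRD_MAX_PROCESSOR_CACHE', '128'))

    @lru_cache(maxsize=MAX_PROCESSOR_CACHE)
    def get_cached_processor(processor_class, frozen_parameter):
        return processor_class(None, parameter=dict(frozen_parameter))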

#972 should now be ready to be merged.

Agreed.

@kba (Member) commented Feb 15, 2023

The CI failure is only on macOS, merging.

@kba merged commit 456f040 into master on Feb 15, 2023
@kba deleted the ref-processor-helper branch on February 15, 2023 11:44