
Conversation

@MehmedGIT (Contributor)

I have created a separate method to get an instance of the desired processor class. Both cached and non-cached instantiation are supported. By default, the non-cached processor is returned, preserving the current behavior. The cached version can be tried by setting the additional flag parameter to True.
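
A minimal sketch of the pattern described above, using the kwarg name instance_caching that the flag eventually received later in this thread; the signature and cache size are assumptions, not the PR's actual code:

    from functools import lru_cache

    def get_processor(processor_class, parameter, workspace=None, instance_caching=False):
        # By default, build a fresh instance (preserving current behavior);
        # with instance_caching=True, reuse a cached instance instead.
        if instance_caching:
            return get_cached_processor(processor_class, frozenset(parameter.items()))
        return processor_class(workspace, parameter=parameter)

    @lru_cache(maxsize=32)
    def get_cached_processor(processor_class, frozen_parameter):
        # lru_cache needs hashable arguments, hence the frozen parameter dict
        # (this assumes all parameter values are hashable)
        return processor_class(None, parameter=dict(frozen_parameter))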

@bertsky (Collaborator) left a comment

A few comments from my side:

So you are caching multiple processor instances, giving them time to parse parameters and load models into memory. I suppose this is all meant for the single-worker model carved out by Triet? I am asking because with multi-threading/multi-processing it will be problematic to "steal" vital runtime objects like Processor.workspace and Processor.input_file_grp from the existing/cached instances. But in any case, IMO you still need some function for cache cleaning. (And for GPU-enabled processors, disallowing caching or setting the cache size at runtime to prevent GPU RAM OOM might be necessary, too.)

Furthermore, compared to my own refactoring and networking effort from two years ago, a couple of differences stick out (omitting the networking side):

  • your PR offers multiple instances, each with their own parameters, instead of just one (good!)
  • my PR also exposes run_api for what in your case would be run_processor(...cached_processor=True); I recommend offering the same interface (i.e. get_processor plus run_api)
  • my PR catches runtime exceptions and allows addressing them differently via run_api
  • my PR encapsulates the CWD changes necessary during processing (storing the old CWD prior to the run and restoring it afterwards, even in the case of failure; see the sketch after this list); I believe your current draft will fail to resolve input files when caching is allowed
  • my PR also addresses the task_sequence.ProcessorTask API: it tries to find the class and instance from the executable name (this part is naturally messy), then uses the actual Processor instance for parameter validation (even avoiding --dump-json callouts); not sure you want this, but then clearly the Processor class itself should expose parameter validation (currently only available in the constructor); also, IMO it is worthwhile on your side to already think about how to backport the task_sequence.run_tasks functionality (i.e. ocrd process) for processing servers (as a minimalistic/fallback/prototype use case)
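
A rough sketch of the CWD encapsulation referred to in the fourth bullet; run_api is the name from the referenced PR, while the body here is only an assumption of how such a wrapper can look:

    import os

    def run_api(processor, workspace):
        # Store the old CWD prior to the run and restore it afterwards,
        # even in the case of failure, so that input-file resolution in
        # one (cached) run cannot leak into the next.
        old_cwd = os.getcwd()
        os.chdir(workspace.directory)
        try:
            processor.workspace = workspace
            processor.process()
        finally:
            os.chdir(old_cwd)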

@MehmedGIT (Contributor, Author)

Hi @bertsky. The idea of this PR is to refactor the run_processor method and hide the instantiation of an OCR-D processor behind a separate method. I thought it would be bad to just use the cached version from Triet's PR (#884), so I decided to make it flexible: keep the default as it is now, but still provide a way to use the cached version.

Currently, it's not possible to get a processor instance outside of run_processor, and we need such a method in the network package we are implementing. The ocrd_network package realizes the architecture proposed by Triet. Inside the network package we have our ProcessingWorker agent, which will indeed have its own wrappers for run_cli and run_processor, e.g. run_cli_from_worker and run_processor_from_worker, to address the points you raised.
We are trying to implement this entire networking functionality while altering the existing, working code base as little as possible.

Importing the Processor fails no matter where it is imported from.
@kba (Member) left a comment

Except for the caching, which we will need to revisit once the ocrd_network package is refactored, this looks like a relatively small change.

I would propose creating a draft PR for the ocrd_network stuff soon and transferring the points raised by @bertsky to that discussion, so we can continue discussing with the context of the full changeset.

Review thread on the diff:

    workspace,
    processor = get_processor(
    # TODO: Warning: processorClass of type Object gets auto casted to Type[Processor]

@kba (Member)

How do you mean? Is the class "downgraded" to an ocrd.Processor?

@MehmedGIT (Contributor, Author)

The comment is no longer relevant if no type annotation is provided for processorClass; I should have removed it. I was not sure whether the auto-casting might be problematic here, since my IDE highlighted it as a potential problem.

@MehmedGIT (Contributor, Author)

Regarding the draft PR for the ocrd_network package: that will come sometime soon. We are first trying to implement a working prototype before opening it up for discussion.

@kba (Member) commented Jan 17, 2023

Regarding the draft PR for the ocrd_network package: that will come sometime soon. We are first trying to implement a working prototype before opening it up for discussion.

Sure, no problem, I just want to make sure we don't lose the discussion thread in a closed PR.

@bertsky (Collaborator) commented Jan 21, 2023

The idea of this PR is to refactor the run_processor method and hide the instantiation of an OCR-D processor behind a separate method. I thought it would be bad to just use the cached version from Triet's PR (#884), so I decided to make it flexible: keep the default as it is now, but still provide a way to use the cached version.

ah, sorry, I should have commented in #884 first. (I thought I could home in from the "outside" ;-)

ok, so here you merely wanted to provide an uncached version out of that, too.

Currently, it's not possible to get a processor instance outside of run_processor and we need such a method in the network package we are implementing.

But you do have get_processor already!

The ocrd_network package realizes the architecture proposed by Triet. Inside the network package we have our ProcessingWorker agent, which will indeed have its own wrappers for run_cli and run_processor, e.g. run_cli_from_worker and run_processor_from_worker, to address the points you raised.

Ok, I think I understand now. I like run_*_from_worker better than run_*_from_api BTW. (To me, API would by default stand for Python API, not Web API.)

So, I can already see most of my above points solved in #884:

  • CWD switching
  • job validation prior to actual runtime
  • exception handling (logging + conditionally re-exposing)

But I am still missing:

  • resetting CWD in case of failure (in my implementation, resetting CWD was part of the inner finally clause; but in your implementation, the CWD handling is in run_processor while the exception handling happens in the caller)
  • some function for (processor instance) cache cleaning (and, for GPU-enabled Processor heirs, disallowing such caching or setting the cache size at runtime in order to prevent GPU RAM OOM)

@MehmedGIT (Contributor, Author)

But you do have get_processor already!

If I directly reuse the get_processor from #884, it may end up causing problems because of the caching (some of which you already covered).

But in any case, IMO you still need some function for cache cleaning. (And for GPU-enabled processors, disallowing caching or setting the cache size at runtime to prevent GPU RAM OOM might be necessary, too.)

You said this in your previous comment, didn't you? Exactly: to avoid scenarios currently unknown to us, I did not just reuse the already available get_processor directly. Instead, I renamed that method to get_cached_processor and wrapped it in a method get_processor which, based on a flag, returns the cached or non-cached processor, and then reused that method inside run_processor.

Ok, I think I understand now. I like run_*_from_worker better than run_*_from_api BTW. (To me, API would by default stand for Python API, not Web API.)

Yeah, I usually try to select names that won't potentially confuse others. For the same reason I like processing worker more than processing server when referring to an agent that is in fact just an OCR-D processor, which is not exposed to the user directly but indirectly through a processing broker (aka processing server in the new terminology).

But I am still missing:

I am still not sure whether you completely get the idea behind the PR. It's not supposed to provide a fully working extension of run_processor that covers all the requirements you have mentioned. The main question is: does this PR break the core in any way (assuming, of course, that the cached_processor flag of run_processor is never set to True)? If so, let me know what is problematic or missing.

One thing I now see I have missed is placing a documentation comment that says not to set this flag. That would prevent others from getting confused and thinking it's safe to set. And that is probably why you are a bit confused. Another potential reason for your confusion is probably the fact that #884 was dropped and won't be merged into core. It's there only as a reference for Jonas and me while the ocrd_network package is being implemented.

@bertsky (Collaborator) commented Jan 23, 2023

If I directly reuse the get_processor from #884, it may end up causing problems because of the caching (some of which you already covered).
... to avoid scenarios currently unknown to us, I did not just reuse the already available get_processor directly. Instead, I renamed that method to get_cached_processor and wrapped it in a method get_processor which, based on a flag, returns the cached or non-cached processor, and then reused that method inside run_processor.

I can see that. What I asked for is some additional mechanism which can then be used via Web API for resetting the cache (using get_cached_processor.cache_clear()) and changing the size (using get_cached_processor.cache_info().maxsize = ...).
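
A side note on stock functools.lru_cache: cache_clear() works as described, but cache_info() returns a read-only namedtuple, so assigning to its maxsize raises an AttributeError; resizing requires re-wrapping the undecorated function. A sketch, assuming get_cached_processor is the lru_cache-decorated factory from the sketch above:

    import functools

    # Clearing the instance cache works out of the box:
    get_cached_processor.cache_clear()

    # Resizing does not; one workaround is to re-wrap with a new maxsize:
    def resize_processor_cache(maxsize):
        global get_cached_processor
        get_cached_processor = functools.lru_cache(maxsize=maxsize)(
            get_cached_processor.__wrapped__)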

(Also, perhaps instance_caching would be a better kwarg than cached_processor?)

But I am still missing: ...

I am still not sure whether you completely get the idea behind the PR. It's not supposed to provide a fully working extension of run_processor that covers all the requirements you have mentioned. The main question is: does this PR break the core in any way (assuming, of course, that the cached_processor flag of run_processor is never set to True)? If so, let me know what is problematic or missing.

One thing I now see I have missed is placing a documentation comment that says not to set this flag. That would prevent others from getting confused and thinking it's safe to set. And that is probably why you are a bit confused. Another potential reason for your confusion is probably the fact that #884 was dropped and won't be merged into core. It's there only as a reference for Jonas and me while the ocrd_network package is being implemented.

I am not confused (any more). I have already stated that these latter points are about the direction #884 is going. And since we are already discussing them here, and AFAICS there is no other PR to discuss dev-processing-broker and ocrd-webapi-implementation yet, why not?

@kba marked this pull request as ready for review on February 13, 2023 14:22
@MehmedGIT (Contributor, Author) commented Feb 14, 2023

(Also, perhaps instance_caching would be a better kwarg than cached_processor?)

Agreed, I have changed it.

And since we are already discussing them here, and AFAICS there is no other PR to discuss dev-processing-broker and ocrd-webapi-implementation yet, why not?

We are almost there with the PR for the Processing Server. The current draft PR: #974

I can see that. What I asked for is some additional mechanism which can then be used via Web API for resetting the cache (using get_cached_processor.cache_clear()) and changing the size (using get_cached_processor.cache_info().maxsize = ...).

Providing endpoints for that is not a problem, but isn't this something to be handled internally rather than through the Web API? I am still not sure how .cache_clear() or changing .cache_info().maxsize = ... should be handled by an already running processing worker. There may be other complications arising from modifying the cache from the outside. We should further discuss this in #974 once that PR is ready.

I have also found this: an interesting article regarding garbage collection with caching, which may cause problems in the future if not properly taken care of.
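
For context: functools.lru_cache keeps strong references to its cached values, so cached processor instances (and any models they hold) are not garbage-collected until they are evicted or the cache is cleared. A self-contained demonstration (Model is just a stand-in):

    import functools
    import gc
    import weakref

    class Model:  # stand-in for an expensive, memory-hungry model
        pass

    @functools.lru_cache(maxsize=8)
    def load_model(name):
        return Model()

    ref = weakref.ref(load_model('frk'))
    gc.collect()
    assert ref() is not None  # still alive: the cache holds a strong reference
    load_model.cache_clear()
    gc.collect()
    assert ref() is None      # collectable only after clearing the cache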

#972 should now be ready to be merged.

@bertsky (Collaborator) commented Feb 14, 2023

What I asked for is some additional mechanism which can then be used via Web API for resetting the cache (using get_cached_processor.cache_clear()) and changing the size (using get_cached_processor.cache_info().maxsize = ...).

Providing endpoints for that is not a problem, but isn't this something to be handled internally rather than through the Web API?

I stand corrected. For management of processor instances, no Web API or web anything is necessary. If it's the Processing Worker (in the queue model) or the Processing Server (in the integrated model), that component can just control its internal cache references via Python API.

I am still not sure how .cache_clear() or changing .cache_info().maxsize = ... should be handled by an already running processing worker.

Ideally, it enforces some (preconfigured) max-workers size, and updates the cache whenever instances crash. For GPU workers, the memory consumption must also be watched – but by adding the envvar, you already gave the admin some control.
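
This is roughly how such an envvar can bound the cache; a sketch under the assumption that the variable is called OCRD_MAX_PROCESSOR_CACHE (the actual name used in the PR may differ):

    import os
    from functools import lru_cache

    # Cache size is read once from the environment at import time,
    # giving the admin control over how many instances may be cached:
    MAX_PROCESSOR_CACHE = int(os.environ.get('OCRD_MAX_PROCESSOR_CACHE', '128'))

    @lru_cache(maxsize=MAX_PROCESSOR_CACHE)
    def get_cached_processor(processor_class, frozen_parameter):
        return processor_class(None, parameter=dict(frozen_parameter))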

#972 should now be ready to be merged.

Agreed.

@kba (Member) commented Feb 15, 2023

The CI failure is only on macOS, merging.

@kba merged commit 456f040 into master on Feb 15, 2023
@kba deleted the ref-processor-helper branch on February 15, 2023 11:44