[Feature | Orchestration] Optimize C call overhead away, update pipeline, optimize CPU transient residency #348
Conversation
- Apply `gpu` transformation by default on GPU backend
- Do NOT use memory pool for CPU
- Use `DaceExecutable` in orchestration
Bringing back to draft: the hashing system operates under the assumption that the types given to the orchestrated code are NDSL's OR trivially hashable. I'll introduce a system that deactivates the hashing if we encounter a type that is neither, and warns once about the performance degradation.
Done. Ready for review.
romanc
left a comment
Nice work! I think your assumptions are sound. Just a couple of nitpicks and questions inline.
```python
    stacklevel=2,
)
self.arguments = None  # Flush arguments to force recompute
self._skip_hash = True  # Skip future checks
```
If we call it once with non-hashable stuff and afterwards always with hashable arguments, `_skip_hash` never gets reset, right?
Correct, you are back in the safe zone: the cached arguments are always `None`, so you always do the marshalling.
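To make that behavior concrete, here is a minimal sketch of the caching strategy under discussion. The names `arguments` and `_skip_hash` follow the diff above; the class, `_marshal`, and the exact cache layout are hypothetical stand-ins, not the actual NDSL implementation:

```python
import warnings

class OrchestratedCallable:
    """Hypothetical sketch: cache marshalled arguments under a hash key;
    the first non-hashable input disables caching for good."""

    def __init__(self):
        self.arguments = None    # (hash, marshalled) cache, or None
        self._skip_hash = False  # set once non-hashable inputs are seen

    def _marshal(self, args):
        # Stand-in for the real Python-object-to-C-pointer marshalling.
        return tuple(repr(a) for a in args)

    def __call__(self, *args):
        if self._skip_hash:
            # Safe zone: the cache stays None, marshalling runs every call.
            return self._marshal(args)
        try:
            key = hash(args)
        except TypeError:
            warnings.warn(
                "Non-hashable arguments passed to orchestrated code: "
                "argument caching disabled, expect marshalling overhead.",
                stacklevel=2,
            )
            self.arguments = None   # Flush arguments to force recompute
            self._skip_hash = True  # Skip future checks
            return self._marshal(args)
        if self.arguments is None or self.arguments[0] != key:
            self.arguments = (key, self._marshal(args))
        return self.arguments[1]
```

This matches the review exchange: once `_skip_hash` is set, the cached arguments stay `None` and every subsequent call (hashable or not) goes through marshalling again.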
```python
if config.get_backend() == "dace:cpu_kfirst":
    passes.extend(
        [
            CleanUpScheduleTree(),
```
At some point it might make sense to write full pipelines for backends that can be fetched from just the backend name, but that is not for now.
Yeah, I think one of the things a better Backend concept would do is carry its default optimizations!
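As a rough sketch of that idea, the backend string could key into a registry of default pass pipelines. `CleanUpScheduleTree` comes from the diff above; the registry and the function are hypothetical illustrations, not existing API:

```python
class CleanUpScheduleTree:
    """Placeholder for the real schedule-tree clean-up pass."""

def default_pipeline(backend: str) -> list:
    """Hypothetical lookup: fetch a backend's default optimization
    passes from just the backend name."""
    pipelines = {
        "dace:cpu_kfirst": [CleanUpScheduleTree()],
        # other backends would carry their own default optimizations
    }
    return pipelines.get(backend, [])
```

The call site in the diff would then shrink to `passes.extend(default_pipeline(config.get_backend()))`, with no per-backend branching.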
Description
DaCe orchestration was plagued by slow integration due to the routine marshalling of Python objects (arrays passed as arguments, but also the closure) into C-binding-ready pointers for calling into the C library.
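For illustration, this is the kind of per-call marshalling in question, sketched with stdlib `ctypes` (the function and its signature are hypothetical, not the actual DaCe binding code):

```python
import ctypes

def marshal_doubles(values):
    """Hypothetical illustration: turn a Python list of floats into a
    C-ready double* plus length -- the per-call work that argument
    caching is meant to avoid repeating."""
    n = len(values)
    c_array = (ctypes.c_double * n)(*values)  # copies into C memory
    return ctypes.cast(c_array, ctypes.POINTER(ctypes.c_double)), n
```

Doing this for every array argument and every closure variable on every call is what made the original integration slow.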
We originally cached everything at first call, but that led to instability: any re-allocation or different argument passed to the same program silently failed, and we reverted it.
This PR introduces proper argument hashing that reduces the overhead to a negligible impact on runtime while keeping stability for changing arguments. The hypothesis goes as follows:
This PR also updates the orchestration pipeline:
- `dace:cpu_KJI` (feat[cartesian]: Layout & Schedule pairing for `dace:X` GridTools/gt4py#2426)
- `dace` auto-optimizer by default

How has this been tested?
Unit tests and on the microphysics benchmark conducted with GEOS.
Checklist