Description
Prework
- Read and agree to the code of conduct and contributing guidelines.
- If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
- New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
- Format your code according to the tidyverse style guide.
Proposal
Since its inception, targets has only ever supported fully transient or fully persistent workers. This has mostly worked until now because initialization, monitoring, and idle time do not have terrible consequences on traditional clusters. But the cloud will be totally different. Requesting jobs, initializing Docker images, and communicating with AWS/GCP will all take a lot more time, which means fully transient workers will be inefficient (see futureverse/future#567). At the other extreme, persistent workers on a nontrivial DAG would spend a wasteful amount of time idling, driving up monetary costs.
There are fancy hybrid approaches such as snakemake job grouping, but they require a lot of manual work from the user, and dynamic branching makes it hard to ensure proper load balancing in advance.
What I have in mind is similar to what I proposed in mschubert/clustermq#257 (comment):
- Start by submitting an array job of a certain user-specified size, not necessarily the maximum size.
- If more work is requested with $send_call() and the number of currently busy workers is less than the user-specified maximum, then initialize a new worker for the new job.
- If a worker idles for long enough (i.e. receives nothing or only $send_wait() for some length of time) then shut down that worker.
(And of course, the implicit step 4 is to shut down idle workers when there is no more work to submit.)
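The steps above can be sketched as a toy model. This is only an illustration in Python, not the targets or clustermq API: the `Pool` and `Worker` classes, their fields, and the `reap()` method are hypothetical names, and no real jobs are launched. It also assumes that an idle worker is reused before a new one is started, which the steps above do not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for one semitransient worker."""
    busy: bool = False
    idle_since: float = 0.0


@dataclass
class Pool:
    """Toy model of the scaling heuristic; nothing is actually submitted."""
    initial: int          # size of the first array job
    maximum: int          # user-specified cap on the number of workers
    idle_timeout: float   # seconds a worker may idle before shutdown
    workers: list = field(default_factory=list)

    def start(self, now: float) -> None:
        # Step 1: submit an initial batch, not necessarily the maximum size.
        self.workers = [Worker(idle_since=now) for _ in range(self.initial)]

    def send_call(self, now: float) -> bool:
        # Assumption: prefer an idle worker over launching a new one.
        for worker in self.workers:
            if not worker.busy:
                worker.busy = True
                return True
        # Step 2: every worker is busy, so launch one if under the cap.
        if len(self.workers) < self.maximum:
            self.workers.append(Worker(busy=True, idle_since=now))
            return True
        return False  # at capacity; the caller has to wait

    def finish(self, worker: Worker, now: float) -> None:
        # The worker returned its result and starts idling.
        worker.busy = False
        worker.idle_since = now

    def reap(self, now: float, work_remaining: bool = True) -> int:
        # Step 3: shut down workers that have idled for idle_timeout or longer.
        # Step 4: with no work left, shut down every idle worker.
        keep = [
            w for w in self.workers
            if w.busy
            or (work_remaining and now - w.idle_since < self.idle_timeout)
        ]
        closed = len(self.workers) - len(keep)
        self.workers = keep
        return closed
```

With `idle_timeout=float("inf")`, `reap()` closes nothing until `work_remaining` is `False`, which matches the remark that the heuristic reduces to the deterministic approach when idle time is infinite.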
This is what @mattwarkentin called the "heuristic" approach in mschubert/clustermq#257 (reply in thread) (which trivially reduces to the "deterministic" approach if idle time is infinite). The same feature is under discussion for clustermq.
To implement this, I picture a separate package to manage a dynamic collection of semitransient workers. The look and feel should be like a clustermq::workers() object: it should support message-passing OOP (R6-like), and it should spend as little time as possible blocking the main process. However, it should sit a layer higher in the stack than clustermq or future. A subclass of a workers class may rely on clustermq or future in the backend, or maybe something entirely custom. This freedom should open up workarounds to increase efficiency: for some subclasses, maybe transient local background processes to launch and/or poll workers.
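A rough sketch of that layering, again in Python and only as an illustration: a base class owns the bookkeeping, and each subclass supplies the backend-specific launch and terminate steps. The class names `Workers` and `LocalProcessWorkers` and every method here are hypothetical; a real subclass might delegate to clustermq or future instead of local processes.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class Workers(ABC):
    """Hypothetical base class: bookkeeping here, launching in subclasses."""

    def __init__(self, maximum: int):
        self.maximum = maximum
        self.handles = []

    @abstractmethod
    def _launch(self):
        """Start one worker without blocking and return a handle to it."""

    @abstractmethod
    def _terminate(self, handle) -> None:
        """Shut down the worker behind the given handle."""

    def scale_up(self, n: int = 1) -> int:
        # Never exceed the user-specified maximum.
        n = max(0, min(n, self.maximum - len(self.handles)))
        for _ in range(n):
            self.handles.append(self._launch())
        return n

    def shutdown(self) -> None:
        for handle in self.handles:
            self._terminate(handle)
        self.handles = []


class LocalProcessWorkers(Workers):
    """Example backend: transient local background processes."""

    def _launch(self):
        # Popen returns immediately, so the main process is not blocked.
        # The child just sleeps as a placeholder for a real worker loop.
        return subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(30)"]
        )

    def _terminate(self, handle) -> None:
        handle.terminate()
        handle.wait()
```

Keeping the cap and the list of handles in the base class means a new backend only has to implement the two hooks.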
I would like to overhaul the workers package for this.