Description
In the current Ray Data / autoscaler behavior, when an upstream operator (op) finishes processing,
its resources are scaled down incrementally, one step at a time. This makes scale-down very slow,
so downstream ops cannot quickly acquire the resources they need to process data, which leads to
pipeline stalls and underutilization.
Problem
- Upstream op scale-down is very slow (step-wise).
- Downstream ops wait a long time before they can get enough resources.
- Overall pipeline throughput suffers because of this delayed resource handoff.
Proposal
Once an op has fully finished processing its data, release all resources occupied by that op
in a single shot, instead of scaling them down incrementally per step. This would:
- Allow downstream ops to immediately claim freed resources.
- Reduce end-to-end latency of the pipeline.
- Simplify the resource handoff logic between ops.
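To make the difference concrete, here is a toy model of the handoff delay. It does not use any Ray API. The function name, the "tick" abstraction, and the `step` parameter are assumptions made up for illustration, and the actual autoscaler's step size and cadence may differ.

```python
import math
from typing import Optional


def ticks_until_available(held: int, needed: int, step: Optional[int]) -> int:
    """Return how many scale-down ticks pass before `needed` units are free.

    Hypothetical model, not Ray's implementation:
    - `held` is the number of resource units the finished upstream op holds.
    - `needed` is how many of those units the downstream op is waiting for.
    - `step` is the number of units released per tick (step-wise scale-down);
      `step=None` models the proposed one-shot release.
    """
    if needed > held:
        raise ValueError("downstream needs more than the upstream op holds")
    if needed <= 0:
        return 0
    if step is None:
        # One-shot: everything is released on the first tick after completion.
        return 1
    if step <= 0:
        raise ValueError("step must be positive")
    # Step-wise: `step` units are freed per tick until `needed` is covered.
    return math.ceil(needed / step)


# Example: upstream holds 16 units, downstream needs all 16.
stepwise_ticks = ticks_until_available(held=16, needed=16, step=2)
one_shot_ticks = ticks_until_available(held=16, needed=16, step=None)
print(stepwise_ticks, one_shot_ticks)
```

Under these assumptions the step-wise wait grows linearly with the number of units the downstream op needs, while the one-shot wait stays constant.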
Question
Is there a specific reason the current design prefers step-wise scale-down over one-shot release
after an op completes? Would the community be open to changing this behavior (or making it
configurable)?
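If a configurable option is preferred, it could look roughly like the sketch below. `ScaleDownPolicy` and `units_to_release` are hypothetical names invented for this issue, not existing Ray Data settings, and the default would stay step-wise so current behavior is unchanged.

```python
from enum import Enum


class ScaleDownPolicy(Enum):
    """Hypothetical policy switch; not an existing Ray Data option."""

    STEPWISE = "stepwise"  # current behavior as described in this issue
    ONE_SHOT = "one_shot"  # proposed behavior for finished ops


def units_to_release(
    policy: ScaleDownPolicy, held: int, finished: bool, step: int = 1
) -> int:
    """Decide how many resource units an op gives back on this tick.

    `held` is the number of units the op currently holds, `finished` says
    whether the op has fully processed its data, and `step` is the assumed
    per-tick decrement for step-wise scale-down.
    """
    if held <= 0:
        return 0
    if policy is ScaleDownPolicy.ONE_SHOT and finished:
        # Release everything at once, but only after the op has completed.
        return held
    # Otherwise fall back to incremental scale-down.
    return min(step, held)


# A finished op holding 8 units under each policy:
print(units_to_release(ScaleDownPolicy.STEPWISE, held=8, finished=True, step=2))
print(units_to_release(ScaleDownPolicy.ONE_SHOT, held=8, finished=True, step=2))
```

Gating the one-shot path on `finished` keeps the gradual behavior for ops that are still running, so only completed ops would be affected.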
Use case
No response