[CELEBORN-2344] Support to force set serving state via HTTP APIs #3710
s0nskar wants to merge 1 commit into
@s0nskar, could you also add a cli for |
RexXiong left a comment
The use case described (GCP live migration) can already be covered by the existing DecommissionThenIdle + Recommission workflow via the Master's /sendWorkerEvent API:
# Before migration: stop accepting new shuffle slots, drain existing ones
curl -X POST "http://<master>:<port>/sendWorkerEvent" \
-d "type=DecommissionThenIdle" \
-d "workers=<host>:<rpcPort>:<pushPort>:<fetchPort>:<replicatePort>"
# Check if all shuffles have been drained
curl http://<worker>:<port>/isDecommissioning
# ... perform live migration ...
# After migration: bring worker back
curl -X POST "http://<master>:<port>/sendWorkerEvent" \
-d "type=Recommission" \
-d "workers=<host>:<rpcPort>:<pushPort>:<fetchPort>:<replicatePort>"
State flow: Normal → InDecommissionThenIdle → (all shuffles released) → Idle → (Recommission) → Normal.
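The state flow described in this comment can be pictured as a small transition table. This is only an illustrative sketch: the enum and method names are hypothetical, not Celeborn's actual classes, and it models only the transitions named in the flow, ignoring any other event.

```java
// Hypothetical model of the flow
// Normal -> InDecommissionThenIdle -> Idle -> Normal; not Celeborn's real code.
final class WorkerStateFlow {
    enum State { NORMAL, IN_DECOMMISSION_THEN_IDLE, IDLE }
    enum Event { DECOMMISSION_THEN_IDLE, ALL_SHUFFLES_RELEASED, RECOMMISSION }

    // Returns the next state; events not named in the flow leave the state unchanged.
    static State next(State state, Event event) {
        switch (state) {
            case NORMAL:
                return event == Event.DECOMMISSION_THEN_IDLE
                        ? State.IN_DECOMMISSION_THEN_IDLE : state;
            case IN_DECOMMISSION_THEN_IDLE:
                return event == Event.ALL_SHUFFLES_RELEASED ? State.IDLE : state;
            case IDLE:
                return event == Event.RECOMMISSION ? State.NORMAL : state;
            default:
                return state;
        }
    }
}
```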
This approach already maintains the excluded list on the Master side (no new slots allocated), has a proper state machine with Recommission support, and does not interfere with MemoryManager internals.
Regarding the current implementation, the forced state is injected at the top of currentServingState(), which is consumed by switchServingState() along with its side-effect logic (resume replicate, pinned memory handling, pause timing). This creates several issues:
- Safety conflict: If real memory exceeds the replicate threshold (PUSH_AND_REPLICATE_PAUSED) but the forced state is PUSH_PAUSED, switchServingState() will call resumeReplicate() under genuine memory pressure. Conversely, forcing NONE_PAUSED during real memory pressure bypasses all protection and risks OOM.
- Unintended eviction: shouldEvict() checks servingState != NONE_PAUSED. Forcing PUSH_PAUSED for maintenance (no actual memory pressure) would trigger unnecessary memory file eviction.
- Stale isPaused flag: currentServingState() returns early when forced state is active, skipping the isPaused assignment. After the forced state clears, the hysteresis behavior between resume and pause thresholds depends on the stale isPaused value.
If a forced serving state API is still desired beyond what DecommissionThenIdle provides, I'd suggest restricting it to only allow making the state more restrictive (i.e., max(forcedState, realState)), never less restrictive, to avoid overriding memory safety mechanisms.
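A minimal sketch of the max(forcedState, realState) idea, assuming the states can be ordered by restrictiveness as NONE_PAUSED < PUSH_PAUSED < PUSH_AND_REPLICATE_PAUSED. The class, the enum ordering, and the method name are hypothetical and are not taken from MemoryManager.

```java
import java.util.Optional;

// Hypothetical helper; not part of Celeborn's MemoryManager.
final class ServingStateClamp {
    // Assumption: declaration order encodes restrictiveness, least to most.
    enum ServingState { NONE_PAUSED, PUSH_PAUSED, PUSH_AND_REPLICATE_PAUSED }

    // A forced state may only tighten the state computed from real memory
    // usage; it can never loosen it.
    static ServingState effective(ServingState real, Optional<ServingState> forced) {
        if (forced.isEmpty()) {
            return real;
        }
        ServingState f = forced.get();
        // Enum.compareTo compares ordinals, i.e. declaration order.
        return f.compareTo(real) > 0 ? f : real;
    }
}
```

With this rule, forcing PUSH_PAUSED while real memory pressure is at PUSH_AND_REPLICATE_PAUSED still yields PUSH_AND_REPLICATE_PAUSED, and forcing NONE_PAUSED has no effect.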
Reviewed with Claude Code
What changes were proposed in this pull request?
Added /servingState, which supports the GET and POST HTTP methods.
GET will just return the current serving state.
POST can be used to force override the serving state of a worker. It takes two params: state (the override for the serving state) and timeout (after which the overridden state is cleared). If timeout is not present, the forced state will not clear unless someone overrides it with an empty state.
This helps with live migration scenarios and other cases where we don't want the worker to receive new data but still want to keep it running. It can also be used to force-unpause the worker.
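The override semantics described above (a state, an optional timeout, and clearing via an empty state) could look roughly like this. It is only a sketch and not the code in this PR: the class name, the use of a String for the state, the millisecond unit, and the injectable clock are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.LongSupplier;

// Hypothetical holder for a forced serving state; not the PR's implementation.
final class ForcedStateHolder {
    private final LongSupplier clockMillis; // injectable clock (assumption, eases testing)
    private String forcedState;             // null means no override is active
    private long expiresAtMillis;           // Long.MAX_VALUE means the override never expires

    ForcedStateHolder(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    // A null or empty state clears the override; a null timeout means "no expiry".
    synchronized void force(String state, Long timeoutMillis) {
        if (state == null || state.isEmpty()) {
            forcedState = null;
            return;
        }
        forcedState = state;
        expiresAtMillis = timeoutMillis == null
                ? Long.MAX_VALUE
                : clockMillis.getAsLong() + timeoutMillis;
    }

    // Returns the active override, dropping it first if its timeout has passed.
    synchronized Optional<String> current() {
        if (forcedState != null && clockMillis.getAsLong() >= expiresAtMillis) {
            forcedState = null;
        }
        return Optional.ofNullable(forcedState);
    }
}
```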
Why are the changes needed?
This can be used for planned maintenance of a worker, or in cases where a worker is degraded or under high load but not under high memory pressure. It can also be used to force-resume a worker, which can be useful in cases like #3696.
We are using this specifically during GCP live migration – https://docs.cloud.google.com/compute/docs/instances/live-migration-process
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
How was this patch tested?
Tested in our dev setup