Fix: trigger workload requeue on ElasticJob scale-down #6395
Conversation
Ensure that scaling down an ElasticJob updates the reclaimablePods field in the workload status, allowing the scheduler to detect freed capacity and requeue pending workloads. This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation. Signed-off-by: ichekrygin <[email protected]>
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
gomega.Eventually(func(g gomega.Gomega) {
    testJobA.Spec.Parallelism = ptr.To(int32(1))
    g.Expect(k8sClient.Update(ctx, testJobA)).Should(gomega.Succeed())
}).Should(gomega.Succeed())
Please add a timeout and polling interval; I also noticed they are missing in the old tests, so you could specify them there too.
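For illustration, a minimal sketch of the requested change, assuming the surrounding test's imports and variables (gomega, ptr, k8sClient, ctx, testJobA) plus the controller-runtime client package. The concrete timeout and interval values are assumptions, not the repository's shared util constants, and the re-fetch inside the retry loop is a common pattern I added, not part of the original snippet:

// Assumed values for illustration; the test suite likely has its own shared
// timeout/interval constants that should be preferred.
const (
    timeout  = 10 * time.Second
    interval = 250 * time.Millisecond
)

gomega.Eventually(func(g gomega.Gomega) {
    // Re-fetch before updating so retries also cover resource-version conflicts.
    g.Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(testJobA), testJobA)).Should(gomega.Succeed())
    testJobA.Spec.Parallelism = ptr.To(int32(1))
    g.Expect(k8sClient.Update(ctx, testJobA)).Should(gomega.Succeed())
}, timeout, interval).Should(gomega.Succeed())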
 }
-case prevStatus == workload.StatusAdmitted && status == workload.StatusAdmitted && !equality.Semantic.DeepEqual(e.ObjectOld.Status.ReclaimablePods, e.ObjectNew.Status.ReclaimablePods):
+case prevStatus == workload.StatusAdmitted && status == workload.StatusAdmitted && !equality.Semantic.DeepEqual(e.ObjectOld.Status.ReclaimablePods, e.ObjectNew.Status.ReclaimablePods),
+    features.Enabled(features.ElasticJobsViaWorkloadSlices) && workloadslicing.ScaledDown(workload.ExtractPodSetCountsFromWorkload(e.ObjectOld), workload.ExtractPodSetCountsFromWorkload(e.ObjectNew)):
This looks good, but let me use this as an occasion to understand how scale-down works.
When a user requests a scale-down:
- Do we release some quota during the scale-down before the pods are actually deleted, resulting in pods temporarily running over quota?
- Do we keep more quota than needed until the scale-down is finished?
IIUC, ideally we would gradually release the quota (as the pods terminate and are accounted for in reclaimablePods) until the scale-down is finished. Once finished, we clean up reclaimablePods and update the pod set counts, so that the scaled-down workload looks as if it had been created anew. This property of the workload would make debugging easier.
Let me know if I'm missing something. If you agree but assess that this is more work, I can lgtm / approve this one so that we fix one issue at a time; let me know.
Currently, batch/v1.Job is the only framework that supports the ElasticJobsViaWorkloadSlices feature.
A scale-down is triggered when the user (or automation) updates the Job's spec.parallelism to a lower value. Kueue's handling of this event is entirely reactive. Once the spec.parallelism field is decreased, the Kubernetes Job controller proceeds to complete and remove the excess pods. Kueue, in parallel, observes the change and updates the associated Workload's spec.podSets[].count to reflect the new, lower value.
Since both the Kubernetes Job controller and the Kueue controller respond independently to the same Job change, the pod removal and the quota release happen concurrently. In practice, pod deletion often completes slightly before the Workload update and the subsequent quota release. However, there are no guardrails to enforce strict ordering, nor is there coordination between the two controllers to ensure that quota is only released after the pods are gone.
By the same token, there is no concept of a gradual scale-down in this context. When a Job's parallelism is reduced, for example from 10 to 2, the Kubernetes Job controller deletes the 8 excess pods almost immediately. The deletions happen as fast as the controller and API server allow; there is no stepwise or progressive reduction. So in practice, the scale-down is effectively instantaneous from the controller's point of view.
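To make the detection side concrete, here is a minimal, illustrative sketch of a scale-down check over pod set counts. It is an approximation under assumptions (pod set counts keyed by pod set name), not Kueue's actual workloadslicing.ScaledDown implementation:

// scaledDown reports whether every pod set count in the new workload is at most
// its old count, with at least one strictly smaller (i.e. a genuine scale-down).
func scaledDown(oldCounts, newCounts map[string]int32) bool {
    shrunk := false
    for name, oldCount := range oldCounts {
        newCount, ok := newCounts[name]
        if !ok || newCount > oldCount {
            // A missing or increased pod set is not treated as a scale-down here.
            return false
        }
        if newCount < oldCount {
            shrunk = true
        }
    }
    return shrunk
}

With a predicate like this wired into the workload update handler (as in the diff above), a decrease in spec.parallelism that propagates to the Workload's pod set counts can trigger a requeue of pending workloads.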
I see; actually, the reclaimablePods are only decremented when the terminated pods have Succeeded. We should not make that assumption for pods deleted due to a scale-down: they may end with a zero or non-zero exit code. So I think it is correct not to rely on reclaimablePods.
I see this approach allows pods to run beyond quota for a brief time. Since this is only temporary, I'm not very concerned about it. It would be ideal to release quota gradually, but it might be hard in practice, so we can leave it as a follow-up feature on its own.
Signed-off-by: ichekrygin <[email protected]>
75b9c3a to 905580f
/lgtm
@mimowo: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
LGTM label has been added. Git tree hash: c4ab65448e17b899628541389b1620ac857a3e3d
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ichekrygin, mimowo. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@mimowo: new pull request created: #6407 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Proposal, ptal
…gs#6395)
* Fix: trigger workload requeue on ElasticJob scale-down
Ensure that scaling down an ElasticJob updates the reclaimablePods field in the workload status, allowing the scheduler to detect freed capacity and requeue pending workloads. This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation.
Signed-off-by: ichekrygin <[email protected]>
* Add missing "Eventually" timeouts and retry intervals.
Signed-off-by: ichekrygin <[email protected]>
---------
Signed-off-by: ichekrygin <[email protected]>
Ensure that scaling down an ElasticJob updates the reclaimablePods field in the workload status, allowing the scheduler to detect freed capacity and requeue pending workloads.
This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation.
Which issue(s) this PR fixes:
Fixes #6384
Special notes for your reviewer:
Does this PR introduce a user-facing change?