Fix: trigger workload requeue on ElasticJob scale-down #6395

ichekrygin · 2025-08-04T06:42:05Z

Ensure that scaling down an ElasticJob updates the reclaimablePods field in the workload status, allowing the scheduler to detect freed capacity and requeue pending workloads.

This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation.

What type of PR is this?

/kind bug

What this PR does / why we need it:

This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation.

Which issue(s) this PR fixes:

Fixes #6384

Special notes for your reviewer:

Does this PR introduce a user-facing change?

ElasticJobs: Fix the bug that scheduling of the Pending workloads was not triggered on scale-down of the running 
elastic Job which could result in admitting one or more of the queued workloads.

Ensure that scaling down an ElasticJob updates the reclaimablePods field in the workload status, allowing the scheduler to detect freed capacity and requeue pending workloads. This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation. Signed-off-by: ichekrygin <[email protected]>

netlify · 2025-08-04T06:42:11Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
🔨 Latest commit	`905580f`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68905b717aa54700080b088b

mimowo · 2025-08-04T06:44:37Z

test/integration/singlecluster/controller/jobs/job/job_controller_test.go

+				gomega.Eventually(func(g gomega.Gomega) {
+					testJobA.Spec.Parallelism = ptr.To(int32(1))
+					g.Expect(k8sClient.Update(ctx, testJobA)).Should(gomega.Succeed())
+				}).Should(gomega.Succeed())


Please add some timeout, I also noticed they are missing in the old tests, so you could specify them too

mimowo · 2025-08-04T07:04:02Z

pkg/controller/core/workload_controller.go

 		}
-	case prevStatus == workload.StatusAdmitted && status == workload.StatusAdmitted && !equality.Semantic.DeepEqual(e.ObjectOld.Status.ReclaimablePods, e.ObjectNew.Status.ReclaimablePods):
+	case prevStatus == workload.StatusAdmitted && status == workload.StatusAdmitted && !equality.Semantic.DeepEqual(e.ObjectOld.Status.ReclaimablePods, e.ObjectNew.Status.ReclaimablePods),
+		features.Enabled(features.ElasticJobsViaWorkloadSlices) && workloadslicing.ScaledDown(workload.ExtractPodSetCountsFromWorkload(e.ObjectOld), workload.ExtractPodSetCountsFromWorkload(e.ObjectNew)):


This looks good, but let me use this as an occasion to get understanding on how scale down works.

When a user requests scale down:

Do we release some quota during the scaledown before the pods are actually deleted, resulting in temporarily pods running over quota?

Do we keep more quota than needed until the scale up is finished?

IIUC ideally we would gradually release the quota (as the pods terminate and are accounted in reclaimablePods) until the scale down is finished. Once finished we cleanup reclaimablePods and update the podsets count, so that the scaled down workload looks as if it was created new. This workload property would make debugging easier.

Let me know if I'm potentially missing something. If you agree, but assess this is more work I can lgtm / approve this one, so that we fix one issue at the time, let me know.

Currently, batch/v1.Job is the only framework that supports the ElasticJobsViaWorkloadSlices feature.
A scale-down is triggered when the user (or automation) updates the Job's spec.parallelism to a lower value. Kueue’s handling of this event is entirely reactionary. Once the spec.parallelism field is decreased, the Kubernetes Job controller proceeds to complete and remove the excess pods. Kueue, in parallel, observes the change and updates the associated Workload’s spec.podSets[].count to reflect the new, lower value.

Since both the K8s Job controller and the Kueue controller respond independently to the same job change, the pod removal and quota release happen concurrently. In practice, pod deletion often completes slightly before the Workload update and subsequent quota release. However, there are no guardrails to enforce strict ordering, nor is there coordination between the two controllers to ensure that quota is only released after the pods are gone.

By the same token, there's no concept of gradual scale-down in this context. When a Job’s parallelism is reduced, for example, from 10 to 2, the Kubernetes Job controller will proceed to delete the 8 excess pods almost immediately. The deletions happen as fast as the controller and API server allow, and there's no stepwise or progressive reduction. So in practice, the scale-down is effectively instantaneous from the controller's point of view.

I see, actually the reclaimablePods are only decremented when the terminated pods are Succeeded. We should not make the assumption for pods deleted due to scale down. They may end in 0 or non zero exit code. So, I think it is correct not to rely on reclaimablePods.

I see this approach allows for pods running at the same beyond quota for a brief time. Since this is temporary only, I'm not very concerned about it. It would be ideal to release quota gradually, but it might be hard in practice, so we can leave it as a follow up feature on its own.

Signed-off-by: ichekrygin <[email protected]>

mimowo · 2025-08-04T07:58:38Z

/lgtm
/approve
/cherrypick release-0.13
Thanks 👍

k8s-infra-cherrypick-robot · 2025-08-04T07:58:41Z

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.13 in a new PR and assign it to you.

In response to this:

/lgtm
/approve
/cherrypick release-0.13
Thanks 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-08-04T07:58:46Z

LGTM label has been added.

Git tree hash: c4ab65448e17b899628541389b1620ac857a3e3d

k8s-ci-robot · 2025-08-04T07:58:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ichekrygin, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mimowo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-infra-cherrypick-robot · 2025-08-04T08:38:25Z

@mimowo: new pull request created: #6407

In response to this:

/lgtm
/approve
/cherrypick release-0.13
Thanks 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mimowo · 2025-08-04T08:42:07Z

Proposal, ptal
/release-note-edit

ElasticJobs: Fix the bug that scheduling of the Pending workloads was not triggered on scale-down of the running 
elastic Job which could result in admitting one or more of the queued workloads.

…gs#6395) * Fix: trigger workload requeue on ElasticJob scale-down Ensure that scaling down an ElasticJob updates the reclaimablePods field in the workload status, allowing the scheduler to detect freed capacity and requeue pending workloads. This resolves a bug where pending jobs remained unadmitted even after capacity was released by a scale-down operation. Signed-off-by: ichekrygin <[email protected]> * Add missing "Eventually" timeouts and retry intervals. Signed-off-by: ichekrygin <[email protected]> --------- Signed-off-by: ichekrygin <[email protected]>

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 4, 2025

k8s-ci-robot requested review from kannon92 and mbobrovskyi August 4, 2025 06:42

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 4, 2025

mimowo reviewed Aug 4, 2025

View reviewed changes

Add missing "Eventually" timeouts and retry intervals.

905580f

Signed-off-by: ichekrygin <[email protected]>

ichekrygin force-pushed the job-scale-down-fix branch from 75b9c3a to 905580f Compare August 4, 2025 07:04

k8s-ci-robot assigned mimowo Aug 4, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2025

k8s-ci-robot merged commit f9ff9f9 into kubernetes-sigs:main Aug 4, 2025
22 checks passed

k8s-ci-robot added this to the v0.14 milestone Aug 4, 2025

k8s-infra-cherrypick-robot mentioned this pull request Aug 4, 2025

[release-0.13] Fix: trigger workload requeue on ElasticJob scale-down #6407

Merged

ichekrygin deleted the job-scale-down-fix branch August 4, 2025 15:40

mimowo mentioned this pull request Sep 30, 2025

Release v0.14.0 #6756

Closed

36 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: trigger workload requeue on ElasticJob scale-down #6395

Fix: trigger workload requeue on ElasticJob scale-down #6395

ichekrygin commented Aug 4, 2025 •

edited by k8s-ci-robot

Loading

Uh oh!

netlify bot commented Aug 4, 2025 •

edited

Loading

Uh oh!

mimowo Aug 4, 2025

Uh oh!

mimowo Aug 4, 2025 •

edited

Loading

Uh oh!

ichekrygin Aug 4, 2025

Uh oh!

mimowo Aug 4, 2025 •

edited

Loading

Uh oh!

mimowo commented Aug 4, 2025

Uh oh!

k8s-infra-cherrypick-robot commented Aug 4, 2025

Uh oh!

k8s-ci-robot commented Aug 4, 2025

Uh oh!

k8s-ci-robot commented Aug 4, 2025

Uh oh!

Uh oh!

k8s-infra-cherrypick-robot commented Aug 4, 2025

Uh oh!

mimowo commented Aug 4, 2025

Uh oh!

Uh oh!

Fix: trigger workload requeue on ElasticJob scale-down #6395

Fix: trigger workload requeue on ElasticJob scale-down #6395

Conversation

ichekrygin commented Aug 4, 2025 • edited by k8s-ci-robot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

netlify bot commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Uh oh!

mimowo Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

mimowo Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ichekrygin Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

mimowo Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mimowo commented Aug 4, 2025

Uh oh!

k8s-infra-cherrypick-robot commented Aug 4, 2025

Uh oh!

k8s-ci-robot commented Aug 4, 2025

Uh oh!

k8s-ci-robot commented Aug 4, 2025

Uh oh!

Uh oh!

k8s-infra-cherrypick-robot commented Aug 4, 2025

Uh oh!

mimowo commented Aug 4, 2025

Uh oh!

Uh oh!

ichekrygin commented Aug 4, 2025 •

edited by k8s-ci-robot

Loading

netlify bot commented Aug 4, 2025 •

edited

Loading

mimowo Aug 4, 2025 •

edited

Loading

mimowo Aug 4, 2025 •

edited

Loading