utam0k
Member

@utam0k utam0k commented Aug 13, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR adds filtering for Node update events in the TAS (Topology-Aware Scheduling) controller to skip reconciliation when only LastHeartbeatTime is updated.

Previously, the TAS controller would trigger reconciliation for every Node update, including the periodic heartbeat updates.

Which issue(s) this PR fixes:

Ref: #6551

Special notes for your reviewer:

Does this PR introduce a user-facing change?

TAS: Fix a bug where new Workloads starve, caused by inadmissible workloads frequently requeueing due to unrelated Node LastHeartbeatTime update events.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Aug 13, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 13, 2025

netlify bot commented Aug 13, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 04fa579
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68a70e6829125600088154b0

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 13, 2025
@utam0k utam0k force-pushed the tas-filter-events branch from 53831d4 to 250b44d Compare August 13, 2025 12:57
@amy
Contributor

amy commented Aug 16, 2025

/assign

// checkNodeSchedulingPropertiesChanged checks if the node update affects TAS scheduling.
func checkNodeSchedulingPropertiesChanged(oldNode, newNode *corev1.Node) bool {
	oldCopy := oldNode.DeepCopy()
	newCopy := newNode.DeepCopy()
Contributor

nit: do we need to deepcopy the entire node? Wondering if there's a subset of the node that scheduling cares about.

@amy
Contributor

amy commented Aug 16, 2025

Added a nit here: #6570 (comment)
Approving because it's a nit. I'll let the approvers assess the nit in case my assessment is wrong. Thank you!

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 16, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 2f8a14316703c502dc7c05e70f3ab3e332300fbc

@amy
Contributor

amy commented Aug 18, 2025

/assign @tenzen-y

Member

@tenzen-y tenzen-y left a comment

Thank you for working on this.

Comment on lines 84 to 89
newNode: func() *corev1.Node {
	n := baseNode.DeepCopy()
	n.Status.Conditions[0].LastHeartbeatTime = later
	n.Status.Conditions[1].LastHeartbeatTime = later
	return n
}(),
Member

Instead of anonymous functions, we can clone the base node and chain new parameters.

@tenzen-y
Member

/release-note-edit

TAS: Fix a bug where new Workloads starve, caused by inadmissible workloads frequently requeueing due to unrelated Node LastHeartbeatTime update events.

This was referenced Aug 20, 2025
@utam0k utam0k force-pushed the tas-filter-events branch from 250b44d to 362d96e Compare August 21, 2025 02:35
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 21, 2025
@k8s-ci-robot k8s-ci-robot requested review from amy and tenzen-y August 21, 2025 02:35
},
// Handle resource.Quantity comparison to avoid panic on unexported fields
func(a, b resource.Quantity) bool {
	return a.Cmp(b) == 0
Member

Suggested change
-	return a.Cmp(b) == 0
+	return a.Equal(b)

We have an Equal function.

Member

@utam0k This is still not fixed.

Member Author

🙏


for name, tc := range testCases {
	t.Run(name, func(t *testing.T) {
		got := checkNodeSchedulingPropertiesChanged(tc.oldNode, tc.newNode)
Member

@tenzen-y tenzen-y Aug 21, 2025

Oh, I got it. Ideally, we want to call the Update function here.
But we would need to implement a fake workqueue RateLimiter for that.

So, I'm OK with testing checkNodeSchedulingPropertiesChanged directly, since I think implementing a fake workqueue is extra work beyond the scope of this PR.

Member Author

Should I create the issue before creating the extra PR?

Member

If you can create an issue, I would appreciate that.
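For the follow-up discussed in this thread, a fake queue that only records what was enqueued might look roughly like the sketch below. Everything here is hypothetical: `Queue`, `fakeQueue`, `nodeUpdate`, and `handleUpdate` are simplified stand-ins, not the controller-runtime or client-go workqueue interfaces, and the real handler signature is not shown in this conversation.

```go
package main

import "fmt"

// Queue is a hypothetical, minimal stand-in for the rate-limiting workqueue
// that an event handler's Update method would receive.
type Queue interface {
	Add(item string)
}

// fakeQueue records every added item so a test can assert on it.
type fakeQueue struct {
	added []string
}

func (q *fakeQueue) Add(item string) {
	q.added = append(q.added, item)
}

// nodeUpdate is a simplified update event carrying only what the sketch needs.
type nodeUpdate struct {
	name              string
	schedulingChanged bool
}

// handleUpdate enqueues a request only when the update is relevant to
// scheduling; heartbeat-only updates are dropped (hypothetical handler).
func handleUpdate(ev nodeUpdate, q Queue) {
	if !ev.schedulingChanged {
		return
	}
	q.Add(ev.name)
}

func main() {
	q := &fakeQueue{}
	handleUpdate(nodeUpdate{name: "node-1", schedulingChanged: false}, q)
	handleUpdate(nodeUpdate{name: "node-2", schedulingChanged: true}, q)
	fmt.Println(q.added) // [node-2]
}
```

With a fake like this, a test could exercise the handler end to end and assert on the recorded items instead of only testing the comparison helper.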

"Node Ready status changed": {
	oldNode: baseNode.Clone().Obj(),
	newNode: baseNode.Clone().
		ConditionStatus(corev1.NodeReady, corev1.ConditionFalse, later).
Member

Suggested change
-		ConditionStatus(corev1.NodeReady, corev1.ConditionFalse, later).
+		StatusConditions(
+			corev1.NodeCondition{
+				Type:               corev1.NodeReady,
+				Status:             corev1.ConditionFalse,
+				LastHeartbeatTime:  later,
+				LastTransitionTime: later,
+			},
+		).

We want to remove the StatusConditions call from baseNode and call it in every test case instead.
We follow the DAMP principle to make tests as descriptive as possible.

Note: The ConditionHeartbeat is still better since it overwrites only one field and is less programmable.

Comment on lines 48 to 59
corev1.NodeCondition{
	Type:               corev1.NodeReady,
	Status:             corev1.ConditionTrue,
	LastHeartbeatTime:  now,
	LastTransitionTime: now,
},
corev1.NodeCondition{
	Type:               corev1.NodeMemoryPressure,
	Status:             corev1.ConditionFalse,
	LastHeartbeatTime:  now,
	LastTransitionTime: now,
},
Member

Do we need multiple conditions?

		TimeAdded: &later,
	}).Obj(),
	wantChanged: true,
},
Member

@tenzen-y tenzen-y Aug 21, 2025

Additionally, I want to add a case where oldNode has a nil TimeAdded and newNode has a proper TimeAdded.

@utam0k utam0k force-pushed the tas-filter-events branch from 362d96e to 985ea92 Compare August 21, 2025 12:00
Comment on lines 131 to 141
// ConditionStatus updates the Status and LastTransitionTime of an existing condition.
func (n *NodeWrapper) ConditionStatus(conditionType corev1.NodeConditionType, status corev1.ConditionStatus, transitionTime metav1.Time) *NodeWrapper {
	for i := range n.Status.Conditions {
		if n.Status.Conditions[i].Type == conditionType {
			n.Status.Conditions[i].Status = status
			n.Status.Conditions[i].LastTransitionTime = transitionTime
			break
		}
	}
	return n
}
Member

Suggested change: delete the ConditionStatus method quoted above.

This is no longer used anywhere.

@tenzen-y
Member

@utam0k Remaining unresolved tasks are #6570 (comment) and #6570 (comment).
Otherwise, LGTM

@utam0k utam0k force-pushed the tas-filter-events branch from 985ea92 to 04fa579 Compare August 21, 2025 12:17
@utam0k
Member Author

utam0k commented Aug 21, 2025

> @utam0k Remaining unresolved tasks are #6570 (comment) and #6570 (comment). Otherwise, LGTM

Thanks for your quick review 🙏

Member

@tenzen-y tenzen-y left a comment

Thank you!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 21, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 855c723ab1cd9737637a48024c0f32c0ed1a71b3

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amy, tenzen-y, utam0k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2025
@tenzen-y
Member

/cherry-pick release-0.12
/cherry-pick release-0.13

@k8s-infra-cherrypick-robot
Contributor

@tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.12, release-0.13 in new PRs and assign them to you.

In response to this:

/cherry-pick release-0.12
/cherry-pick release-0.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot merged commit 44acfa4 into kubernetes-sigs:main Aug 21, 2025
22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.14 milestone Aug 21, 2025
@k8s-infra-cherrypick-robot
Contributor

@tenzen-y: new pull request created: #6636

@k8s-infra-cherrypick-robot
Contributor

@tenzen-y: new pull request created: #6637
