Inject pod cgroup creation and deletion in the Kubelet #29049
Conversation
Branch updated from 5d0416a to 511cd28.
@Random-Liu @derekwaynecarr This PR is ready for review.
pkg/kubelet/kubelet.go (outdated)
```go
	maxImagesInNodeStatus = 50

	// podCgroupNamePrefix is the prefix of all pod cgroup names
	podCgroupNamePrefix = "pod#"
```
It may not be a good idea to define the same constant twice in different places. :)
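For illustration, a minimal sketch of keeping the prefix in a single shared place (the package name and location are assumptions, not the PR's actual layout):

```go
// Hypothetical shared location for cgroup-naming constants, so the kubelet
// and the container manager both import the prefix from one place instead
// of redefining it.
package cm

// PodCgroupNamePrefix is the prefix of all pod cgroup names.
const PodCgroupNamePrefix = "pod#"
```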
+1
@vishh @dchen1107 - please confirm this is out of the 1.4 milestone? What's the motivation to do it in 1.4 versus waiting for 1.5? Who is really going to run with this enabled at this time?
```go
	}
	for i := range dirInfo {
		if dirInfo[i].IsDir() && strings.HasPrefix(dirInfo[i].Name(), podCgroupNamePrefix) {
			podUID := strings.TrimPrefix(dirInfo[i].Name(), podCgroupNamePrefix)
```
can this go into a utility function of some kind so it can be unit tested?
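As a rough sketch of that suggestion, the directory scan could live in a small helper so the matching and trimming logic can be exercised in a unit test against a temp directory; the name and signature below are made up for illustration:

```go
package cm

import (
	"os"
	"strings"
)

// podCgroupUIDs returns the pod UIDs encoded in pod cgroup directory names
// under a parent cgroup. Hypothetical helper extracted from the kubelet so
// the scan can be unit tested in isolation.
func podCgroupUIDs(dirInfo []os.FileInfo, prefix string) []string {
	var uids []string
	for i := range dirInfo {
		if dirInfo[i].IsDir() && strings.HasPrefix(dirInfo[i].Name(), prefix) {
			uids = append(uids, strings.TrimPrefix(dirInfo[i].Name(), prefix))
		}
	}
	return uids
}
```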
As of now, this PR is not even ready since the pod level cgroups are
```go
	}
	return kl.containerRuntime.KillPod(pod, p, gracePeriodOverride)
	// cache the pod cgroup Name for reducing the cpu resource limits of the pod cgroup once the pod is killed
	pcm := kl.containerManager.NewPodContainerManager()
```
doesn't this logic need to be protected by pod level cgroups being enabled?
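A minimal sketch of the kind of guard being asked about, assuming the kubelet carries the CgroupsPerQOS setting on a field (the field name and placement are assumptions, not the PR's actual code):

```go
// Hypothetical guard: only do pod-cgroup bookkeeping when the experimental
// per-QoS/pod cgroup hierarchy is enabled.
if kl.cgroupsPerQOS { // assumed field mirroring the CgroupsPerQOS config value
	// cache the pod cgroup name so its cpu limits can be reduced once the pod is killed
	pcm := kl.containerManager.NewPodContainerManager()
	_ = pcm // limit handling elided in this sketch
}
return kl.containerRuntime.KillPod(pod, p, gracePeriodOverride)
```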
@vishh - I want the feature, and the PR helps me in my own work; the concern was whether the call points are all vetted based on the presence of the flag. It wasn't clear to me in the current PR if that was the case, but I could have missed something in the mock managers that are returned. I was just surprised to see this still tagged with the 1.4 milestone since I thought it was out.
@dubstack - I started a branch relative to this PR that integrates with systemd, and will post something in the next day or so for you to review. I may have more comments based on that effort.
@derekwaynecarr Got it. I tried to stabilize this PR yesterday and it required debugging individual test failures. Just an FYI!
GCE e2e build/test passed for commit 64c3e88.
@derekwaynecarr @vishh This PR needs more work. The tests are flaking when limits are applied on the pod cgroups, and I haven't been able to understand what exactly is going wrong; I will have to investigate further. Besides that, some tests are failing with just the pod cgroups enabled, which should be fairly easy to resolve. @derekwaynecarr has raised some points which would need to be addressed as well.
Just a heads up on something I found when using this PR but have not fully tracked down yet: --cgroup-root is not defaulting to /, so when the experimental flag is enabled you also need to specify --cgroup-root. I will track down the reason why in the branch I am working on.
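For illustration only, a defaulting helper along these lines would remove the need to pass --cgroup-root explicitly (the function and its placement are assumptions, not what the kubelet actually does):

```go
// defaultCgroupRoot falls back to the root cgroup when --cgroup-root was not
// supplied, so enabling the experimental flag alone is enough.
// (Hypothetical helper; the real defaulting location may differ.)
func defaultCgroupRoot(root string) string {
	if root == "" {
		return "/"
	}
	return root
}
```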
@dubstack PR needs rebase
I plan to get this PR ready for review by EOD 28 Sept.
@dubstack -- please keep me informed. This is a P0 release blocker for 1.5, so it needs quick iteration. I may start peeling off non-controversial aspects into smaller PRs.
```go
	fs.StringVar(&s.SystemCgroups, "system-cgroups", s.SystemCgroups, "Optional absolute name of cgroups in which to place all non-kernel processes that are not already inside a cgroup under `/`. Empty for no container. Rolling back the flag requires a reboot. (Default: \"\").")

	//fs.BoolVar(&s.CgroupsPerQOS, "cgroups-per-qos", s.CgroupsPerQOS, "Enable creation of QoS cgroup hierarchy, if true top level QoS and pod cgroups are created.")
	fs.BoolVar(&s.CgroupsPerQOS, "experimental-cgroups-per-qos", s.CgroupsPerQOS, "Enable creation of QoS cgroup hierarchy, if true top level QoS and pod cgroups are created.")
```
Just call this `cgroups-per-qos`; it needs to be functional in 1.5, and we can state its support level separately from the flag name.
@dubstack -- I am going to carve out parts of this PR into smaller PRs so we can start getting things merged. I will cc you on those PRs for awareness.
OK, I have started cleaning up the delete path in a separate PR. There are a number of issues where container manager details bled up into the kubelet, which made this difficult with other cgroup drivers.
I am closing this PR in favor of the set of PRs I will open shortly. There are a lot of assumptions in this PR that are wrong in the delete path once you plug in a cgroup driver.
Automatic merge from submit-queue: Unblock iterative development on pod-level cgroups. In order to allow forward progress on this feature, it takes the commits from kubernetes#28017 and kubernetes#29049 and then globally disables the flag that allows these features to be exercised in the kubelet. The flag can be re-added to the kubelet when it is actually ready. /cc @vishh @dubstack @kubernetes/rh-cluster-infra
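As a rough illustration of what "globally disables the flag" could look like, assuming the check lives where the kubelet validates its configuration (the field use and log message are assumptions, not the merged code):

```go
// Hypothetical safety check: force the experimental setting off regardless
// of what was passed on the command line, until the feature is ready.
if s.CgroupsPerQOS {
	glog.Warningf("--experimental-cgroups-per-qos is not yet supported; ignoring it")
	s.CgroupsPerQOS = false
}
```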
This PR is linked to the upstream issue #27204 for introducing pod level cgroups into Kubernetes.
@derekwaynecarr @vishh @Random-Liu PTAL.
I have also documented the reasons behind each design decision.
I would like some suggestions/discussion on some comments that I will add in the PR.
Please note that only the second commit is unique to this PR.