Description
I believe this is a bug in the resource cleanup logic found here:
cri-o/internal/resourcestore/resourcestore.go
Lines 91 to 93 in 0e6266b
```go
for _, f := range r.cleanupFuncs {
	f()
}
```
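To illustrate why the error vanishes, here is a minimal, hypothetical sketch of what one of those cleanup funcs ends up doing. Because the store's cleanupFuncs have type `func()`, a closure wrapping the network teardown has nowhere to surface a failure; the `networkTeardown` name and sandbox ID here are invented for the example and are not CRI-O's actual API:

```go
package main

import (
	"errors"
	"log"
)

// networkTeardown stands in for the CNI DEL call; the name is invented
// for this example and is not a real CRI-O function.
func networkTeardown(sandboxID string) error {
	return errors.New("CNI DEL failed")
}

func main() {
	sandboxID := "k8s_hello-world-7b8969fc6d-4zrq2_default_72e5d565-87b2-4d44-8f5a-4c4d5d7df14c_0"

	// Because the cleanupFuncs have type func(), this closure has nowhere
	// to report a failed teardown; at best it can log and return.
	cleanup := func() {
		if err := networkTeardown(sandboxID); err != nil {
			// The error is dropped here: the loop that iterates over
			// cleanupFuncs never sees it, so nothing retries.
			log.Printf("failed to tear down network for %s: %v", sandboxID, err)
		}
	}

	cleanup() // what `for _, f := range r.cleanupFuncs { f() }` ends up doing
}
```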
I found this while investigating an issue where CNI resources (specifically, IP addresses) were being leaked on a cluster using CRI-O.
In my logs, I can clearly see that the affected sandboxes are being garbage collected in that loop:
```
Cleaning up stale resource k8s_hello-world-7b8969fc6d-4zrq2_default_72e5d565-87b2-4d44-8f5a-4c4d5d7df14c_0
```
However, while executing the cleanupFuncs, an error occurs during the CNI DEL call, so the CNI plugin never gets a chance to release the state associated with the resource. Since the cleanupFuncs don't return errors, the garbage collector has no way to know that the resource wasn't properly cleaned up and never retries, resulting in leaked state.
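To make the expectation concrete, here is a minimal sketch of one possible direction, not CRI-O's actual API: cleanup funcs that return `error`, with resources whose cleanup failed kept in the store so a later GC pass can retry them. All type and function names below are invented for the example:

```go
package main

import (
	"errors"
	"fmt"
)

// store is a simplified, hypothetical version of the resource store; the
// real cleanupFuncs in cri-o have type func() and cannot report failure.
type store struct {
	cleanupFuncs map[string][]func() error // keyed by resource name
}

// cleanupStale runs each resource's cleanup funcs. Resources whose cleanup
// fails are kept so the next GC pass can retry them instead of leaking state.
func (s *store) cleanupStale() {
	for name, funcs := range s.cleanupFuncs {
		failed := false
		for _, f := range funcs {
			if err := f(); err != nil {
				fmt.Printf("cleanup of %s failed, will retry: %v\n", name, err)
				failed = true
				break
			}
		}
		if !failed {
			delete(s.cleanupFuncs, name)
		}
	}
}

func main() {
	s := &store{cleanupFuncs: map[string][]func() error{
		"k8s_hello-world": {func() error { return errors.New("CNI DEL failed") }},
	}}
	s.cleanupStale() // failure is reported and the resource is kept
	s.cleanupStale() // retried on the next pass
}
```

A real implementation would presumably also want a retry cap or back-off so a permanently failing teardown doesn't loop forever; the sketch only illustrates the error-propagation point.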
Steps to reproduce the issue:
It's a bit tricky to reproduce. In my environment, I believe I am seeing this due to a combination of resource contention and potentially a bug in the CNI layer. I am still investigating the root cause, so I might be able to add more here later.
Describe the results you received:
Leaked CNI state on failed teardown.
Describe the results you expected:
Retry GC on resources that fail garbage collection, no leaked state.
Additional information you deem important (e.g. issue happens only occasionally):
Output of `crio --version`:

```
crio version 1.18.2-18.rhaos4.5.git754d46b.el8
Version:    1.18.2-18.rhaos4.5.git754d46b.el8
GoVersion:  go1.13.4
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic
```
Additional environment details (AWS, VirtualBox, physical, etc.):
Seen on an OpenShift cluster.