Description
Since #4650, kata pods no longer reach the "Completed" state. The main change introduced by that PR is the kind of context used inside the runtime VM code.
Prior to that patch, the context was created using context.Background(), which "is never canceled, has no values, and has no deadline", as shown here:
return &runtimeVM{
	path:    path,
	fifoDir: filepath.Join(root, "crio", "fifo"),
	ctx:     context.Background(),
	ctrs:    make(map[string]containerInfo),
}
With the change made by #4650, the context we use is created with a context coming from the gRPC server as its parent, and that parent has a deadline attached to it, as shown here:
func addRequestName(ctx context.Context, req string) context.Context {
	return context.WithValue(ctx, Name{}, req)
}
Note 1: Printing the context coming from gRPC confirmed that it has a deadline attached to it.
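A minimal sketch of why the parent matters, assuming nothing about CRI-O beyond the helper quoted above: `context.WithValue` only attaches a value, so the child keeps whatever deadline and cancellation its parent has. The `nameKey` type and the one-minute timeout are hypothetical stand-ins for `Name{}` and for the gRPC request context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nameKey is a hypothetical stand-in for the Name{} key type used in the issue.
type nameKey struct{}

// addRequestName mirrors the helper quoted above: it only attaches a value.
func addRequestName(ctx context.Context, req string) context.Context {
	return context.WithValue(ctx, nameKey{}, req)
}

// hasDeadline reports whether ctx carries a deadline.
func hasDeadline(ctx context.Context) bool {
	_, ok := ctx.Deadline()
	return ok
}

// deadlineSurvivesWithValue derives a child from a parent that has a deadline
// (an assumed stand-in for the gRPC request context) and checks the child.
func deadlineSurvivesWithValue() bool {
	parent, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	return hasDeadline(addRequestName(parent, "hypothetical-request"))
}

// backgroundChildHasDeadline does the same with context.Background() as parent.
func backgroundChildHasDeadline() bool {
	return hasDeadline(addRequestName(context.Background(), "hypothetical-request"))
}

func main() {
	fmt.Println("child of deadline parent has deadline:", deadlineSurvivesWithValue())
	fmt.Println("child of Background has deadline:", backgroundChildHasDeadline())
}
```

The same `ctx.Deadline()` check can be dropped into a log statement to confirm what a given request context carries.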
This change alone is enough to keep the following pod from ever reaching the "Completed" status:
#
# Copyright (c) 2021 Red Hat, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
apiVersion: v1
kind: Pod
metadata:
  name: sharevol-kata
spec:
  runtimeClassName: kata
  restartPolicy: Never
  securityContext:
    runAsUser: 1001
    fsGroup: 123
  containers:
    - name: mounttest-container
      image: "k8s.gcr.io/e2e-test-images/agnhost:2.21"
      args:
        - mounttest
        - --fs_type=/test-volume
        - --new_file_0660=/test-volume/test-file
        - --file_perm=/test-volume/test-file
        - --file_owner=/test-volume/test-file
      volumeMounts:
        - name: emptydir-volume
          mountPath: /test-volume
  volumes:
    - name: emptydir-volume
      emptyDir: {}
Note 2: I was able to verify this by making the following change:
[fidencio@bump cri-o]$ git diff
diff --git a/internal/log/interceptors.go b/internal/log/interceptors.go
index cce141e22..ff1e6b4bc 100644
--- a/internal/log/interceptors.go
+++ b/internal/log/interceptors.go
@@ -74,5 +74,5 @@ func addRequestID(ctx context.Context) context.Context {
 }
 func addRequestName(ctx context.Context, req string) context.Context {
-	return context.WithValue(ctx, Name{}, req)
+	return context.WithValue(context.Background(), Name{}, req)
 }
Obviously, this change is not correct, but it serves to narrow down where the issue comes from.
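The workaround in the diff also drops every value stored in the request context. One possible direction, sketched here purely as an illustration and not as a fix CRI-O adopted, is a wrapper that forwards value lookups to the parent while ignoring its deadline and cancellation. The names `detachedContext`, `detach`, and `requestIDKey` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// detachedContext is a hypothetical wrapper: it exposes the parent's values
// but reports no deadline and is never canceled.
type detachedContext struct {
	parent context.Context
}

func (detachedContext) Deadline() (time.Time, bool) { return time.Time{}, false }
func (detachedContext) Done() <-chan struct{}       { return nil }
func (detachedContext) Err() error                  { return nil }

// Value forwards lookups to the parent so request-scoped values stay visible.
func (d detachedContext) Value(key interface{}) interface{} {
	return d.parent.Value(key)
}

// detach returns a context that keeps ctx's values but not its lifetime.
func detach(ctx context.Context) context.Context {
	return detachedContext{parent: ctx}
}

// requestIDKey is a hypothetical key used only for this demonstration.
type requestIDKey struct{}

// detachedKeepsValueAfterCancel cancels the parent and then inspects the
// detached child: the value is still readable and Err() stays nil.
func detachedKeepsValueAfterCancel() (string, error) {
	base := context.WithValue(context.Background(), requestIDKey{}, "hypothetical-id")
	parent, cancel := context.WithCancel(base)
	d := detach(parent)
	cancel()
	id, _ := d.Value(requestIDKey{}).(string)
	return id, d.Err()
}

func main() {
	id, err := detachedKeepsValueAfterCancel()
	fmt.Println("value:", id, "err:", err)
}
```

Whether detaching the lifetime is appropriate for the runtime VM code is a design question this sketch does not answer; it only shows that values and cancellation can be separated.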
My theory, based on quite limited knowledge, is that the pod never reaches the "Completed" status because the context passed down is never "done": its deadline was not exceeded, nor was it cancelled, which leaves us in this weird state.
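One way to test a theory like this is to log the state of the context at the point of interest. The sketch below is only an assumption about how such a probe could look; the `state` helper is hypothetical and where it would be called inside CRI-O is not established by this issue. It checks `ctx.Done()` without blocking and distinguishes a deadline from an explicit cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// state describes ctx without blocking: a closed Done channel means the
// context is finished, and Err() says why.
func state(ctx context.Context) string {
	select {
	case <-ctx.Done():
		switch {
		case errors.Is(ctx.Err(), context.DeadlineExceeded):
			return "done: deadline exceeded"
		case errors.Is(ctx.Err(), context.Canceled):
			return "done: canceled"
		default:
			return "done"
		}
	default:
		return "not done"
	}
}

// backgroundState probes context.Background(), which is never canceled.
func backgroundState() string {
	return state(context.Background())
}

// canceledState probes a context right after its cancel function ran.
func canceledState() string {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return state(ctx)
}

// expiredState probes a context whose deadline is already in the past.
func expiredState() string {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancel()
	return state(ctx)
}

func main() {
	fmt.Println("background:", backgroundState())
	fmt.Println("canceled:  ", canceledState())
	fmt.Println("expired:   ", expiredState())
}
```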
Steps to reproduce the issue:
- Set up a k8s cluster capable of running kata-containers
- Execute the pod described above
- Check that it doesn't reach the "Completed" state
Describe the results you received:
The pod stays in the "Running" state.
Describe the results you expected:
The pod should be marked as "Completed" after a few seconds.
Additional information you deem important (e.g. issue happens only occasionally):
This is blocking kata-containers release, as we'd like to do a bump of the kubernetes version we test against, and this means also a bump of the CRI-O version we test against.
Reverting #4650 is an option, but I'd rather get some help from more experienced CRI-O developers to understand whether we can solve this in a different way.
Output of crio --version:
Version: 1.21.0
GitCommit: b99366680904420974463572f6f0b92166437b0e
GitTreeState: clean
BuildDate: 2021-04-23T18:55:05Z
GoVersion: go1.14.12
Compiler: gc
Platform: linux/amd64
Linkmode: dynamic
Additional environment details (AWS, VirtualBox, physical, etc.):