Description
What happened?
This problem shows up fairly regularly both in CI and in GitHub workflows. Although the (unknown) root cause looks very much like a single problem, it can surface as almost any test failing when cgroup v2 is in use. Digging out the logs always leads to a backtrace like the one below. The current suspect is the hardcoded, shared /pod_123.slice/pod_123-456.slice cgroup parent directory: tests running in parallel collide while setting up (or maybe tearing down) their cgroup directories.
More specifically, the suspected sequence is this: one test sets up its cgroup path, enabling controllers along the path by writing to cgroup.subtree_control the controllers it read from cgroup.controllers. Another test running in parallel (maybe one doing its cleanup?) changes the set of available controllers after the failing test has read that set but before it has written it to cgroup.subtree_control. Exactly why and how the set of controllers would shrink or be removed from a subdirectory is not yet fully understood (by the reporter of this bug), so this is only a suspicion...
klitkey1-mobl c-o-d $ cat localintegration-cgroup-error.log
/sys/fs/cgroup/pod_123.slice/cgroup.subtree_control: no such file or directory
error creating cgroup path /pod_123.slice/pod_123-456.slice/crio-7df459a9035a9216751ccfbb437a8a1f7b8fed808470c45eaa9841d7f8c8480e.scope
github.com/containers/podman/v3/pkg/cgroups.(*CgroupControl).initialize
github.com/containers/podman/[email protected]/pkg/cgroups/cgroups.go:305
github.com/containers/podman/v3/pkg/cgroups.New
github.com/containers/podman/[email protected]/pkg/cgroups/cgroups.go:386
github.com/cri-o/cri-o/internal/config/cgmgr.createSandboxCgroup
github.com/cri-o/cri-o/internal/config/cgmgr/cgmgr.go:150
github.com/cri-o/cri-o/internal/config/cgmgr.(*SystemdManager).CreateSandboxCgroup
github.com/cri-o/cri-o/internal/config/cgmgr/systemd.go:204
github.com/cri-o/cri-o/server.(*Server).runPodSandbox
github.com/cri-o/cri-o/server/sandbox_run_linux.go:902
github.com/cri-o/cri-o/server.(*Server).RunPodSandbox
github.com/cri-o/cri-o/server/sandbox_run.go:68
github.com/cri-o/cri-o/server/cri/v1.(*service).RunPodSandbox
github.com/cri-o/cri-o/server/cri/v1/rpc_run_pod_sandbox.go:12
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_RunPodSandbox_Handler.func1
k8s.io/[email protected]/pkg/apis/runtime/v1/api.pb.go:8893
github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1
github.com/cri-o/cri-o/internal/log/interceptors.go:56
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/[email protected]/chain.go:25
github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1
github.com/cri-o/cri-o/server/metrics/interceptors.go:24
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/[email protected]/chain.go:25
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
github.com/grpc-ecosystem/[email protected]/chain.go:34
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_RunPodSandbox_Handler
k8s.io/[email protected]/pkg/apis/runtime/v1/api.pb.go:8895
google.golang.org/grpc.(*Server).processUnaryRPC
google.golang.org/[email protected]/server.go:1282
google.golang.org/grpc.(*Server).handleStream
google.golang.org/[email protected]/server.go:1616
google.golang.org/grpc.(*Server).serveStreams.func1.2
google.golang.org/[email protected]/server.go:921
runtime.goexit
runtime/asm_amd64.s:1581
create dropped infra 7df459a9035a9216751ccfbb437a8a1f7b8fed808470c45eaa9841d7f8c8480e cgroup
github.com/cri-o/cri-o/server.(*Server).runPodSandbox
github.com/cri-o/cri-o/server/sandbox_run_linux.go:903
github.com/cri-o/cri-o/server.(*Server).RunPodSandbox
github.com/cri-o/cri-o/server/sandbox_run.go:68
github.com/cri-o/cri-o/server/cri/v1.(*service).RunPodSandbox
github.com/cri-o/cri-o/server/cri/v1/rpc_run_pod_sandbox.go:12
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_RunPodSandbox_Handler.func1
k8s.io/[email protected]/pkg/apis/runtime/v1/api.pb.go:8893
github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1
github.com/cri-o/cri-o/internal/log/interceptors.go:56
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/[email protected]/chain.go:25
github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1
github.com/cri-o/cri-o/server/metrics/interceptors.go:24
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/[email protected]/chain.go:25
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
github.com/grpc-ecosystem/[email protected]/chain.go:34
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_RunPodSandbox_Handler
k8s.io/[email protected]/pkg/apis/runtime/v1/api.pb.go:8895
google.golang.org/grpc.(*Server).processUnaryRPC
google.golang.org/[email protected]/server.go:1282
google.golang.org/grpc.(*Server).handleStream
google.golang.org/[email protected]/server.go:1616
google.golang.org/grpc.(*Server).serveStreams.func1.2
google.golang.org/[email protected]/server.go:921
runtime.goexit
runtime/asm_amd64.s:1581" file="[email protected]/chain.go:25" id=b72d9c51-8367-403d-bfbb-a1d59dc41535 name=/runtime.v1.RuntimeService/RunPodSandbox
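The read-then-write window suspected in the description can be simulated without root on a plain temporary directory. The following is only a sketch: `format_controllers`, `enable_controllers`, and `remove_parent` are hypothetical helper names, not CRI-O or podman functions, and a regular directory stands in for the cgroup v2 filesystem. It shows one way a write to cgroup.subtree_control could end in "no such file or directory" when a parallel job removes the shared parent between the read and the write. It does not confirm that this is what happens in the failing tests.

```bash
#!/usr/bin/env bash
# Simulation on a plain directory, NOT on a real cgroupfs mount.
# All function names here are hypothetical and exist only for illustration.

# Turn "cpu io memory" into "+cpu +io +memory" (cgroup.subtree_control syntax).
format_controllers() {
    local out="" c
    for c in $1; do
        out="${out:+$out }+$c"
    done
    printf '%s' "$out"
}

# Stand-in for a parallel test tearing down the shared parent directory.
remove_parent() { rm -rf "$1"; }

# Read the controller list, let an optional command ($2) interfere inside
# the window, then write the list back. Returns non-zero if the write fails.
enable_controllers() {
    local dir=$1 interfere=${2:-true} wanted
    wanted=$(cat "$dir/cgroup.controllers") || return 1
    $interfere "$dir"
    # The redirection fails if the directory vanished in the meantime.
    { printf '%s\n' "$(format_controllers "$wanted")" \
        > "$dir/cgroup.subtree_control"; } 2>/dev/null
}

# Demo: first an undisturbed run, then one with the parent removed mid-way.
parent="$(mktemp -d)/pod_123.slice"
mkdir -p "$parent"
echo "cpu io memory" > "$parent/cgroup.controllers"
enable_controllers "$parent" && cat "$parent/cgroup.subtree_control"
enable_controllers "$parent" remove_parent || echo "write failed after parent removal"
```

In the simulation the interference is injected deterministically; in the real test suite it would depend on timing between parallel jobs, which would explain why the failure is intermittent.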
What did you expect to happen?
Tests should not fail occasionally/regularly because of failed cgroup (v2) setup.
How can we reproduce it (as minimally and precisely as possible)?
- Apply this patch:
From 3ccb3609fed4a2cac75255356f66f437cc22e410 Mon Sep 17 00:00:00 2001
From: Krisztian Litkey <[email protected]>
Date: Sun, 27 Feb 2022 10:16:09 +0000
Subject: [PATCH 1/1] test: allow state of failing tests to be kept intact.
Allow tests to be kept intact upon failure for better
debugging and diagnosing. If set, the newly introduced
TEST_KEEP_ON_FAILURE variable causes failing
tests to skip their teardown phase, leaving all test
artifacts and the related cri-o instance intact.
Signed-off-by: Krisztian Litkey <[email protected]>
---
test/helpers.bash | 15 ++++++++++-----
test/test_runner.sh | 1 +
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/test/helpers.bash b/test/helpers.bash
index b655d09e6..c84420a95 100644
--- a/test/helpers.bash
+++ b/test/helpers.bash
@@ -430,11 +430,16 @@ function cleanup_test() {
cat "$CRIO_LOG"
echo "# --- --- ---"
fi
- cleanup_ctrs
- cleanup_pods
- stop_crio
- cleanup_lvm
- cleanup_testdir
+
+ if [ -z "$TEST_KEEP_ON_FAILURE" ] || [ "${BATS_TEST_COMPLETED:-}" = "1" ]; then
+ cleanup_ctrs
+ cleanup_pods
+ stop_crio
+ cleanup_lvm
+ cleanup_testdir
+ else
+ echo >&3 "* Failed \"$BATS_TEST_DESCRIPTION\", TESTDIR=$TESTDIR, LVM_DEVICE=${LVM_DEVICE:-}"
+ fi
}
function load_apparmor_profile() {
diff --git a/test/test_runner.sh b/test/test_runner.sh
index f8b32831d..3435db3ac 100755
--- a/test/test_runner.sh
+++ b/test/test_runner.sh
@@ -2,6 +2,7 @@
set -e
TEST_USERNS=${TEST_USERNS:-}
+TEST_KEEP_ON_FAILURE=${TEST_KEEP_ON_FAILURE:-}
cd "$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
--
2.34.1
- Run something like this:
[root@e2e-fedora-crio cri-o]# while true; do TEST_KEEP_ON_FAILURE=1 JOBS=4 ./test/test_runner.sh || break; done
- Hope for the best/worst kind of luck, and observe the result.
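To check whether a kept failure matches this report, the CRI-O log left behind can be searched for the error text quoted in the log above. A minimal sketch, assuming the log path is passed in by hand; `has_cgroup_setup_error` is a hypothetical helper, not part of test/helpers.bash:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) if the given log file contains
# the cgroup setup failure text from this report, and fails otherwise.
has_cgroup_setup_error() {
    local log=$1
    # -F: match fixed strings, -q: no output, only the exit status matters.
    grep -qF -e 'cgroup.subtree_control: no such file or directory' \
             -e 'error creating cgroup path' "$log"
}

# Example (path is an assumption; use wherever the kept log actually lives):
#   has_cgroup_setup_error "$TESTDIR/crio.log" && echo "matches the cgroup error"
```

Matching on either string keeps the check loose on purpose, since the exact wording around the path may differ between runs.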
Anything else we need to know?
No response
CRI-O and Kubernetes version
$ crio --version
crio version 1.24.0
Version: 1.24.0
GitCommit: e04b38169a7de240f69805bb7797cff7ea83ad12
GitTreeState: clean
BuildDate: 2022-02-26T13:17:29Z
GoVersion: go1.17.6
Compiler: gc
Platform: linux/amd64
Linkmode: dynamic
BuildTags: containers_image_openpgp, containers_image_ostree_stub, seccomp, selinux
SeccompEnabled: true
AppArmorEnabled: false
# kubernetes version:
# should not matter, not used in failing tests.
OS version
# I think this does not matter; I *think* I've seen this happening on GH/in CI, too...
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="35 (Cloud Edition)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Cloud Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Cloud Edition"
VARIANT_ID=cloud
$ uname -a
5.16.5-200.fc35.x86_64