Describe the Bug
In the daily regression run https://ci.longhorn.io/job/public/job/v1.9.x/job/v1.9.x-longhorn-upgrade-tests-sles-amd64/40/, after test case test_data_locality_basic failed, an unexpected replica longhorn-testvol-4yxuec-1-r-631a8451 remained on node ip-10-0-2-145 even though all volumes had been cleaned up in the teardown function. The leftover replica causes unexpected scheduled storage:
failed on setup with "AssertionError: Wrong disk(default-disk) storageScheduled status.
Expect=0
Got=524288000
node={'address': '10.42.2.27', 'allowScheduling': True, 'autoEvicting': False, 'conditions': {'HugePagesAvailable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:33:59Z', 'message': 'HugePages (2Mi) are properly configured', 'reason': '', 'status': 'True'}, 'KernelModulesLoaded': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': 'Kernel modules [dm_crypt vfio_pci uio_pci_generic nvme_tcp] are loaded', 'reason': '', 'status': 'True'}, 'MountPropagation': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:32Z', 'message': '', 'reason': '', 'status': 'True'}, 'Multipathd': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': '', 'reason': '', 'status': 'True'}, 'NFSClientInstalled': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': '', 'reason': '', 'status': 'True'}, 'Ready': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-25T01:54:13Z', 'message': 'Kubernetes node ip-10-0-2-145 not ready: NodeStatusUnknown', 'reason': 'KubernetesNodeNotReady', 'status': 'False'}, 'RequiredPackages': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': 'All required packages [nfs-client open-iscsi cryptsetup device-mapper] are installed', 'reason': '', 'status': 'True'}, 'Schedulable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:38:11Z', 'message': '', 'reason': '', 'status': 'True'}}, 'disks': {'default-disk': {'allowScheduling': True, 'conditions': {'Ready': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:07:29Z', 'message': 'Disk default-disk(/var/lib/longhorn/) on node ip-10-0-2-145 is ready', 'reason': '', 'status': 'True'}, 'Schedulable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:07:29Z', 'message': 'Disk default-disk(/var/lib/longhorn/) on node ip-10-0-2-145 is schedulable', 'reason': '', 'status': 'True'}}, 'diskDriver': '', 'diskType': 'filesystem', 'diskUUID': 
'f33f9bda-8706-452a-b483-b99178a84a11', 'evictionRequested': False, 'path': '/var/lib/longhorn/', 'scheduledBackingImage': {}, 'scheduledReplica': {'longhorn-testvol-4yxuec-1-r-631a8451': 524288000}, 'storageAvailable': 27367833600, 'storageMaximum': 42858426368, 'storageReserved': 12857527910, 'storageScheduled': 524288000, 'tags': []}}, 'evictionRequested': False, 'instanceManagerCPURequest': 0, 'name': 'ip-10-0-2-145', 'region': '', 'tags': [], 'zone': ''}
volumes={'createTypes': {'volume': 'http://10.42.4.25:9500/v1/volumes'}, 'data': [], 'resourceType': 'volume'}"
This makes the assertion in the test case teardown function fail: teardown expects all volumes and replicas to have been cleaned up, so storageScheduled should be 0, but it is 524288000, which is the space consumed by the unexpected scheduled replica.
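As a minimal sketch of the accounting behind that assertion, the snippet below walks a node object shaped like the one in the failure message and reports which replicas are still scheduled on each disk. The helper name `leftover_scheduled_replicas` is hypothetical and is not part of the Longhorn test suite; the `node` dict is a trimmed copy of the output above.

```python
def leftover_scheduled_replicas(node):
    """Return {disk_name: {replica_name: size_in_bytes}} for every disk
    that still has at least one scheduled replica."""
    leftovers = {}
    for disk_name, disk in node.get("disks", {}).items():
        scheduled = disk.get("scheduledReplica") or {}
        if scheduled:
            leftovers[disk_name] = dict(scheduled)
    return leftovers


# Trimmed copy of the node object from the assertion message.
node = {
    "name": "ip-10-0-2-145",
    "disks": {
        "default-disk": {
            "scheduledReplica": {
                "longhorn-testvol-4yxuec-1-r-631a8451": 524288000,
            },
            "storageScheduled": 524288000,
        }
    },
}

leftovers = leftover_scheduled_replicas(node)
for disk_name, replicas in leftovers.items():
    total = sum(replicas.values())
    print(f"{node['name']}/{disk_name}: {total} bytes held by {sorted(replicas)}")
```

For this node, the single entry in scheduledReplica accounts for the whole storageScheduled value, so the leftover replica alone explains the failed check.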
This replica is not deleted by any subsequent test case's teardown function, which leads to cascading failures. Is it because it doesn't belong to any volume? Is it an orphaned replica? Or is something else preventing it from being deleted? I don't know yet.
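One way to start narrowing down the first question is to compare the replica name against the volume list from the assertion message. This is only a rough sketch: it assumes the replica name has the form `<volume name>-r-<suffix>`, as the names in this report suggest, and that each entry in the volume list's `data` array has a `name` field. Both helper names are hypothetical. It only shows that no listed volume matches; it does not prove that Longhorn treats the replica as orphaned.

```python
def owning_volume_name(replica_name):
    """Guess the owning volume from a replica name of the assumed form
    '<volume name>-r-<suffix>'. Returns None if the pattern does not match."""
    volume_name, sep, _suffix = replica_name.rpartition("-r-")
    return volume_name if sep else None


def replicas_without_volume(replica_names, volumes):
    """Return the replicas whose guessed owning volume is not in the volume list."""
    existing = {v["name"] for v in volumes.get("data", [])}
    return [r for r in replica_names if owning_volume_name(r) not in existing]


# Values copied from the assertion message: the volume list is empty.
volumes = {"data": [], "resourceType": "volume"}
scheduled = ["longhorn-testvol-4yxuec-1-r-631a8451"]

print(owning_volume_name(scheduled[0]))
print(replicas_without_volume(scheduled, volumes))
```

With the empty volume list from the failure, the replica maps to volume longhorn-testvol-4yxuec-1, which no longer appears in the API response.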
There are some errors in the longhorn-manager:
2025-06-25T01:53:12.331186105Z time="2025-06-25T01:53:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"longhorn-testvol-4yxuec-1-e-0\", UID:\"5b266ebe-94fe-41ef-860b-f75df58f3ecb\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"225292\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilt' Detected replica longhorn-testvol-4yxuec-1-r-631a8451 (10.42.2.152:11169) has been rebuilt" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2025-06-25T01:53:12.383045182Z time="2025-06-25T01:53:12Z" level=warning msg="Failed to check if replica longhorn-testvol-4yxuec-1-r-631a8451 transitioned to mode RW after it was last healthy" func="controller.(*VolumeController).ReconcileEngineReplicaState" file="volume_controller.go:752" accessMode=rwo controller=longhorn-volume currentEngine=longhorn-testvol-4yxuec-1-e-0 error="cannot parse timestamp : parsing time \"\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"\" as \"2006\"" frontend=blockdev migratable=false node=ip-10-0-2-145 owner=ip-10-0-2-145 state=attached volume=longhorn-testvol-4yxuec-1
And some errors in the node ip-10-0-2-145:
2025-06-25T08:47:37.250961+00:00 ip-10-0-2-145 k3s[1450]: I0625 08:47:37.250860 1450 reconciler_common.go:159] "operationExecutor.UnmountVolume started for volume \"longhorn-testvol-4yxuec-1-pv\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1\") pod \"783db0f8-7a28-44b4-8797-597115cb0363\" (UID: \"783db0f8-7a28-44b4-8797-597115cb0363\") "
2025-06-25T08:47:37.251082+00:00 ip-10-0-2-145 k3s[1450]: E0625 08:47:37.250987 1450 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1 podName:783db0f8-7a28-44b4-8797-597115cb0363 nodeName:}" failed. No retries permitted until 2025-06-25 08:47:39.250952096 +0000 UTC m=+7.702572415 (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "longhorn-testvol-4yxuec-1-pv" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1") pod "783db0f8-7a28-44b4-8797-597115cb0363" (UID: "783db0f8-7a28-44b4-8797-597115cb0363") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers
And some errors in the longhorn-csi-plugin:
2025-06-25T08:47:39.374266802Z time="2025-06-25T08:47:39Z" level=info msg="Trying to clean up mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/fc168963fb976a0a88a608b6770aeb3563bd07fed42fe8562fb4b8a0bb6c5c50/globalmount/longhorn-testvol-4yxuec-1" func=csi.unmountAndCleanupMountPoint file="util.go:383"
2025-06-25T08:47:39.374283353Z W0625 08:47:39.374222 1 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/fc168963fb976a0a88a608b6770aeb3563bd07fed42fe8562fb4b8a0bb6c5c50/globalmount/longhorn-testvol-4yxuec-1
Help is needed to check whether anything else in the logs looks suspicious.
To Reproduce
No response
Expected Behavior
No unexpected replica remains on nodes.
Support Bundle for Troubleshooting
v1.9.x-longhorn-upgrade-tests-sles-amd64-40-bundle.zip
Environment
- Longhorn version: v1.9.x-head
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
#5120 #6389
Workaround and Mitigation
No response