[BUG] Unexpected replica remains on node after all volumes have been cleaned up and causing unexpected scheduled storage #11177

Description

@yangchiu

Describe the Bug

In the daily regression run https://ci.longhorn.io/job/public/job/v1.9.x/job/v1.9.x-longhorn-upgrade-tests-sles-amd64/40/, after test case test_data_locality_basic failed, an unexpected replica longhorn-testvol-4yxuec-1-r-631a8451 remained on node ip-10-0-2-145 even though all volumes had been cleaned up in the teardown function. The leftover replica causes unexpected scheduled storage:

failed on setup with "AssertionError: Wrong disk(default-disk) storageScheduled status.
Expect=0
Got=524288000
node={'address': '10.42.2.27', 'allowScheduling': True, 'autoEvicting': False, 'conditions': {'HugePagesAvailable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:33:59Z', 'message': 'HugePages (2Mi) are properly configured', 'reason': '', 'status': 'True'}, 'KernelModulesLoaded': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': 'Kernel modules [dm_crypt vfio_pci uio_pci_generic nvme_tcp] are loaded', 'reason': '', 'status': 'True'}, 'MountPropagation': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:32Z', 'message': '', 'reason': '', 'status': 'True'}, 'Multipathd': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': '', 'reason': '', 'status': 'True'}, 'NFSClientInstalled': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': '', 'reason': '', 'status': 'True'}, 'Ready': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-25T01:54:13Z', 'message': 'Kubernetes node ip-10-0-2-145 not ready: NodeStatusUnknown', 'reason': 'KubernetesNodeNotReady', 'status': 'False'}, 'RequiredPackages': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': 'All required packages [nfs-client open-iscsi cryptsetup device-mapper] are installed', 'reason': '', 'status': 'True'}, 'Schedulable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:38:11Z', 'message': '', 'reason': '', 'status': 'True'}}, 'disks': {'default-disk': {'allowScheduling': True, 'conditions': {'Ready': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:07:29Z', 'message': 'Disk default-disk(/var/lib/longhorn/) on node ip-10-0-2-145 is ready', 'reason': '', 'status': 'True'}, 'Schedulable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:07:29Z', 'message': 'Disk default-disk(/var/lib/longhorn/) on node ip-10-0-2-145 is schedulable', 'reason': '', 'status': 'True'}}, 'diskDriver': '', 'diskType': 'filesystem', 'diskUUID': 
'f33f9bda-8706-452a-b483-b99178a84a11', 'evictionRequested': False, 'path': '/var/lib/longhorn/', 'scheduledBackingImage': {}, 'scheduledReplica': {'longhorn-testvol-4yxuec-1-r-631a8451': 524288000}, 'storageAvailable': 27367833600, 'storageMaximum': 42858426368, 'storageReserved': 12857527910, 'storageScheduled': 524288000, 'tags': []}}, 'evictionRequested': False, 'instanceManagerCPURequest': 0, 'name': 'ip-10-0-2-145', 'region': '', 'tags': [], 'zone': ''}
volumes={'createTypes': {'volume': 'http://10.42.4.25:9500/v1/volumes'}, 'data': [], 'resourceType': 'volume'}"

This causes the assertion in the test case teardown function to fail. The assertion expects all volumes and replicas to have been cleaned up, so storageScheduled should be 0. Instead it is 524288000 (500 MiB), which is the space consumed by the unexpected scheduled replica.
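The check that trips here can be sketched roughly as follows. This is a minimal illustration, not the actual test helper: the function name `find_leftover_replicas` is hypothetical, and the `node` dict only mirrors the fields shown in the assertion message above.

```python
def find_leftover_replicas(node):
    """Return {disk_name: {replica_name: size}} for disks that are not clean.

    A disk counts as clean when storageScheduled is 0 and scheduledReplica
    is empty. `node` is assumed to have the shape printed in the assertion.
    """
    leftovers = {}
    for disk_name, disk in node.get("disks", {}).items():
        scheduled = disk.get("scheduledReplica", {})
        if disk.get("storageScheduled", 0) != 0 or scheduled:
            leftovers[disk_name] = dict(scheduled)
    return leftovers


# Trimmed copy of the node object from the failure message.
node = {
    "name": "ip-10-0-2-145",
    "disks": {
        "default-disk": {
            "scheduledReplica": {
                "longhorn-testvol-4yxuec-1-r-631a8451": 524288000,
            },
            "storageScheduled": 524288000,
        }
    },
}

leftovers = find_leftover_replicas(node)
for disk_name, replicas in leftovers.items():
    print(disk_name, replicas, "total:", sum(replicas.values()))
```

Reporting the replica names along with the total makes it easier to see which object is holding the scheduled storage, rather than only that the total is non-zero.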

For some reason, this replica is not deleted by any subsequent test case's teardown function, which leads to cascading failures. Is it because the replica no longer belongs to any volume? Is it an orphaned replica? Or is some other factor preventing it from being deleted? I don't know yet.
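One way to test the "does not belong to any volume" hypothesis is to compare the scheduled replica names against the list of existing volumes. The sketch below assumes replica names follow the pattern `<volume>-r-<suffix>`, which is inferred from the names in this report (replica longhorn-testvol-4yxuec-1-r-631a8451, volume longhorn-testvol-4yxuec-1) rather than from any documented guarantee. Both helper names are hypothetical.

```python
def owning_volume(replica_name):
    """Guess the volume name from a replica name.

    Assumption: names look like "<volume>-r-<suffix>". Returns None when
    the name does not match that pattern.
    """
    volume, sep, _suffix = replica_name.rpartition("-r-")
    return volume if sep else None


def replicas_without_volume(scheduled_replica, volume_names):
    """Return the scheduled replica names whose guessed volume is missing."""
    existing = set(volume_names)
    return [
        name
        for name in sorted(scheduled_replica)
        if owning_volume(name) not in existing
    ]


# Values taken from the assertion message: one scheduled replica, no volumes.
scheduled_replica = {"longhorn-testvol-4yxuec-1-r-631a8451": 524288000}
volume_names = []  # the volumes list in the failure output is empty

unowned = replicas_without_volume(scheduled_replica, volume_names)
print(unowned)
```

If the replica shows up in this list, it is at least consistent with the idea that its volume was already deleted. That alone does not explain why the replica object survived the cleanup.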

The longhorn-manager log contains the following messages:

2025-06-25T01:53:12.331186105Z time="2025-06-25T01:53:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"longhorn-testvol-4yxuec-1-e-0\", UID:\"5b266ebe-94fe-41ef-860b-f75df58f3ecb\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"225292\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilt' Detected replica longhorn-testvol-4yxuec-1-r-631a8451 (10.42.2.152:11169) has been rebuilt" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2025-06-25T01:53:12.383045182Z time="2025-06-25T01:53:12Z" level=warning msg="Failed to check if replica longhorn-testvol-4yxuec-1-r-631a8451 transitioned to mode RW after it was last healthy" func="controller.(*VolumeController).ReconcileEngineReplicaState" file="volume_controller.go:752" accessMode=rwo controller=longhorn-volume currentEngine=longhorn-testvol-4yxuec-1-e-0 error="cannot parse timestamp : parsing time \"\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"\" as \"2006\"" frontend=blockdev migratable=false node=ip-10-0-2-145 owner=ip-10-0-2-145 state=attached volume=longhorn-testvol-4yxuec-1

There are also some errors in the system log of node ip-10-0-2-145:

2025-06-25T08:47:37.250961+00:00 ip-10-0-2-145 k3s[1450]: I0625 08:47:37.250860    1450 reconciler_common.go:159] "operationExecutor.UnmountVolume started for volume \"longhorn-testvol-4yxuec-1-pv\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1\") pod \"783db0f8-7a28-44b4-8797-597115cb0363\" (UID: \"783db0f8-7a28-44b4-8797-597115cb0363\") "
2025-06-25T08:47:37.251082+00:00 ip-10-0-2-145 k3s[1450]: E0625 08:47:37.250987    1450 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1 podName:783db0f8-7a28-44b4-8797-597115cb0363 nodeName:}" failed. No retries permitted until 2025-06-25 08:47:39.250952096 +0000 UTC m=+7.702572415 (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "longhorn-testvol-4yxuec-1-pv" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1") pod "783db0f8-7a28-44b4-8797-597115cb0363" (UID: "783db0f8-7a28-44b4-8797-597115cb0363") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers

And some warnings in the longhorn-csi-plugin log:

2025-06-25T08:47:39.374266802Z time="2025-06-25T08:47:39Z" level=info msg="Trying to clean up mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/fc168963fb976a0a88a608b6770aeb3563bd07fed42fe8562fb4b8a0bb6c5c50/globalmount/longhorn-testvol-4yxuec-1" func=csi.unmountAndCleanupMountPoint file="util.go:383"
2025-06-25T08:47:39.374283353Z W0625 08:47:39.374222       1 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/fc168963fb976a0a88a608b6770aeb3563bd07fed42fe8562fb4b8a0bb6c5c50/globalmount/longhorn-testvol-4yxuec-1

I need help checking whether anything else in the logs looks suspicious.

To Reproduce

No response

Expected Behavior

No unexpected replica remains on nodes.

Support Bundle for Troubleshooting

v1.9.x-longhorn-upgrade-tests-sles-amd64-40-bundle.zip

Environment

  • Longhorn version: v1.9.x-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

#5120 #6389

Workaround and Mitigation

No response

Metadata

Labels

  • area/volume-replica-scheduling: Volume replica scheduling related
  • kind/bug
  • priority/2: Nice to implement or fix in this release (managed by PO)
  • reproduce/rare: < 50% reproducible
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/qa-reproduce: Require QA to reproduce, especially for issues reported from community
  • require/qa-review-coverage: Require QA to review coverage
  • severity/3: Function working but has a major or UI issue w/ workaround

Status
Closed
