[BUG] Unexpected replica remains on node after all volumes have been cleaned up, causing unexpected scheduled storage #11177

### Describe the Bug

In the daily regression run https://ci.longhorn.io/job/public/job/v1.9.x/job/v1.9.x-longhorn-upgrade-tests-sles-amd64/40/, after test case `test_data_locality_basic` [failed](https://ci.longhorn.io/job/public/job/v1.9.x/job/v1.9.x-longhorn-upgrade-tests-sles-amd64/40/testReport/junit/tests/test_scheduling/test_data_locality_basic/), an unexpected replica `longhorn-testvol-4yxuec-1-r-631a8451` remained on node `ip-10-0-2-145` even though all volumes had been cleaned up in the teardown function. The leftover replica causes unexpected scheduled storage:

```
failed on setup with "AssertionError: Wrong disk(default-disk) storageScheduled status.
Expect=0
Got=524288000
node={'address': '10.42.2.27', 'allowScheduling': True, 'autoEvicting': False, 'conditions': {'HugePagesAvailable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:33:59Z', 'message': 'HugePages (2Mi) are properly configured', 'reason': '', 'status': 'True'}, 'KernelModulesLoaded': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': 'Kernel modules [dm_crypt vfio_pci uio_pci_generic nvme_tcp] are loaded', 'reason': '', 'status': 'True'}, 'MountPropagation': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:32Z', 'message': '', 'reason': '', 'status': 'True'}, 'Multipathd': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': '', 'reason': '', 'status': 'True'}, 'NFSClientInstalled': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': '', 'reason': '', 'status': 'True'}, 'Ready': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-25T01:54:13Z', 'message': 'Kubernetes node ip-10-0-2-145 not ready: NodeStatusUnknown', 'reason': 'KubernetesNodeNotReady', 'status': 'False'}, 'RequiredPackages': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T11:20:02Z', 'message': 'All required packages [nfs-client open-iscsi cryptsetup device-mapper] are installed', 'reason': '', 'status': 'True'}, 'Schedulable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:38:11Z', 'message': '', 'reason': '', 'status': 'True'}}, 'disks': {'default-disk': {'allowScheduling': True, 'conditions': {'Ready': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:07:29Z', 'message': 'Disk default-disk(/var/lib/longhorn/) on node ip-10-0-2-145 is ready', 'reason': '', 'status': 'True'}, 'Schedulable': {'lastProbeTime': '', 'lastTransitionTime': '2025-06-24T22:07:29Z', 'message': 'Disk default-disk(/var/lib/longhorn/) on node ip-10-0-2-145 is schedulable', 'reason': '', 'status': 'True'}}, 'diskDriver': '', 'diskType': 'filesystem', 'diskUUID': 
'f33f9bda-8706-452a-b483-b99178a84a11', 'evictionRequested': False, 'path': '/var/lib/longhorn/', 'scheduledBackingImage': {}, 'scheduledReplica': {'longhorn-testvol-4yxuec-1-r-631a8451': 524288000}, 'storageAvailable': 27367833600, 'storageMaximum': 42858426368, 'storageReserved': 12857527910, 'storageScheduled': 524288000, 'tags': []}}, 'evictionRequested': False, 'instanceManagerCPURequest': 0, 'name': 'ip-10-0-2-145', 'region': '', 'tags': [], 'zone': ''}
volumes={'createTypes': {'volume': 'http://10.42.4.25:9500/v1/volumes'}, 'data': [], 'resourceType': 'volume'}"
```

This makes the assertion in the test case teardown function fail. It expects all volumes and replicas to have been cleaned up, so `storageScheduled` should be `0`, but it is `524288000` (500 MiB), which is consumed by the unexpected scheduled replica.
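The check behind this assertion can be sketched as follows. This is a minimal illustration, not the actual test code: `leftover_scheduled_replicas` is a hypothetical helper, and `node` is a trimmed copy of the node dump above that keeps only the fields involved.

```python
def leftover_scheduled_replicas(node):
    """Map disk name -> {replica name: bytes} for disks that still hold scheduled storage."""
    leftovers = {}
    for disk_name, disk in node.get("disks", {}).items():
        scheduled = disk.get("scheduledReplica", {})
        # A clean node is expected to report storageScheduled == 0 and no scheduled replicas.
        if disk.get("storageScheduled", 0) != 0 or scheduled:
            leftovers[disk_name] = dict(scheduled)
    return leftovers


# Trimmed copy of the node dump from the assertion message (other fields omitted).
node = {
    "name": "ip-10-0-2-145",
    "disks": {
        "default-disk": {
            "scheduledReplica": {"longhorn-testvol-4yxuec-1-r-631a8451": 524288000},
            "storageScheduled": 524288000,
        }
    },
}

leftovers = leftover_scheduled_replicas(node)
for disk_name, replicas in leftovers.items():
    for replica_name, size in replicas.items():
        print(f"{node['name']}/{disk_name}: {replica_name} holds {size // 2**20} MiB")
```

With the values from this run, the only entry is the replica `longhorn-testvol-4yxuec-1-r-631a8451`, and its size accounts for the whole `storageScheduled` value.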

Somehow this replica is not deleted by any subsequent test case's teardown function, which leads to [cascading failures](https://ci.longhorn.io/job/public/job/v1.9.x/job/v1.9.x-longhorn-upgrade-tests-sles-amd64/40/#showFailuresLink). Is it because it doesn't belong to any volume? Is it an orphaned replica? Or could another factor be preventing it from being deleted? I don't have an answer yet.
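One way to narrow down the questions above is to compare each replica's owning volume with the volumes that still exist. The sketch below assumes the Replica custom resources have been exported as dictionaries (for example from the support bundle) and that each one carries its owning volume name in `spec.volumeName`. The helper name and the sample input are hypothetical; only the replica and volume names are taken from this report.

```python
def replicas_without_volume(replicas, volume_names):
    """Return the names of replicas whose owning volume is no longer present."""
    existing = set(volume_names)
    dangling = []
    for replica in replicas:
        owner = replica.get("spec", {}).get("volumeName", "")
        if owner not in existing:
            dangling.append(replica["metadata"]["name"])
    return dangling


# Hypothetical export of the Replica CRs; the spec.volumeName value is an
# assumption inferred from the replica name and the engine name in the logs.
replicas = [
    {
        "metadata": {"name": "longhorn-testvol-4yxuec-1-r-631a8451"},
        "spec": {"volumeName": "longhorn-testvol-4yxuec-1"},
    }
]
volume_names = []  # the volume list in the assertion output is empty

print(replicas_without_volume(replicas, volume_names))
```

If the replica shows up here, it has outlived its volume. If no Replica CR exists at all, the `scheduledReplica` entry on the node would need a different explanation.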

There are some suspicious messages in the `longhorn-manager` logs:
```
2025-06-25T01:53:12.331186105Z time="2025-06-25T01:53:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"longhorn-testvol-4yxuec-1-e-0\", UID:\"5b266ebe-94fe-41ef-860b-f75df58f3ecb\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"225292\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilt' Detected replica longhorn-testvol-4yxuec-1-r-631a8451 (10.42.2.152:11169) has been rebuilt" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2025-06-25T01:53:12.383045182Z time="2025-06-25T01:53:12Z" level=warning msg="Failed to check if replica longhorn-testvol-4yxuec-1-r-631a8451 transitioned to mode RW after it was last healthy" func="controller.(*VolumeController).ReconcileEngineReplicaState" file="volume_controller.go:752" accessMode=rwo controller=longhorn-volume currentEngine=longhorn-testvol-4yxuec-1-e-0 error="cannot parse timestamp : parsing time \"\" as \"2006-01-02T15:04:05Z07:00\": cannot parse \"\" as \"2006\"" frontend=blockdev migratable=false node=ip-10-0-2-145 owner=ip-10-0-2-145 state=attached volume=longhorn-testvol-4yxuec-1
```

And some errors in the k3s logs on node `ip-10-0-2-145`:
```
2025-06-25T08:47:37.250961+00:00 ip-10-0-2-145 k3s[1450]: I0625 08:47:37.250860    1450 reconciler_common.go:159] "operationExecutor.UnmountVolume started for volume \"longhorn-testvol-4yxuec-1-pv\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1\") pod \"783db0f8-7a28-44b4-8797-597115cb0363\" (UID: \"783db0f8-7a28-44b4-8797-597115cb0363\") "
2025-06-25T08:47:37.251082+00:00 ip-10-0-2-145 k3s[1450]: E0625 08:47:37.250987    1450 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1 podName:783db0f8-7a28-44b4-8797-597115cb0363 nodeName:}" failed. No retries permitted until 2025-06-25 08:47:39.250952096 +0000 UTC m=+7.702572415 (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "longhorn-testvol-4yxuec-1-pv" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^longhorn-testvol-4yxuec-1") pod "783db0f8-7a28-44b4-8797-597115cb0363" (UID: "783db0f8-7a28-44b4-8797-597115cb0363") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers
```

And some warnings in the `longhorn-csi-plugin` logs:
```
2025-06-25T08:47:39.374266802Z time="2025-06-25T08:47:39Z" level=info msg="Trying to clean up mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/fc168963fb976a0a88a608b6770aeb3563bd07fed42fe8562fb4b8a0bb6c5c50/globalmount/longhorn-testvol-4yxuec-1" func=csi.unmountAndCleanupMountPoint file="util.go:383"
2025-06-25T08:47:39.374283353Z W0625 08:47:39.374222       1 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/fc168963fb976a0a88a608b6770aeb3563bd07fed42fe8562fb4b8a0bb6c5c50/globalmount/longhorn-testvol-4yxuec-1
```
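To line up the events quoted in this report, the sketch below orders three timestamps copied from the node dump and the log excerpts and prints how far each is from the first. It is only arithmetic on values already shown above and does not establish any causal link. The event labels are my own wording.

```python
from datetime import datetime


def parse_utc(ts):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Timestamps copied from the node conditions and the logs in this report.
events = [
    ("replica reported as rebuilt (longhorn-manager)", "2025-06-25T01:53:12Z"),
    ("node Ready condition turned False (NodeStatusUnknown)", "2025-06-25T01:54:13Z"),
    ("UnmountVolume.TearDown failed (k3s)", "2025-06-25T08:47:37.250961+00:00"),
]

start = parse_utc(events[0][1])
offsets = [(label, parse_utc(ts) - start) for label, ts in events]
for label, offset in offsets:
    print(f"+{int(offset.total_seconds()):>6}s  {label}")
```

The node's `Ready` condition changed about a minute after the replica was reported as rebuilt, while the unmount failure was logged several hours later.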

Help is needed to check whether anything else in the logs looks suspicious.




### To Reproduce

_No response_

### Expected Behavior

No unexpected replica remains on nodes.

### Support Bundle for Troubleshooting

[v1.9.x-longhorn-upgrade-tests-sles-amd64-40-bundle.zip](https://github.com/user-attachments/files/20915119/v1.9.x-longhorn-upgrade-tests-sles-amd64-40-bundle.zip)

### Environment

- Longhorn version: v1.9.x-head
- Impacted volume (PV): 
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
  - Number of control plane nodes in the cluster:
  - Number of worker nodes in the cluster:
- Node config
  - OS type and version:
  - Kernel version:
  - CPU per node:
  - Memory per node:
  - Disk type (e.g. SSD/NVMe/HDD):
  - Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:


### Additional context

- https://github.com/longhorn/longhorn/issues/5120
- https://github.com/longhorn/longhorn/issues/6389

### Workaround and Mitigation

_No response_
