
Conversation

@michaelasp
Collaborator

Replicates an issue seen when reapplying the same deployment for the same resource claim.
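
A rough sketch of the shape of the scenario being exercised (the file name is a placeholder; the test name and label are taken from the output later in this thread):

    @test "reapply pod with dummy resource claim" {
      # Hypothetical outline: create a deployment whose pod claims the dummy
      # device, delete it, then re-apply the same deployment and expect the
      # new pod to become ready again.
      kubectl apply -f "$BATS_TEST_DIRNAME"/reapply-deployment.yaml
      kubectl wait --timeout=30s --for=condition=ready pods -l app=reapplyApp
      kubectl delete -f "$BATS_TEST_DIRNAME"/reapply-deployment.yaml
      kubectl apply -f "$BATS_TEST_DIRNAME"/reapply-deployment.yaml
      kubectl wait --timeout=30s --for=condition=ready pods -l app=reapplyApp
    }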

@michaelasp
Collaborator Author

On a side note, I'm not a huge fan of declaring new devices for each test like this. Should we clean up the dummy devices in between tests?

@gauravkghildiyal
Member

Thanks for looking into this @michaelasp

The current test is failing at line 266 which seems to be the first pod creation:

kubectl wait --timeout=30s --for=condition=ready pods -l app=MyApp

Aren't we trying to repro failure after the delete and recreate? Is this perhaps a different failure of our e2e repro?

@michaelasp
Collaborator Author

Hmm, strange, it didn't fail at that point locally for me. Let me trigger a rerun, but it may be due to the fact that I commented out the other tests for a faster run.

e2e.bats
 ✗ reapply pod with dummy resource claim
   (in test file tests/e2e.bats, line 281)
     `kubectl wait --timeout=30s --for=condition=ready pods -l app=MyApp' failed
   deviceclass.resource.k8s.io/multinic created
   resourceclaimtemplate.resource.k8s.io/phy-interfaces-template created
   deployment.apps/server-deployment created
   pod/server-deployment-5f5bfc84cf-dxncz condition met

@gauravkghildiyal
Member

On a side note, I'm not a huge fan of declaring new devices for each test like this. Should we clean up the dummy devices in between tests?

Yes I completely agree. Some of this work is actually called out in #137

@gauravkghildiyal
Member

Still failing at 266. I think it's because we are using the same pod labels as the previous test, and that has some lingering impact?

@michaelasp
Collaborator Author

Still failing at 266. I think it's because we are using the same pod labels as the previous test, and that has some lingering impact?

I thought delete waited for the resources to go away, but to be safe I changed the test to use a different label and deployment name in case that was the issue. I think we may need to add some waits between tests to let things settle, or figure out what exactly is causing these intermittent failures.
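
A minimal sketch of what such a settling step between tests could look like (the deployment name, selector, and timeout are illustrative, borrowed from the earlier output rather than taken from the actual test):

    # Hypothetical guard before the next test starts: make sure the previous
    # test's pods are really gone, not just marked for deletion.
    kubectl delete deployment server-deployment --ignore-not-found
    kubectl wait --for=delete pods -l app=MyApp --timeout=60s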

@michaelasp
Collaborator Author

Now it's repro'd

# (in test file tests/e2e.bats, line 281)
#   `kubectl wait --timeout=30s --for=condition=ready pods -l app=reapplyApp' failed

@gauravkghildiyal
Member

gauravkghildiyal commented Jun 27, 2025

Thanks for the repro @michaelasp

Checking the logs, the issue here stems from the fact that we don't (and most likely cannot properly) recreate the exact dummy interface after it is assigned to the first pod and then moved back to the root namespace.

  1. Our test creates a dummy interface and manually assigns an IP.
  2. Pod 1 is created and the dummy interface is moved into its network namespace. No DHCP is done because the interface already has an IP, i.e. the following is NOT executed:

       if len(podCfg.Network.Interface.Addresses) == 0 {
           klog.V(2).Infof("trying to get network configuration via DHCP")
           ip, routes, err := getDHCP(ifName)

  3. The test deletes Pod 1 and the dummy interface is moved back to the root namespace.
  4. Our driver does NOT reassign an IP to the interface. Since this is a dummy interface, there is no device driver that takes care of re-configuring it, so it remains without an IP.
  5. Pod 2 is created claiming this dummy interface.
  6. This time around, because the interface does not have an IP, we try a DHCP request, which could take a long time. If it finishes before the 45-second timeout (even if it fails), the pod gets scheduled. If it does not finish within the 45-second timeout, the kubelet could retry.

While there definitely is scope for improvement in our code, in this scenario I'd say this is a byproduct of using a dummy interface, and our tests likely need to take care of re-configuring the address after the dummy interface is recycled from Pod 1.
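
A minimal sketch of the kernel behavior behind steps 2-4, using throwaway names (dummy0 and ns1 are placeholders): moving an interface between network namespaces flushes its addresses, so nothing puts the IP back when the interface returns.

    # Illustrative only; run as root on a scratch machine.
    ip link add dummy0 type dummy
    ip addr add 192.168.50.2/24 dev dummy0
    ip netns add ns1
    ip link set dummy0 netns ns1                 # kernel flushes the address here
    ip netns exec ns1 ip addr show dummy0        # no IPv4 address any more
    ip netns exec ns1 ip link set dummy0 netns 1 # move back to the root namespace
    ip addr show dummy0                          # still no address; it must be re-added
    ip link del dummy0; ip netns del ns1         # cleanup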

@aojea
Collaborator

aojea commented Jun 28, 2025

The reason the addresses are not brought back is that it can cause conflicts with the host namespace; I can imagine that inside the pod people may add different addresses, but this is a theory and we need more feedback on it. Usually host interfaces must be managed by systemd when they come back to the root namespace: https://gist.github.com/aojea/e5787586a08313df51234e4d0c147df1

In the meantime, I think DHCP should be opt-in (#143) and we can revisit later.

Regarding the e2e test, absolutely. bats also has setup() and teardown() hooks that allow setting up the interface per test instead of creating one inside each test; those are shortcuts I took out of laziness that need to be fixed.
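
A minimal sketch of what those hooks could look like in e2e.bats (the interface name and address are placeholders, not the repo's actual values):

    # Hypothetical per-test lifecycle; dummy0 and the address are illustrative.
    setup() {
      ip link add dummy0 type dummy
      ip addr add 192.168.50.2/24 dev dummy0
      ip link set dummy0 up
    }

    teardown() {
      # Best-effort cleanup so the next test starts from a clean slate.
      ip link del dummy0 2>/dev/null || true
    }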

@michaelasp
Collaborator Author

Ah thanks for the RCA @gauravkghildiyal, I think #143 helps with the long delay but we still need to reconfigure the IP once it goes back into the host namespace. I'll add that for this test then. Let's discuss more on what this behavior should be.
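
A minimal sketch of that re-configuration step in the test (the label, interface name, and address are illustrative): once the first pod is gone and the interface is back in the root namespace, put an address back on it before re-applying the deployment.

    # Hypothetical step between deleting Pod 1 and re-applying the deployment;
    # dummy0 and the address are placeholders.
    kubectl wait --for=delete pods -l app=reapplyApp --timeout=60s
    ip addr add 192.168.50.2/24 dev dummy0
    ip link set dummy0 up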

@michaelasp michaelasp force-pushed the testreapply branch 2 times, most recently from 0eb61d9 to 1d9e4a9 Compare June 30, 2025 22:58
@gauravkghildiyal
Member

Thanks! The test looks good to me. We can merge if this is no longer a WIP.

@aojea aojea marked this pull request as ready for review July 1, 2025 09:24
@aojea aojea changed the title WIP: Test reapply of pods for the same resource claim Test reapply of pods for the same resource claim Jul 1, 2025
@aojea aojea merged commit 3ef50d4 into google:main Jul 1, 2025
6 checks passed