
Conversation

@michaelasp
Collaborator

Replicates an issue seen when reapplying the same deployment for the same resource claim.
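
A rough sketch of the shape of the scenario being exercised (the file name is a placeholder; the test name and label are taken from the output later in this thread):

    @test "reapply pod with dummy resource claim" {
      # Hypothetical outline: create a deployment whose pod claims the dummy
      # device, delete it, then re-apply the same deployment and expect the
      # new pod to become ready again.
      kubectl apply -f "$BATS_TEST_DIRNAME"/reapply-deployment.yaml
      kubectl wait --timeout=30s --for=condition=ready pods -l app=reapplyApp
      kubectl delete -f "$BATS_TEST_DIRNAME"/reapply-deployment.yaml
      kubectl apply -f "$BATS_TEST_DIRNAME"/reapply-deployment.yaml
      kubectl wait --timeout=30s --for=condition=ready pods -l app=reapplyApp
    }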

@michaelasp
Collaborator Author

On a side note, I'm not a huge fan of declaring new devices for each test like this. Should we clean up the dummy devices in between tests?

@gauravkghildiyal
Member

Thanks for looking into this @michaelasp

The current test is failing at line 266 which seems to be the first pod creation:

kubectl wait --timeout=30s --for=condition=ready pods -l app=MyApp

Aren't we trying to repro failure after the delete and recreate? Is this perhaps a different failure of our e2e repro?

@michaelasp
Collaborator Author

Hmm, strange, it didn't fail at that point locally for me. Let me trigger a rerun, but it may be due to the fact that I commented out the other tests for a faster run.

e2e.bats
 ✗ reapply pod with dummy resource claim
   (in test file tests/e2e.bats, line 281)
     `kubectl wait --timeout=30s --for=condition=ready pods -l app=MyApp' failed
   deviceclass.resource.k8s.io/multinic created
   resourceclaimtemplate.resource.k8s.io/phy-interfaces-template created
   deployment.apps/server-deployment created
   pod/server-deployment-5f5bfc84cf-dxncz condition met

@gauravkghildiyal
Member

On a side note, I'm not a huge fan of declaring new devices for each test like this. Should we clean up the dummy devices in between tests?

Yes I completely agree. Some of this work is actually called out in #137

@gauravkghildiyal
Member

Still failing at 266. I think it's because we are using the same pod labels as the previous test, and that has some lingering impact?

@michaelasp
Collaborator Author

Still failing at 266. I think it's because we are using the same pod labels as the previous test, and that has some lingering impact?

I thought delete waited for the resources to go away, but to be safe I changed the test to use a different label and deployment name in case that was the issue. I think we may need to add some waits between tests to let things settle, or figure out what exactly is causing these intermittent failures.
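
A minimal sketch of what such a settling step between tests could look like (the deployment name, selector, and timeout are illustrative, borrowed from the earlier output rather than taken from the actual test):

    # Hypothetical guard before the next test starts: make sure the previous
    # test's pods are really gone, not just marked for deletion.
    kubectl delete deployment server-deployment --ignore-not-found
    kubectl wait --for=delete pods -l app=MyApp --timeout=60s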

@michaelasp
Collaborator Author

Now it's repro'd

# (in test file tests/e2e.bats, line 281)
#   `kubectl wait --timeout=30s --for=condition=ready pods -l app=reapplyApp' failed

@gauravkghildiyal
Member

gauravkghildiyal commented Jun 27, 2025

Thanks for the repro @michaelasp

Checking the logs, the issue here stems from the fact that we don't (and most likely cannot properly) recreate the exact dummy interface after it is assigned to the first pod and then moved back to the root namespace.

  1. Our test creates a dummy interface and manually assigns an IP.
  2. Pod 1 is created and the dummy interface is moved into its network namespace. No DHCP is done because the interface already has an IP, i.e. the following is NOT executed:

       if len(podCfg.Network.Interface.Addresses) == 0 {
           klog.V(2).Infof("trying to get network configuration via DHCP")
           ip, routes, err := getDHCP(ifName)

  3. The test deletes Pod 1 and the dummy interface is moved back to the root namespace.
  4. Our driver does NOT reassign an IP to the interface. Since this is a dummy interface, there is no device driver that takes care of re-configuring it, so it remains without an IP.
  5. Pod 2 is created claiming this dummy interface.
  6. This time around, because the interface does not have an IP, we try a DHCP request, which could take a long time. If it finishes before the 45-second timeout (even if it fails), the pod gets scheduled. If it does not finish within the 45-second timeout, the kubelet could retry.

While there definitely is scope for improvement in our code, in this scenario I'd say this is a byproduct of using a dummy interface, and our tests likely need to take care of re-configuring the address after the dummy interface is recycled from Pod 1.
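
A minimal sketch of the kernel behavior behind steps 2-4, using throwaway names (dummy0 and ns1 are placeholders): moving an interface between network namespaces flushes its addresses, so nothing puts the IP back when the interface returns.

    # Illustrative only; run as root on a scratch machine.
    ip link add dummy0 type dummy
    ip addr add 192.168.50.2/24 dev dummy0
    ip netns add ns1
    ip link set dummy0 netns ns1                 # kernel flushes the address here
    ip netns exec ns1 ip addr show dummy0        # no IPv4 address any more
    ip netns exec ns1 ip link set dummy0 netns 1 # move back to the root namespace
    ip addr show dummy0                          # still no address; it must be re-added
    ip link del dummy0; ip netns del ns1         # cleanup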

@aojea
Collaborator

aojea commented Jun 28, 2025

The reason the addresses are not brought back is that it can cause conflicts with the host namespace; I can imagine that inside the pod people may add different addresses, but this is a theory and we need more feedback on it. Usually host interfaces must be managed by systemd when they come back to the root namespace: https://gist.github.com/aojea/e5787586a08313df51234e4d0c147df1

In the meantime, I think DHCP should be opt-in (#143) and we can revisit later.

Regarding the e2e test, absolutely. bats also has setup() and teardown() hooks that allow setting up the interface per test instead of creating one inside each test; those are shortcuts I took out of laziness that need to be fixed.
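
A minimal sketch of what those hooks could look like in e2e.bats (the interface name and address are placeholders, not the repo's actual values):

    # Hypothetical per-test lifecycle; dummy0 and the address are illustrative.
    setup() {
      ip link add dummy0 type dummy
      ip addr add 192.168.50.2/24 dev dummy0
      ip link set dummy0 up
    }

    teardown() {
      # Best-effort cleanup so the next test starts from a clean slate.
      ip link del dummy0 2>/dev/null || true
    }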

@michaelasp
Collaborator Author

Ah thanks for the RCA @gauravkghildiyal, I think #143 helps with the long delay but we still need to reconfigure the IP once it goes back into the host namespace. I'll add that for this test then. Let's discuss more on what this behavior should be.
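
A minimal sketch of that re-configuration step in the test (the label, interface name, and address are illustrative): once the first pod is gone and the interface is back in the root namespace, put an address back on it before re-applying the deployment.

    # Hypothetical step between deleting Pod 1 and re-applying the deployment;
    # dummy0 and the address are placeholders.
    kubectl wait --for=delete pods -l app=reapplyApp --timeout=60s
    ip addr add 192.168.50.2/24 dev dummy0
    ip link set dummy0 up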

@michaelasp michaelasp force-pushed the testreapply branch 2 times, most recently from 0eb61d9 to 1d9e4a9 Compare June 30, 2025 22:58
@gauravkghildiyal
Member

Thanks! The test looks good to me. We can merge if this is no longer a WIP.

@aojea aojea marked this pull request as ready for review July 1, 2025 09:24
@aojea aojea changed the title WIP: Test reapply of pods for the same resource claim Test reapply of pods for the same resource claim Jul 1, 2025
@aojea aojea merged commit 3ef50d4 into google:main Jul 1, 2025
6 checks passed