E2E storage: more tests for different pod/node combinations #72002
Conversation
I am not sure whether it makes sense to enable this test for all test patterns. A full matrix (all tests for all configurations) of course reduces the chance of missing some situation where the test fails, but it also increases the overall test runtime. For now I am following the "testsuites" approach where everything is enabled for all test patterns.
/hold
Waiting for other PRs to get merged first.
Force-pushed from 0f8ce56 to e719947.
/retest
Force-pushed from 68f9671 to 376a633.
@msau42 I ended up using the approach where the provisioning test determines which nodes it has first, and then picks nodes by name. I think it is simpler this way and it automatically handles the case where the node selector only matches a single node.
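A minimal sketch of that approach, assuming a recent client-go (the helper names and the poll loop here are illustrative, not the PR's actual code): wait until the first pod has been scheduled, remember its node, and pin later pods to that node by setting the node name directly.

```go
package example

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scheduledNodeName polls until the pod has been assigned to a node and
// returns that node's name.
func scheduledNodeName(ctx context.Context, cs kubernetes.Interface, namespace, podName string) (string, error) {
	deadline := time.Now().Add(5 * time.Minute)
	for {
		pod, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return "", fmt.Errorf("get pod %s/%s: %w", namespace, podName, err)
		}
		if pod.Spec.NodeName != "" {
			return pod.Spec.NodeName, nil
		}
		if time.Now().After(deadline) {
			return "", fmt.Errorf("pod %s/%s was not scheduled in time", namespace, podName)
		}
		time.Sleep(2 * time.Second)
	}
}

// pinToNode pins a pod to a specific node by name. Because the name comes from
// a pod that was already scheduled, this also works when a node selector
// restricted scheduling to a single node.
func pinToNode(pod *v1.Pod, nodeName string) {
	pod.Spec.NodeName = nodeName
}
```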
/test pull-kubernetes-e2e-kops-aws
Force-pushed from 376a633 to efd2dff.
Force-pushed from efd2dff to 3033bf6.
/test pull-kubernetes-e2e-kops-aws
1 similar comment
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-gce-csi-serial
/hold
I need to check why
Even if snapshots are supported by the driver interface, the driver or suite might still want to skip a particular test, so those checks still need to be executed.
This addresses the two remaining change requests from kubernetes#69036:
- replace "csi-hostpath-v0" name check with capability check (cleaner that way)
- add feature tag to "should create snapshot with defaults" because that is an alpha feature

Signed-off-by: Patrick Ohly <[email protected]>
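As an illustration of the capability-check idea, a minimal sketch with made-up types (capability and driverInfo stand in for the test suites' real driver description; only ginkgo.Skip is a real call):

```go
package example

import (
	"fmt"

	"github.com/onsi/ginkgo/v2"
)

// capability and driverInfo are illustrative stand-ins for the test suites'
// real driver description types.
type capability string

const capSnapshotDataSource capability = "snapshotDataSource"

type driverInfo struct {
	Name         string
	Capabilities map[capability]bool
}

// skipUnlessSnapshotSupported shows the capability-check pattern: instead of
// matching on a driver name such as "csi-hostpath-v0", the test asks whether
// the driver declares the capability and skips itself otherwise.
func skipUnlessSnapshotSupported(d driverInfo) {
	if !d.Capabilities[capSnapshotDataSource] {
		ginkgo.Skip(fmt.Sprintf("driver %q does not support snapshots", d.Name))
	}
}
```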
There is no need to check for empty strings; we can directly initialize structs with the value. The end result is the same when the value is empty (an empty string in the struct).
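A small Go sketch of that simplification, using a hypothetical nodeSelection struct rather than the PR's real types:

```go
package example

// nodeSelection is a hypothetical config struct, for illustration only.
type nodeSelection struct {
	Name     string
	Selector map[string]string
}

// withEmptyCheck shows the old pattern: guard against empty values before assigning.
func withEmptyCheck(nodeName string, nodeSelector map[string]string) nodeSelection {
	var sel nodeSelection
	if nodeName != "" {
		sel.Name = nodeName
	}
	if len(nodeSelector) > 0 {
		sel.Selector = nodeSelector
	}
	return sel
}

// direct shows the simplification: initialize the struct directly; empty inputs
// simply leave the fields at their zero values, which is the same end result.
func direct(nodeName string, nodeSelector map[string]string) nodeSelection {
	return nodeSelection{
		Name:     nodeName,
		Selector: nodeSelector,
	}
}
```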
When the provisioning test gets stuck, the log fills up with messages about waiting for a certain pod to run. Now the pod names are pvc-[volume-tester|snapshot]-[writer|reader] plus the random number appended by Kubernetes. This makes it easier to see where the test is stuck.
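For illustration, a sketch of how such role-specific names can be produced with the usual client-go types (the pod spec is simplified and not the suite's actual one): a descriptive GenerateName prefix plus the random suffix that the API server appends on creation.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// testerPod returns a pod whose name starts with a descriptive prefix such as
// "pvc-volume-tester-writer-"; the API server appends a random suffix on
// creation, so log messages about the pod immediately show its role in the test.
func testerPod(role string, claim *v1.PersistentVolumeClaim) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "pvc-volume-tester-" + role + "-",
		},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "volume-tester",
				Image: "busybox",
				VolumeMounts: []v1.VolumeMount{{
					Name:      "test-volume",
					MountPath: "/mnt/test",
				}},
			}},
			Volumes: []v1.Volume{{
				Name: "test-volume",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
						ClaimName: claim.Name,
					},
				},
			}},
		},
	}
}
```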
TestDynamicProvisioning had multiple ways of choosing additional checks:
- the PvCheck callback
- the builtin write/read check controlled by a boolean
- the snapshot testing

Complicating matters further, that builtin write/read test had been made more customizable with new fields `NodeSelector` and `ExpectUnschedulable` which were only set by one particular test (see kubernetes#70941).

That is confusing and will only get more confusing when adding more checks in the future. Therefore the write/read check is now a separate function that must be enabled explicitly by tests that want to run it. The snapshot checking is also defined only for the snapshot test.

The test that expects unschedulable pods now also checks for that particular situation itself. Instead of testing it with two pods (the behavior from the write/read check) that both fail to start, only a single unschedulable pod is created.

Because node name, node selector and `ExpectUnschedulable` were only used for checking, it is possible to simplify `StorageClassTest` by removing all of these fields.

Expect(err).NotTo(HaveOccurred()) is an anti-pattern in Ginkgo testing because a test failure doesn't explain what failed (see kubernetes#34059). We avoid it now by making the check function itself responsible for checking errors and including more information in those checks.
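The error-reporting point can be shown with a small Ginkgo/Gomega sketch (createTesterPod is a hypothetical stand-in; framework.ExpectNoError is assumed to be the e2e framework's helper):

```go
package example

import (
	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"

	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.It("reports what actually failed", func() {
	err := createTesterPod() // hypothetical operation under test

	// Anti-pattern: on failure this only prints "Expected error ... not to have
	// occurred" without saying which step of the test broke.
	gomega.Expect(err).NotTo(gomega.HaveOccurred())

	// Better: the check itself carries the context of the failed operation.
	framework.ExpectNoError(err, "while creating the volume tester pod")

	// Plain Gomega also accepts an optional description for the same effect.
	gomega.Expect(err).NotTo(gomega.HaveOccurred(), "while creating the volume tester pod")
})

// createTesterPod stands in for the real test step.
func createTesterPod() error { return nil }
```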
Force-pushed from ffc483a to f9a4124.
Found it. Introducing the snapshot short-circuited some checks in
/hold cancel
/test pull-kubernetes-e2e-gce-alpha-features
@msau42 all tests are clean now in this PR and the right tests ran:
Passed:
Skipped:
/assign @msau42
/approve
Just one nit from me.
I was originally concerned that we would be losing test coverage by changing the RWCheck to only launch pods on a single node, but all the volume drivers tested in volume_provisioning.go should be covered in drivers/in_tree.go.
I think it's also ok for the regional_pd.go tests that use RWCheck to only check a single node because we will get multi-node coverage through the failover test cases.
// persistent across pods.
//
// This is a common test that can be called from a StorageClassTest.PvCheck.
func PVWriteReadCheck(client clientset.Interface, claim *v1.PersistentVolumeClaim, volume *v1.PersistentVolume, node NodeSelection) {
Can you rename this to PVWriteReadSingleNodeCheck to be more obvious about what the function is doing?
Done.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Whether the read test after writing was done on the same node was random for drivers that weren't locked onto a single node. Now it is deterministic: it always happens on the same node. The case with reading on another node is covered separately for test configurations that support it (not locked onto a single node, more than one node in the test cluster). As before, the TestConfig.ClientNodeSelector is ignored by the provisioning testsuite.
This is a special case that both kubelet and the volume driver should support, because users might expect it. One Kubernetes mechanism to deploy pods like this is via pod affinity. However, strictly speaking the CSI spec does not allow this usage mode (see container-storage-interface/spec#150) and there is an on-going debate to enable it (see container-storage-interface/spec#178). Therefore this test gets skipped unless explicitly enabled for a driver. CSI drivers which create a block device for a remote volume in NodePublishVolume fail this test. They have to make the volume available in NodeStageVolume and then in NodePublishVolume merely do a bind mount (as for example in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L150).
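A rough sketch of that staging/publishing split, assuming the CSI Go bindings (mountDevice and bindMount are hypothetical placeholders for driver-specific mount code, loosely following the pattern of the GCE PD driver linked above):

```go
package example

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a sketch of a CSI node plugin; a real driver implements the
// full csi.NodeServer interface, only the two relevant calls are shown here.
type nodeServer struct{}

// NodeStageVolume makes the volume available once per node: the block device
// is attached, formatted if needed, and mounted at the staging path.
func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	if err := mountDevice(req.GetVolumeId(), req.GetStagingTargetPath()); err != nil {
		return nil, err
	}
	return &csi.NodeStageVolumeResponse{}, nil
}

// NodePublishVolume is then a cheap per-pod operation: it only bind-mounts the
// staged path into the pod's target path, so several pods on the same node can
// share the volume without the driver touching the block device again.
func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	if err := bindMount(req.GetStagingTargetPath(), req.GetTargetPath()); err != nil {
		return nil, err
	}
	return &csi.NodePublishVolumeResponse{}, nil
}

// mountDevice and bindMount are placeholders for the driver-specific mount logic.
func mountDevice(volumeID, stagingPath string) error { return nil }
func bindMount(source, target string) error          { return nil }
```

A driver that instead creates the block device in NodePublishVolume cannot share the volume between pods on the same node, which is exactly the situation the new test catches.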
The driver should support multiple pods using the same volume on the same node.
Force-pushed from f9a4124 to ecc0c4e.
@pohly: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
/lgtm
This is conceptually cleaner and enables running multiple pods on the same node with the same volume. However, that particular use case is a bit contentious and upstream hasn't agreed on whether it should be supported (see the discussion in kubernetes/kubernetes#72002).
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
There was no test for running multiple pods at the same time on the same node, sharing the same volume. Such a test is relevant for CSI drivers which need to implement NodeStageVolume for this kind of usage to work. Also not covered was accessing data on one node after writing it on another.
Special notes for your reviewer:
In order to implement the new test cleanly, the existing test infrastructure and other tests get refactored first.
Does this PR introduce a user-facing change?:
/sig storage
/cc @msau42