E2E storage: more tests for different pod/node combinations #72002
Conversation
I am not sure whether it makes sense to enable this test for all test patterns. A full matrix (all tests for all configurations) of course reduces the chance of missing some situation where the test fails, but it also increases the overall test runtime. For now I am following the "testsuites" approach where everything is enabled for all test patterns.
/hold
Waiting for other PRs to get merged first.
Force-pushed from 0f8ce56 to e719947.
/retest
Force-pushed from 68f9671 to 376a633.
@msau42 I ended up using the approach where the provisioning test determines which nodes it has first, and then picks nodes by name. I think it is simpler this way and it automatically handles the case where the node selector only matches a single node.
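A minimal sketch of that approach, assuming a recent client-go (the helper names and the poll loop here are illustrative, not the PR's actual code): wait until the first pod has been scheduled, remember its node, and pin later pods to that node by setting the node name directly.

```go
package example

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scheduledNodeName polls until the pod has been assigned to a node and
// returns that node's name.
func scheduledNodeName(ctx context.Context, cs kubernetes.Interface, namespace, podName string) (string, error) {
	deadline := time.Now().Add(5 * time.Minute)
	for {
		pod, err := cs.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return "", fmt.Errorf("get pod %s/%s: %w", namespace, podName, err)
		}
		if pod.Spec.NodeName != "" {
			return pod.Spec.NodeName, nil
		}
		if time.Now().After(deadline) {
			return "", fmt.Errorf("pod %s/%s was not scheduled in time", namespace, podName)
		}
		time.Sleep(2 * time.Second)
	}
}

// pinToNode pins a pod to a specific node by name. Because the name comes from
// a pod that was already scheduled, this also works when a node selector
// restricted scheduling to a single node.
func pinToNode(pod *v1.Pod, nodeName string) {
	pod.Spec.NodeName = nodeName
}
```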
/test pull-kubernetes-e2e-kops-aws
Force-pushed from 376a633 to efd2dff.
Force-pushed from efd2dff to 3033bf6.
/test pull-kubernetes-e2e-kops-aws
1 similar comment
/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-gce-csi-serial
/hold
I need to check why
Even if snapshots are supported by the driver interface, the driver or suite might still want to skip a particular test, so those checks still need to be executed.
This addresses the two remaining change requests from kubernetes#69036:
- replace "csi-hostpath-v0" name check with capability check (cleaner that way)
- add feature tag to "should create snapshot with defaults" because that is an alpha feature

Signed-off-by: Patrick Ohly <[email protected]>
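As an illustration of the capability-check idea, a minimal sketch with made-up types (capability and driverInfo stand in for the test suites' real driver description; only ginkgo.Skip is a real call):

```go
package example

import (
	"fmt"

	"github.com/onsi/ginkgo/v2"
)

// capability and driverInfo are illustrative stand-ins for the test suites'
// real driver description types.
type capability string

const capSnapshotDataSource capability = "snapshotDataSource"

type driverInfo struct {
	Name         string
	Capabilities map[capability]bool
}

// skipUnlessSnapshotSupported shows the capability-check pattern: instead of
// matching on a driver name such as "csi-hostpath-v0", the test asks whether
// the driver declares the capability and skips itself otherwise.
func skipUnlessSnapshotSupported(d driverInfo) {
	if !d.Capabilities[capSnapshotDataSource] {
		ginkgo.Skip(fmt.Sprintf("driver %q does not support snapshots", d.Name))
	}
}
```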
There is no need to check for empty strings; we can directly initialize structs with the value. The end result is the same when the value is empty (an empty string in the struct).
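A small Go sketch of that simplification, using a hypothetical nodeSelection struct rather than the PR's real types:

```go
package example

// nodeSelection is a hypothetical config struct, for illustration only.
type nodeSelection struct {
	Name     string
	Selector map[string]string
}

// withEmptyCheck shows the old pattern: guard against empty values before assigning.
func withEmptyCheck(nodeName string, nodeSelector map[string]string) nodeSelection {
	var sel nodeSelection
	if nodeName != "" {
		sel.Name = nodeName
	}
	if len(nodeSelector) > 0 {
		sel.Selector = nodeSelector
	}
	return sel
}

// direct shows the simplification: initialize the struct directly; empty inputs
// simply leave the fields at their zero values, which is the same end result.
func direct(nodeName string, nodeSelector map[string]string) nodeSelection {
	return nodeSelection{
		Name:     nodeName,
		Selector: nodeSelector,
	}
}
```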
When the provisioning test gets stuck, the log fills up with messages about waiting for a certain pod to run. Now the pod names are pvc-[volume-tester|snapshot]-[writer|reader] plus the random number appended by Kubernetes. This makes it easier to see where the test is stuck.
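For illustration, a sketch of how such role-specific names can be produced with the usual client-go types (the pod spec is simplified and not the suite's actual one): a descriptive GenerateName prefix plus the random suffix that the API server appends on creation.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// testerPod returns a pod whose name starts with a descriptive prefix such as
// "pvc-volume-tester-writer-"; the API server appends a random suffix on
// creation, so log messages about the pod immediately show its role in the test.
func testerPod(role string, claim *v1.PersistentVolumeClaim) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "pvc-volume-tester-" + role + "-",
		},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "volume-tester",
				Image: "busybox",
				VolumeMounts: []v1.VolumeMount{{
					Name:      "test-volume",
					MountPath: "/mnt/test",
				}},
			}},
			Volumes: []v1.Volume{{
				Name: "test-volume",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
						ClaimName: claim.Name,
					},
				},
			}},
		},
	}
}
```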
TestDynamicProvisioning had multiple ways of choosing additional checks:
- the PvCheck callback
- the builtin write/read check controlled by a boolean
- the snapshot testing

Complicating matters further, that builtin write/read test had been made more customizable with new fields `NodeSelector` and `ExpectUnschedulable` which were only set by one particular test (see kubernetes#70941).

That is confusing and will only get more confusing when adding more checks in the future. Therefore the write/read check is now a separate function that must be enabled explicitly by tests that want to run it. The snapshot checking is also defined only for the snapshot test.

The test that expects unschedulable pods now also checks for that particular situation itself. Instead of testing it with two pods (the behavior from the write/read check) that both fail to start, only a single unschedulable pod is created.

Because node name, node selector and `ExpectUnschedulable` were only used for checking, it is possible to simplify `StorageClassTest` by removing all of these fields.

Expect(err).NotTo(HaveOccurred()) is an anti-pattern in Ginkgo testing because a test failure doesn't explain what failed (see kubernetes#34059). We avoid it now by making the check function itself responsible for checking errors and including more information in those checks.
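The error-reporting point can be shown with a small Ginkgo/Gomega sketch (createTesterPod is a hypothetical stand-in; framework.ExpectNoError is assumed to be the e2e framework's helper):

```go
package example

import (
	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"

	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.It("reports what actually failed", func() {
	err := createTesterPod() // hypothetical operation under test

	// Anti-pattern: on failure this only prints "Expected error ... not to have
	// occurred" without saying which step of the test broke.
	gomega.Expect(err).NotTo(gomega.HaveOccurred())

	// Better: the check itself carries the context of the failed operation.
	framework.ExpectNoError(err, "while creating the volume tester pod")

	// Plain Gomega also accepts an optional description for the same effect.
	gomega.Expect(err).NotTo(gomega.HaveOccurred(), "while creating the volume tester pod")
})

// createTesterPod stands in for the real test step.
func createTesterPod() error { return nil }
```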
Force-pushed from ffc483a to f9a4124.
Found it. Introducing the snapshot short-circuited some checks in
/hold cancel
/test pull-kubernetes-e2e-gce-alpha-features
@msau42 all tests are clean now in this PR and the right tests ran:
Passed:
Skipped:
/assign @msau42
/approve
Just one nit from me.
I was originally concerned that we would be losing test coverage by changing the RWCheck to only launch pods on a single node, but all the volume drivers tested in volume_provisioning.go should be covered in drivers/in_tree.go.
I think it's also ok for the regional_pd.go tests that use RWCheck to only check a single node because we will get multi-node coverage through the failover test cases.
// persistent across pods.
//
// This is a common test that can be called from a StorageClassTest.PvCheck.
func PVWriteReadCheck(client clientset.Interface, claim *v1.PersistentVolumeClaim, volume *v1.PersistentVolume, node NodeSelection) {
Can you rename this to PVWriteReadSingleNodeCheck to be more obvious about what the function is doing?
Done.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Whether the read test after writing was done on the same node was random for drivers that weren't locked onto a single node. Now it is deterministic: it always happens on the same node. The case with reading on another node is covered separately for test configurations that support it (not locked onto a single node, more than one node in the test cluster). As before, the TestConfig.ClientNodeSelector is ignored by the provisioning testsuite.
This is a special case that both kubelet and the volume driver should support, because users might expect it. One Kubernetes mechanism to deploy pods like this is via pod affinity. However, strictly speaking the CSI spec does not allow this usage mode (see container-storage-interface/spec#150) and there is an on-going debate to enable it (see container-storage-interface/spec#178). Therefore this test gets skipped unless explicitly enabled for a driver. CSI drivers which create a block device for a remote volume in NodePublishVolume fail this test. They have to make the volume available in NodeStageVolume and then in NodePublishVolume merely do a bind mount (as for example in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L150).
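A rough sketch of that staging/publishing split, assuming the CSI Go bindings (mountDevice and bindMount are hypothetical placeholders for driver-specific mount code, loosely following the pattern of the GCE PD driver linked above):

```go
package example

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a sketch of a CSI node plugin; a real driver implements the
// full csi.NodeServer interface, only the two relevant calls are shown here.
type nodeServer struct{}

// NodeStageVolume makes the volume available once per node: the block device
// is attached, formatted if needed, and mounted at the staging path.
func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	if err := mountDevice(req.GetVolumeId(), req.GetStagingTargetPath()); err != nil {
		return nil, err
	}
	return &csi.NodeStageVolumeResponse{}, nil
}

// NodePublishVolume is then a cheap per-pod operation: it only bind-mounts the
// staged path into the pod's target path, so several pods on the same node can
// share the volume without the driver touching the block device again.
func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	if err := bindMount(req.GetStagingTargetPath(), req.GetTargetPath()); err != nil {
		return nil, err
	}
	return &csi.NodePublishVolumeResponse{}, nil
}

// mountDevice and bindMount are placeholders for the driver-specific mount logic.
func mountDevice(volumeID, stagingPath string) error { return nil }
func bindMount(source, target string) error          { return nil }
```

A driver that instead creates the block device in NodePublishVolume cannot share the volume between pods on the same node, which is exactly the situation the new test catches.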
The driver should support multiple pods using the same volume on the same node.
Force-pushed from f9a4124 to ecc0c4e.
@pohly: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
/lgtm
This is conceptually cleaner and enables running multiple pods on the same node with the same volume. However, that particular use case is a bit contentious and upstream hasn't agreed on whether it should be supported (see the discussion in kubernetes/kubernetes#72002).
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
There was no test for running multiple pods at the same time on the same node, sharing the same volume. Such a test is relevant for CSI drivers which need to implement NodeStageVolume for this kind of usage to work. Also not covered was accessing data on one node after writing it on another.
Special notes for your reviewer:
In order to implement the new test cleanly, the existing test infrastructure and other tests get refactored first.
Does this PR introduce a user-facing change?:
/sig storage
/cc @msau42