
E2E storage: more tests for different pod/node combinations #72002


Merged

Conversation

@pohly (Contributor) commented Dec 12, 2018

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

There was no test for running multiple pods at the same time on the same node, sharing the same volume. Such a test is relevant for CSI drivers, which need to implement NodeStageVolume for this kind of usage to work.

Also not covered was accessing data on one node after writing it on another.
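
A minimal sketch of the scenario (not the actual test code in this PR; the pod and claim names, the image, and the helper are placeholders):

package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podWithClaim returns a pod pinned to nodeName that mounts claimName.
func podWithClaim(name, nodeName, claimName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			NodeName: nodeName, // pin both pods onto the same node
			Containers: []v1.Container{{
				Name:         "volume-tester",
				Image:        "busybox",
				Command:      []string{"sleep", "3600"},
				VolumeMounts: []v1.VolumeMount{{Name: "shared", MountPath: "/mnt/test"}},
			}},
			Volumes: []v1.Volume{{
				Name: "shared",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: claimName},
				},
			}},
		},
	}
}

// Creating podWithClaim("tester-a", node, "shared-claim") and
// podWithClaim("tester-b", node, "shared-claim") puts two pods with the
// same volume on one node; a CSI driver then has to stage the volume once
// (NodeStageVolume) and bind-mount it per pod (NodePublishVolume).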

Special notes for your reviewer:

In order to implement the new test cleanly, the existing test infrastructure and other tests get refactored first.

Does this PR introduce a user-facing change?:

NONE

/sig storage
/cc @msau42

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Dec 12, 2018
@k8s-ci-robot k8s-ci-robot requested a review from msau42 December 12, 2018 19:40
@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Dec 12, 2018
@pohly (Contributor, author) commented Dec 12, 2018

I am not sure whether it makes sense to enable this test for all test patterns. A full matrix (all tests for all configurations) of course reduces the chance of missing some situation where the test fails, but it also increases the overall test runtime.

For now I am following the "testsuites" approach where everything is enabled for all test patterns.

@pohly (Contributor, author) commented Dec 13, 2018

/hold

Waiting for other PRs to get merged first.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 13, 2018
@pohly pohly force-pushed the storage-volume-testsuites-concurrency branch 2 times, most recently from 0f8ce56 to e719947 on December 17, 2018 20:37
@pohly (Contributor, author) commented Dec 19, 2018

/retest

@pohly pohly force-pushed the storage-volume-testsuites-concurrency branch 4 times, most recently from 68f9671 to 376a633 on December 21, 2018 12:48
@pohly (Contributor, author) commented Dec 21, 2018

@msau42 I ended up using the approach where the provisioning test determines which nodes it has first, and then picks nodes by name. I think it is simpler this way and it automatically handles the case where the node selector only matches a single node.
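
Roughly, that approach looks like this (a sketch only; GetReadySchedulableNodesOrDie is the e2e framework helper I would expect here, the surrounding logic is illustrative):

// Determine which nodes the test cluster has, then pick them by name.
nodes := framework.GetReadySchedulableNodesOrDie(client)
writerNode := nodes.Items[0].Name
readerNode := writerNode // node selector matches a single node: reuse it
if len(nodes.Items) > 1 {
	readerNode = nodes.Items[1].Name // otherwise read back on a second node
}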

@pohly pohly changed the title E2E storage: concurrent pods E2E storage: more tests for different pod/node combinations Dec 21, 2018
@pohly (Contributor, author) commented Dec 21, 2018

/test pull-kubernetes-e2e-kops-aws

@pohly pohly force-pushed the storage-volume-testsuites-concurrency branch from 376a633 to efd2dff on January 4, 2019 08:38
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 4, 2019
@pohly pohly force-pushed the storage-volume-testsuites-concurrency branch from efd2dff to 3033bf6 on January 4, 2019 19:49
@pohly (Contributor, author) commented Jan 4, 2019

/test pull-kubernetes-e2e-kops-aws

1 similar comment
@pohly (Contributor, author) commented Jan 4, 2019

/test pull-kubernetes-e2e-kops-aws

@pohly (Contributor, author) commented Feb 9, 2019

/test pull-kubernetes-e2e-gce-csi-serial

@pohly (Contributor, author) commented Feb 9, 2019

/hold

I need to check why [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic Snapshot] snapshottable should create snapshot with defaults [Feature:VolumeSnapshotDataSource] isn't getting skipped.

pohly added 5 commits February 9, 2019 14:19
Even if snapshots are supported by the driver interface, the driver or
suite might still want to skip a particular test, so those checks
still need to be executed.
This addresses the two remaining change requests from
kubernetes#69036:
- replace "csi-hostpath-v0" name check with capability
  check (cleaner that way)
- add feature tag to "should create snapshot with defaults" because
  that is an alpha feature

Signed-off-by: Patrick Ohly <[email protected]>
There is no need to check for empty strings; we can directly
initialize structs with the value. The end result is the same when the
value is empty (an empty string in the struct).
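
As an illustration of that pattern (the Config type and prefix field here are hypothetical, not code from this PR):

type Config struct {
	Prefix string
}

func newConfig(prefix string) Config {
	// No need for: if prefix != "" { c.Prefix = prefix }
	// Direct initialization produces the same empty field when prefix is empty.
	return Config{Prefix: prefix}
}
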
When the provisioning test gets stuck, the log fills up with messages
about waiting for a certain pod to run. Now the pod names are
pvc-[volume-tester|snapshot]-[writer|reader] plus the random
number appended by Kubernetes. This makes it easier to see where the
test is stuck.
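
For reference, the random suffix comes from creating the pods with a generate name (a sketch, not necessarily the PR's exact code):

// The API server appends a random suffix on creation,
// yielding names such as "pvc-volume-tester-writer-x7k2q".
pod.ObjectMeta.GenerateName = "pvc-volume-tester-writer-"
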
TestDynamicProvisioning had multiple ways of choosing additional
checks:
- the PvCheck callback
- the builtin write/read check controlled by a boolean
- the snapshot testing

Complicating matters further, that builtin write/read test had been
made more customizable with the new fields `NodeSelector` and
`ExpectUnschedulable`, which were only set by one particular test (see
kubernetes#70941).

That is confusing and will only get more confusing when adding more
checks in the future. Therefore the write/read check is now a separate
function that must be enabled explicitly by tests that want to run it.
The snapshot checking is also defined only for the snapshot test.

The test that expects unschedulable pods now also checks for that
particular situation itself. Instead of testing it with two pods (the
behavior from the write/read check) that both fail to start, only a
single unschedulable pod is created.

Because node name, node selector and the `ExpectUnschedulable` were
only used for checking, it is possible to simplify `StorageClassTest`
by removing all of these fields.

Expect(err).NotTo(HaveOccurred()) is an anti-pattern in Ginkgo testing
because a test failure doesn't explain what failed (see
kubernetes#34059). We avoid it
now by making the check function itself responsible for checking
errors and including more information in those checks.
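
A sketch of the resulting pattern, assuming a PvCheck-style callback and the framework's ExpectNoError helper; the field and helper names are illustrative rather than the exact ones in this PR:

test := testsuites.StorageClassTest{
	// ...
	PvCheck: func(claim *v1.PersistentVolumeClaim, volume *v1.PersistentVolume) {
		// The check function handles its own errors and adds context,
		// instead of a bare Expect(err).NotTo(HaveOccurred()).
		err := checkWriteRead(client, claim, volume) // hypothetical helper
		framework.ExpectNoError(err, "write/read check for claim %s", claim.Name)
	},
}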
@pohly pohly force-pushed the storage-volume-testsuites-concurrency branch from ffc483a to f9a4124 on February 9, 2019 13:20
@pohly (Contributor, author) commented Feb 9, 2019

Found it. Introducing the snapshot short-circuited some checks in skipUnsupportedTests. See f8536e8.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2019
@pohly (Contributor, author) commented Feb 9, 2019

/test pull-kubernetes-e2e-gce-alpha-features

@pohly (Contributor, author) commented Feb 10, 2019

@msau42 all tests are clean now in this PR and the right tests ran:

Passed:
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic Snapshot] snapshottable should create snapshot with defaults [Feature:VolumeSnapshotDataSource]

Skipped:
...
[sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]
...

@dims (Member) commented Feb 10, 2019

/assign @msau42

@msau42 (Member) left a comment

/approve

Just one nit from me.

I was originally concerned that we would be losing test coverage by changing the RWCheck to only launch pods on a single node, but all the volume drivers tested in volume_provisioning.go should be covered in drivers/in_tree.go.

I think it's also ok for the regional_pd.go tests that use RWCheck to only check a single node because we will get multi-node coverage through the failover test cases.

// persistent across pods.
//
// This is a common test that can be called from a StorageClassTest.PvCheck.
func PVWriteReadCheck(client clientset.Interface, claim *v1.PersistentVolumeClaim, volume *v1.PersistentVolume, node NodeSelection) {

Inline review comment (Member):
Can you rename this to PVWriteReadSingleNodeCheck to be more obvious about what the function is doing?

@pohly (Contributor, author) replied:
Done.

@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly


Previously, whether the read test after writing ran on the same node was
random for drivers that weren't locked onto a single node. Now it is
deterministic: the read always happens on the same node as the write.

The case with reading on another node is covered separately for test
configurations that support it (not locked onto a single node, more
than one node in the test cluster).

As before, the TestConfig.ClientNodeSelector is ignored by the
provisioning testsuite.
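
In other words (sketch only, assuming NodeSelection carries a Name field as in the review snippet above):

// Pin the read-after-write check to one specific node by name, which makes
// the "same node" case deterministic instead of scheduler-dependent.
PVWriteReadSingleNodeCheck(client, claim, volume, NodeSelection{Name: nodeName})

// Reading on a different node is a separate check that only runs when the
// configuration is not locked onto a single node and the cluster has more
// than one schedulable node.
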
This is a special case that both kubelet and the volume driver should
support, because users might expect it. One Kubernetes mechanism to
deploy pods like this is via pod affinity.

However, strictly speaking the CSI spec does not allow this usage
mode (see container-storage-interface/spec#150) and
there is an on-going debate to enable it (see
container-storage-interface/spec#178). Therefore
this test gets skipped unless explicitly enabled for a driver.

CSI drivers which create a block device for a remote volume in
NodePublishVolume fail this test. They have to make the volume
available in NodeStageVolume and then in NodePublishVolume merely do a
bind mount (as for example in
https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/gce-pd-csi-driver/node.go#L150).
The driver should support multiple pods using the same volume on the
same node.
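
A rough sketch of the driver-side split described above (heavily simplified; ns.mounter is assumed to implement the usual Mount(source, target, fstype, options) interface, and error handling for the staging step is omitted):

// NodeStageVolume: make the volume available once per node, e.g. attach the
// block device and mount it at the staging path (details omitted).
func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	return &csi.NodeStageVolumeResponse{}, nil
}

// NodePublishVolume: merely bind-mount the staged volume into the pod's
// target path, so multiple pods on the same node can share the volume.
func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	options := []string{"bind"}
	if req.GetReadonly() {
		options = append(options, "ro")
	}
	if err := ns.mounter.Mount(req.GetStagingTargetPath(), req.GetTargetPath(), "", options); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.NodePublishVolumeResponse{}, nil
}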
@pohly pohly force-pushed the storage-volume-testsuites-concurrency branch from f9a4124 to ecc0c4e on February 12, 2019 08:22
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 12, 2019
@k8s-ci-robot (Contributor) commented Feb 12, 2019

@pohly: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-e2e-gce-csi-serial
Commit: ffc483a8bfa3250e97555a7fa628f58f786c2c31
Rerun command: /test pull-kubernetes-e2e-gce-csi-serial


@pohly (Contributor, author) commented Feb 12, 2019

/retest

@msau42 (Member) commented Feb 12, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2019
@k8s-ci-robot k8s-ci-robot merged commit f968499 into kubernetes:master Feb 12, 2019
pohly added a commit to intel/oim that referenced this pull request Feb 19, 2019
This is conceptually cleaner and enables the running of multiple pods
on the same node with the same volume. However, that particular use
case is a bit contentious and upstream doesn't agree whether it should
be supported (see discussion of
kubernetes/kubernetes#72002).
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.