@rphillips rphillips commented Oct 30, 2020

What type of PR is this?
/kind flake

What this PR does / why we need it:
There is a race between the gRPC server coming up and the subsequent dial call on the Unix socket. This fix retries the dial within the stub, which gives the server time to start so that the dial call can succeed.

=== RUN   TestDevicePluginReRegistrationProbeMode
I0904 23:07:54.382417 2813190 fake_topology_manager.go:29] [fake topologymanager] NewFakeManager
W0904 23:07:54.382652 2813190 manager.go:596] Failed to retrieve checkpoint for "kubelet_internal_checkpoint": checkpoint is not found
I0904 23:07:54.386357 2813190 plugin_manager.go:114] Starting Kubelet Plugin Manager
I0904 23:07:54.390968 2813190 device_plugin_stub.go:131] Starting to serve on /tmp/device_plugin394710885/device-plugin.sock
I0904 23:07:54.393796 2813190 device_plugin_stub.go:153] GetInfo
E0904 23:07:54.394527 2813190 goroutinemap.go:150] Operation for "/tmp/device_plugin394710885/server.sock" failed. No retries permitted until 2020-09-04 23:07:54.894365704 +0000 UTC m=+0.587686396 (durationBeforeRetry 500ms). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /tmp/device_plugin394710885/server.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"
I0904 23:07:54.401946 2813190 device_plugin_stub.go:227] ListAndWatch
I0904 23:07:54.408602 2813190 device_plugin_stub.go:131] Starting to serve on /tmp/device_plugin394710885/device-plugin.sock.new
E0904 23:07:55.390620 2813190 endpoint.go:107] listAndWatch ended unexpectedly for device plugin fake-domain/resource with error rpc error: code = Canceled desc = grpc: the client connection is closing
I0904 23:07:55.394203 2813190 device_plugin_stub.go:153] GetInfo
E0904 23:07:55.395579 2813190 goroutinemap.go:150] Operation for "/tmp/device_plugin394710885/server.sock" failed. No retries permitted until 2020-09-04 23:07:56.395409336 +0000 UTC m=+2.088730022 (durationBeforeRetry 1s). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /tmp/device_plugin394710885/server.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"
I0904 23:07:55.402515 2813190 device_plugin_stub.go:153] GetInfo
I0904 23:07:55.407069 2813190 device_plugin_stub.go:227] ListAndWatch
I0904 23:07:55.410949 2813190 device_plugin_stub.go:227] ListAndWatch
I0904 23:07:55.416241 2813190 device_plugin_stub.go:131] Starting to serve on /tmp/device_plugin394710885/device-plugin.sock.third
    manager_test.go:222: 
        	Error Trace:	manager_test.go:222
        	Error:      	Not equal: 
        	            	expected: 1
        	            	actual  : 2
        	Test:       	TestDevicePluginReRegistrationProbeMode
        	Messages:   	Devices of previous registered should be removed
--- FAIL: TestDevicePluginReRegistrationProbeMode (1.04s)

Which issue(s) this PR fixes:

Fixes #94547

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/cc @sjenning @liggitt

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note-none Denotes a PR that doesn't merit a release note. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 30, 2020
Member

rather than a hard wait that makes every test that calls this take five seconds longer, can we PollImmediate on the dial attempt until it succeeds (with a backstop timeout that returns the dial error encountered)?

Member Author

Thank you for the review. Fixed!

@rphillips rphillips force-pushed the fixes/device_plugin_stub_race branch from 79c0e7b to 44d4b52 Compare October 30, 2020 14:38
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 30, 2020
@rphillips rphillips force-pushed the fixes/device_plugin_stub_race branch 2 times, most recently from eac5682 to a6d906e Compare October 30, 2020 14:56
@rphillips
Member Author

@sjenning and I collectively agreed on a 1 second interval... Let me know if you want anything different.

@sjenning
Contributor

/approve
@liggitt can you add lgtm if 1s is good for the interval?

@sjenning
Contributor

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rphillips, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2020
@liggitt
Member

liggitt commented Oct 30, 2020

1s is fine, left a comment about how to handle the dial failure case

@rphillips rphillips force-pushed the fixes/device_plugin_stub_race branch from a6d906e to e8897dc Compare October 30, 2020 16:23
@liggitt
Member

liggitt commented Oct 30, 2020

compilepkg: error running subcommand: exit status 2
/bazel-scratch/.cache/bazel/_bazel_root/cae228f2a89ef5ee47c2085e441a3561/sandbox/linux-sandbox/1959/execroot/io_k8s_kubernetes/pkg/kubelet/cm/devicemanager/device_plugin_stub.go:131:24: cannot assign *grpc.ClientConn to conn (type *"net".Conn) in multiple assignment:
	*"net".Conn is pointer to interface, not interface

oops...
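
The compile error above comes from a result variable whose declared type (a pointer to the `net.Conn` interface) does not match what the dial call returns. A small sketch of the usual pattern follows, assuming the variable is declared outside a retry closure and assigned inside it. `clientConn`, `dial`, and `connect` are hypothetical stand-ins so the example needs no gRPC dependency; they are not the identifiers used in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// clientConn is a hypothetical stand-in for a concrete client connection type.
type clientConn struct{ target string }

// dial is a hypothetical stand-in for a dial helper returning a concrete pointer.
func dial(target string) (*clientConn, error) {
	if target == "" {
		return nil, errors.New("empty target")
	}
	return &clientConn{target: target}, nil
}

// connect declares conn with exactly the type that dial returns, then assigns
// to it from inside a closure with plain `=` so the outer variable is not shadowed.
func connect(target string) (*clientConn, error) {
	var conn *clientConn // must match dial's return type, not a pointer to an interface
	var err error
	attempt := func() bool {
		conn, err = dial(target)
		return err == nil
	}
	if !attempt() {
		return nil, err
	}
	return conn, nil
}

func main() {
	conn, err := connect("unix:///tmp/example.sock")
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("connected to", conn.target)
}
```

Interfaces in Go are normally passed by value, so a pointer to an interface such as `*net.Conn` is rarely what is wanted, which is what the compiler message hints at.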

There is a race between the server coming up and the subsequent dial on
the socket. Fix the race with a PollImmediate retry.
@rphillips rphillips force-pushed the fixes/device_plugin_stub_race branch from e8897dc to 4fdfbc7 Compare October 30, 2020 16:42
@rphillips
Member Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@rphillips
Member Author

@liggitt could you re-review... thank you!

@liggitt
Member

liggitt commented Nov 2, 2020

/lgtm

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

[Flaky unit test] TestDevicePluginReRegistrationProbeMode: Devices of previous registered should be removed
4 participants