Conversation

@haircommander
Member

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

I've wrapped a few fixes into one PR, as they're all related to cleaning up after failed container creations. More details are in each commit.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed a bug where a container creation failure caused that container to leak in the runtime

@openshift-ci-robot openshift-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Sep 15, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2020
@codecov

codecov bot commented Sep 15, 2020

Codecov Report

Merging #4198 into master will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4198      +/-   ##
==========================================
- Coverage   38.59%   38.57%   -0.02%     
==========================================
  Files         111      111              
  Lines        8893     8897       +4     
==========================================
  Hits         3432     3432              
- Misses       5077     5081       +4     
  Partials      384      384              

@fidencio
Contributor

@haircommander, could this be linked to #4000?

@haircommander
Member Author

yes that certainly does seem plausible, good thought

@fidencio
Contributor

yes that certainly does seem plausible, good thought

Sadly this PR doesn't solve that issue mentioned.
I'll do a formal review tomorrow.

@umohnani8
Member

LGTM

@fidencio
Contributor

The changes do not fix #4000, but they do make sense to me.
/lgtm

@nee1esh

nee1esh commented Sep 16, 2020

/retest

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2020
@nee1esh

nee1esh commented Sep 17, 2020

/retest

@TomSweeneyRedHat
Contributor

LGTM
but tests are still struggling.

@haircommander haircommander force-pushed the cleanup-fixes branch 2 times, most recently from 1392481 to 89e450e Compare September 24, 2020 20:20
@saschagrunert
Member

/retest

@umohnani8
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2020
@mrunalp
Member

mrunalp commented Oct 5, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2020
@haircommander
Member Author

/retest

Comment on lines 378 to 379
err2 := s.StorageRuntimeServer().DeleteContainer(containerInfo.ID)
if err2 != nil {
Collaborator

Suggested change
err2 := s.StorageRuntimeServer().DeleteContainer(containerInfo.ID)
if err2 != nil {
err := s.StorageRuntimeServer().DeleteContainer(containerInfo.ID); if err != nil {

defer func() {
if retErr != nil {
log.Infof(ctx, "createCtr: removing container ID %s from runtime", ctr.ID())
if err2 := s.Runtime().DeleteContainer(newContainer); err2 != nil {
Collaborator

Suggested change
if err2 := s.Runtime().DeleteContainer(newContainer); err2 != nil {
if err := s.Runtime().DeleteContainer(newContainer); err != nil {

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 5, 2020
@haircommander
Member Author

(most) comments addressed, PTAL

@haircommander
Member Author

/retest

@haircommander
Member Author

# time="2020-10-05T22:33:44Z" level=fatal msg="creating container: rpc error: code = Unknown desc = Error reading blob sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4: Get \"https://quayio-production-s3.s3.amazonaws.com/sharedimages/86f0a285-6f29-47c4-a3ae-7e2c70cad0ba/layer?Signature=ad6wfYCYNa0JcT1FjR74F4fyvLk%3D&Expires=1601937794&AWSAccessKeyId=AKIAI5LUAQGPZRPNKSJA\": dial tcp 52.216.16.176:443: i/o timeout"

/retest

to improve readability

Signed-off-by: Peter Hunt <[email protected]>
otherwise, we leak containers in runc

Signed-off-by: Peter Hunt <[email protected]>
@haircommander
Member Author

@kolyshkin
Collaborator

time="2020-10-05T22:33:44Z" level=fatal msg="creating container: rpc error: code = Unknown desc = Error reading blob sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4: Get \"https://quayio-production-s3.s3.amazonaws.com/sharedimages/86f0a285-6f29-47c4-a3ae-7e2c70cad0ba/layer?Signature=ad6wfYCYNa0JcT1FjR74F4fyvLk%3D&Expires=1601937794&AWSAccessKeyId=AKIAI5LUAQGPZRPNKSJA\": dial tcp 52.216.16.176:443: i/o timeout"

Most (but not all) of the images we use in bats tests are prefetched and copied to every crio test root, but it still re-checks for updates (if I get it right). @haircommander @mrunalp what do you think about introducing a flag to crio to not re-check an image from the network if it's already there, and use it for such tests (and only for tests)? This should result in fewer flakes like the one above (and slightly faster tests, but that's not the point).

Collaborator

@kolyshkin kolyshkin left a comment

LGTM

(I've checked and yes indeed, there's no functional change in the first commit)

@haircommander
Member Author

haircommander commented Oct 14, 2020

time="2020-10-05T22:33:44Z" level=fatal msg="creating container: rpc error: code = Unknown desc = Error reading blob sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4: Get \"https://quayio-production-s3.s3.amazonaws.com/sharedimages/86f0a285-6f29-47c4-a3ae-7e2c70cad0ba/layer?Signature=ad6wfYCYNa0JcT1FjR74F4fyvLk%3D&Expires=1601937794&AWSAccessKeyId=AKIAI5LUAQGPZRPNKSJA\": dial tcp 52.216.16.176:443: i/o timeout"

Most (but not all) of the images we use in bats tests are prefetched and copied to every crio test root, but it still re-checks for updates (if I get it right). @haircommander @mrunalp what do you think about introducing a flag to crio to not re-check an image from the network if it's already there, and use it for such tests (and only for tests)? This should result in fewer flakes like the one above (and slightly faster tests, but that's not the point).

I believe the problem is actually containers/container-libs#267
IIRC, the binary that we use to cache the images, copyimg, doesn't save some metadata:

# time="2020-06-04 14:46:24.280309104Z" level=debug msg="Image config digest is empty, re-pulling image" file="server/image_pull.go:145" id=15bfe6da-0d11-4b25-9d24-d41de555cd46 name=/runtime.v1alpha2.ImageService/PullImage
# time="2020-06-04 14:46:24.280379949Z" level=debug msg="Image in store has different ID, re-pulling quay.io/crio/redis:alpine" file="server/image_pull.go:162" id=15bfe6da-0d11-4b25-9d24-d41de555cd46 name=/runtime.v1alpha2.ImageService/PullImage

source

when really, the problem is that the metadata isn't needed per se, because that specific layer is an empty one

I never got a full grasp of the problem though
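
To make the two debug messages above easier to read, here is a rough sketch of the kind of decision they suggest: a pull goes back to the registry when a config digest is missing or when the stored and remote digests differ. This is an assumption-laden illustration, not the code in server/image_pull.go; the function `needsRepull`, its parameters, and the choice to treat an empty digest on either side as a re-pull are all hypothetical.

```go
package main

import "fmt"

// needsRepull is a hypothetical helper. It reports whether a pull would go
// back to the registry and gives a reason modelled on the quoted debug logs.
// Treating an empty digest on either side as "re-pull" is an assumption.
func needsRepull(storedConfigDigest, remoteConfigDigest string) (bool, string) {
	if storedConfigDigest == "" || remoteConfigDigest == "" {
		// Nothing to compare against, so the cached copy cannot be trusted.
		return true, "image config digest is empty, re-pulling image"
	}
	if storedConfigDigest != remoteConfigDigest {
		return true, "image in store has different ID, re-pulling"
	}
	return false, "image already in store, skipping pull"
}

func main() {
	// A cached image whose metadata was not saved (empty digest).
	repull, reason := needsRepull("", "sha256:aaa")
	fmt.Println(repull, reason)

	// A cached image whose digest matches the registry.
	repull, reason = needsRepull("sha256:aaa", "sha256:aaa")
	fmt.Println(repull, reason)
}
```

Under this reading, a prefetched image that lacks the saved metadata always takes the first branch, so every test run touches the network even though the image is already on disk.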

Member

@saschagrunert saschagrunert left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 14, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kolyshkin, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haircommander
Member Author

TASK [clone kubernetes source repo] ********************************************
task path: /go/src/github.com/cri-o/cri-o/contrib/test/integration/build/kubernetes.yml:3
fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["/usr/bin/git", "fetch", "--tags", "origin"], "msg": "Failed to download remote objects and refs:  error: RPC failed; curl 18 transfer closed with outstanding read data remaining\nfatal: the remote end hung up unexpectedly\n"}

seems like a git flake
/retest

@openshift-merge-robot openshift-merge-robot merged commit fb658da into cri-o:master Oct 14, 2020

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.

10 participants