Conversation

@hernandanielg

What type of PR is this?

/kind feature

What this PR does / why we need it:

This will help with the memory debugging process.

Which issue(s) this PR fixes:

Fixes #7307

Special notes for your reviewer:

Does this PR introduce a user-facing change?

This PR adds an --enable-heap-dump flag that enables heap dumps to a specific file.


@openshift-ci
Contributor

openshift-ci bot commented Sep 22, 2023

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 22, 2023
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 22, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 22, 2023

Hi @hernandanielg. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@openshift-ci
Contributor

openshift-ci bot commented Sep 22, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hernandanielg
Once this PR has been reviewed and has the lgtm label, please assign nalind for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Sep 22, 2023
@hernandanielg hernandanielg marked this pull request as draft September 22, 2023 12:30
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2023
@sohankunkerkar
Member

@hernandanielg Thanks for the PR.
You are on the right track. There are still a few places where you need to make changes. Did you see this PR to understand the actual logic? The idea is to dump the heap-related information if --enable-heap-dump is mentioned. Have you gone through this example PR to understand how to add a command line flag to the crio config?

@sohankunkerkar
Member

@hernandanielg are you still interested in pursuing this PR?

@hernandanielg
Author

yes! sorry I've been busy lately but will continue on this today 👋🏻

@hernandanielg
Author

hey @sohankunkerkar

As you may see, I have added a configuration option and a reload function.

I know it's just a matter of executing debug.WriteHeapDump(f.Fd()), as done here.

I am trying to understand the code and find where and when it is going to be executed at runtime; maybe somewhere around here?

If you can guide me on this it would be highly appreciated!

Looking forward

@sohankunkerkar
Member

@hernandanielg thanks for the changes. I think you need to add the flag details to docs/ as well. As far as I know, you can keep the logic in server/inspect.go and call it only when that flag is enabled.


// ReloadEnableHeapDump updates the heap dump setting with the provided
// `newConfig`.
func (c *Config) ReloadEnableHeapDump(newConfig *Config) {
Contributor

@hernandanielg, wasn't this meant to be a command-line argument only?

Member

The rationale behind this logic is to allow the user to change the location of the heap dump by reloading the config. So, we have to change the config option to a string type, rather than a boolean.
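
A rough sketch of what a string-typed, reloadable option could look like. The Config type, the HeapDumpDir field, and ReloadHeapDumpDir are hypothetical names used only for illustration and are not the code in this PR; the convention that an empty string disables the feature is also an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Config is a stand-in for a daemon configuration. HeapDumpDir is a
// hypothetical string option; the empty string is assumed to mean "disabled".
type Config struct {
	HeapDumpDir string
}

// ReloadHeapDumpDir copies the heap dump location from newConfig into c.
// The new value is validated first so that a bad reload leaves c unchanged.
func (c *Config) ReloadHeapDumpDir(newConfig *Config) error {
	dir := newConfig.HeapDumpDir
	if dir != "" && !filepath.IsAbs(dir) {
		return fmt.Errorf("heap dump directory %q is not an absolute path", dir)
	}
	if c.HeapDumpDir != dir {
		c.HeapDumpDir = dir
	}
	return nil
}

func main() {
	cfg := &Config{}

	// A valid reload updates the location.
	if err := cfg.ReloadHeapDumpDir(&Config{HeapDumpDir: "/var/tmp"}); err != nil {
		fmt.Println("reload failed:", err)
	}
	fmt.Println("heap dump dir:", cfg.HeapDumpDir)

	// An invalid reload is rejected and the previous value is kept.
	if err := cfg.ReloadHeapDumpDir(&Config{HeapDumpDir: "relative/dir"}); err != nil {
		fmt.Println("reload failed:", err)
	}
	fmt.Println("heap dump dir:", cfg.HeapDumpDir)
}
```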

Contributor

@sohankunkerkar, that makes sense. But, at the same time, what's wrong with using /tmp?

We aren't doing anything special for the goroutine dump, per:

cri-o/cmd/crio/main.go

Lines 40 to 48 in c4afa28

func writeCrioGoroutineStacks() {
	path := filepath.Join("/tmp", fmt.Sprintf(
		"crio-goroutine-stacks-%s.log",
		strings.ReplaceAll(time.Now().Format(time.RFC3339), ":", ""),
	))
	if err := utils.WriteGoroutineStacksToFile(path); err != nil {
		logrus.Warnf("Failed to write goroutine stacks: %s", err)
	}
}

Users should guarantee that they have access to /tmp and that there is enough disk space available there.

We could make the goroutine dump a first-class citizen and include an endpoint for it, for example, supporting both /dump/heap and /dump/threads. Both could have their location and file name pattern configured via crio.conf.

A few thoughts follow. Just thinking out loud.

I personally don't like this feature, as it can be abused as a vector for a potential DoS attack - calling this endpoint repeatedly could lead to disk space exhaustion, not to mention that there is a pause every time a heap dump has to be taken, thus there is a real impact on the application too. There is no authentication (basic or TLS certificates based) for these sorts of endpoints.

You could also then say - "oh, but it's only exposed via localhost!". Well, then, how useful is this new feature, given that the user can enable the option to expose pprof endpoints via the Unix socket, which is far more secure by contrast?

The ability to dump goroutines by sending SIGUSR1 to CRI-O is somewhat guarded by the need for proper permissions to send a signal to a different process, so the surface area open to abuse is a bit smaller.

Member

Appreciate your insights on this. Using /tmp does have its conveniences, and the goroutine dump feature, as it stands, has some considerations, especially regarding potential abuse as a vector for DoS attacks. Your points about disk space exhaustion and the impact on application performance are well taken.

I wonder if there's an opportunity to strike a balance—perhaps by providing more secure alternatives for configuring the dump location and pattern in crio.conf, and simultaneously addressing the potential risks associated with frequent calls to the endpoint. Additionally, considering authentication measures for these endpoints could enhance security.

It might be beneficial to discuss this with @mrunalp, the original contributor to this feature. Understanding the initial vision and intentions behind the pull request would provide valuable context, and we can collectively explore the best approach that aligns with both security and usability considerations.

@mrunalp, could we dive into a discussion about the considerations around the goroutine dump feature and its potential impact on security and performance?

Member

generally, if a malicious user has access to the crio socket, they can do harm in a lot of different ways. admins should ensure only trusted users can access nodes. For instance, a user could DOS CRI-O by having systemd shut it off. Since this and the goroutine profiles are both debugging options, I think it's safe to guess a user who has access can be trusted to not DOS the node's memory

Member

let's mirror the other USR/HUP signal handling for now and have cri-o write to disk in an opinionated structure. we can evaluate other methods in the future and rewrite all of them

Contributor

@hernandanielg, we've had some internal conversations about how to move forward with this feature.

There is, of course, still an opportunity to complete this work here, so please be sure to bear with us a little bit. 😄

That said, I would like to bring everyone's attention to the following discussion:

Author

Ok looking forward ;)

Author

hey any news about this?

Contributor

@hernandanielg, sorry for the delay!

A decision on how to proceed has stalled, so it seems. Apologies!

@haircommander
Member

@hernandanielg are you still working on this? if not, @shipra101 could take it over

@hernandanielg
Author

Yes, I'm sorry for the delay, please gimme until tomorrow to update this PR

Signed-off-by: Hernan Garcia <[email protected]>
@codecov

codecov bot commented Dec 6, 2023

Codecov Report

Merging #7328 (4a618b6) into main (1e93fb4) will decrease coverage by 0.02%.
Report is 330 commits behind head on main.
The diff coverage is 45.45%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7328      +/-   ##
==========================================
- Coverage   48.68%   48.67%   -0.02%     
==========================================
  Files         145      145              
  Lines       15918    15940      +22     
==========================================
+ Hits         7749     7758       +9     
- Misses       7235     7246      +11     
- Partials      934      936       +2     

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2024
@openshift-merge-robot
Contributor

PR needs rebase.


@github-actions

github-actions bot commented Apr 1, 2024

A friendly reminder that this PR had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 1, 2024

Labels

dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an endpoint for heap dumps

5 participants