enable heap dumps #7328
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Hi @hernandanielg. Thanks for your PR. I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: hernandanielg. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing |
Force-pushed from cc7fb40 to b7de844.
|
@hernandanielg Thanks for the PR. |
|
@hernandanielg are you still interested in pursuing this PR? |
|
yes! sorry I've been busy lately but will continue on this today 👋🏻 |
Force-pushed from b7de844 to a3895ca.
|
hey @sohankunkerkar As you may see I have added configuration option and reload function I know it's just a matter to execute I am trying to understand the code and find where and when it's going to be executed on runtime, maybe somewhere around here? If you can guide me on this it would be highly appreciated! Looking forward |
|
@hernandanielg thanks for the changes. I think you need to add the flag details to |
pkg/config/reload.go
Outdated
// ReloadEnableHeapDump updates the DecryptionKeysPath with the provided
// `newConfig`.
func (c *Config) ReloadEnableHeapDump(newConfig *Config) {
@hernandanielg, wasn't this meant to be a command-line argument only?
The rationale behind this logic is to allow the user to change the location of the heap dump by reloading the config. So, we have to change the config option to a string type, rather than a boolean.
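A minimal sketch of what a reloadable, string-typed option could look like. This is not the code from this PR: the `Config` struct, the `HeapDumpDir` field, and the `ReloadHeapDumpDir` method are hypothetical names chosen for illustration, and the validation rules (absolute path, existing directory) are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Config is a hypothetical stand-in for the real configuration struct.
// The HeapDumpDir field name is an assumption made for this sketch.
type Config struct {
	HeapDumpDir string
}

// ReloadHeapDumpDir copies HeapDumpDir from newConfig into c, but only
// after checking that the new value is an absolute path to an existing
// directory. On error, c is left unchanged.
func (c *Config) ReloadHeapDumpDir(newConfig *Config) error {
	if newConfig.HeapDumpDir == c.HeapDumpDir {
		return nil // nothing changed
	}
	if !filepath.IsAbs(newConfig.HeapDumpDir) {
		return fmt.Errorf("heap dump dir %q is not an absolute path", newConfig.HeapDumpDir)
	}
	info, err := os.Stat(newConfig.HeapDumpDir)
	if err != nil {
		return fmt.Errorf("stat heap dump dir: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("heap dump dir %q is not a directory", newConfig.HeapDumpDir)
	}
	c.HeapDumpDir = newConfig.HeapDumpDir
	return nil
}

func main() {
	c := &Config{}
	if err := c.ReloadHeapDumpDir(&Config{HeapDumpDir: os.TempDir()}); err != nil {
		fmt.Println("reload failed:", err)
		return
	}
	fmt.Println("heap dump dir:", c.HeapDumpDir)
}
```

Validating before assigning means a bad value in a reloaded config file leaves the previous, working location in place.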
@sohankunkerkar, that makes sense. But, at the same time, what's wrong with using /tmp?
We aren't doing anything special for the goroutine dump, per:
Lines 40 to 48 in c4afa28
func writeCrioGoroutineStacks() {
	path := filepath.Join("/tmp", fmt.Sprintf(
		"crio-goroutine-stacks-%s.log",
		strings.ReplaceAll(time.Now().Format(time.RFC3339), ":", ""),
	))
	if err := utils.WriteGoroutineStacksToFile(path); err != nil {
		logrus.Warnf("Failed to write goroutine stacks: %s", err)
	}
}
Users should guarantee that they have access to /tmp and that there is enough disk space available there.
We could make the goroutine dump a first-class citizen and include an endpoint for it, for example, supporting both /dump/heap and /dump/threads. Both could have their output location and file name pattern configured via crio.conf.
A few thoughts follow. Just thinking out loud.
I personally don't like this feature, as it can be abused as a vector for a potential DoS attack: calling this endpoint repeatedly could lead to disk space exhaustion, not to mention that there is a pause every time a heap dump has to be taken, so there is a real impact on the application too. There is no authentication (basic or TLS certificate based) for these sorts of endpoints.
You could also then say, "oh, but it's only exposed via localhost!". Well, then, how useful is this new feature, given that the user can enable the option to expose pprof endpoints via the Unix socket, which is far more secure in contrast?
The ability to dump goroutines by sending SIGUSR1 to CRI-O is somewhat guarded by the need to have proper permissions to send a signal to a different process, so the surface area open for abuse is a bit smaller.
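For comparison with `writeCrioGoroutineStacks` quoted above, a heap counterpart might look roughly like the sketch below. It is not code from this PR: `writeHeapDump` and the `crio-heap-*.pprof` file name are hypothetical, and it assumes a pprof heap profile (via `runtime/pprof.WriteHeapProfile`) is what is wanted rather than `runtime/debug.WriteHeapDump`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
	"strings"
	"time"
)

// writeHeapDump is a hypothetical helper that writes a pprof heap profile
// into dir and returns the path of the file it created. The file name
// pattern mirrors the goroutine dump naming and is an assumption.
func writeHeapDump(dir string) (string, error) {
	path := filepath.Join(dir, fmt.Sprintf(
		"crio-heap-%s.pprof",
		strings.ReplaceAll(time.Now().Format(time.RFC3339), ":", ""),
	))
	// O_EXCL refuses to overwrite an existing file; 0o600 keeps the dump
	// readable by the owner only.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return "", fmt.Errorf("create heap dump file: %w", err)
	}
	defer f.Close()

	runtime.GC() // get up-to-date statistics before profiling
	if err := pprof.WriteHeapProfile(f); err != nil {
		return "", fmt.Errorf("write heap profile: %w", err)
	}
	return path, nil
}

func main() {
	path, err := writeHeapDump(os.TempDir())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote", path)
}
```

Because the timestamp has one-second resolution and the file is opened with `O_EXCL`, a second request within the same second fails instead of overwriting the first dump.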
Appreciate your insights on this. Using /tmp does have its conveniences, and the goroutine dump feature, as it stands, has some considerations, especially regarding potential abuse as a vector for DoS attacks. Your points about disk space exhaustion and the impact on application performance are well taken.
I wonder if there's an opportunity to strike a balance—perhaps by providing more secure alternatives for configuring the dump location and pattern in crio.conf, and simultaneously addressing the potential risks associated with frequent calls to the endpoint. Additionally, considering authentication measures for these endpoints could enhance security.
It might be beneficial to discuss this with @mrunalp, the original contributor to this feature. Understanding the initial vision and intentions behind the pull request would provide valuable context, and we can collectively explore the best approach that aligns with both security and usability considerations.
@mrunalp, could we dive into a discussion about the considerations around the goroutine dump feature and its potential impact on security and performance?
generally, if a malicious user has access to the crio socket, they can do harm in a lot of different ways. admins should ensure only trusted users can access nodes. For instance, a user could DOS CRI-O by having systemd shut it off. Since this and the goroutine profiles are both debugging options, I think it's safe to guess a user who has access can be trusted to not DOS the node's memory
let's mirror the other USR/HUP signal handling for now and have cri-o write to disk in an opinionated structure. we can evaluate other methods in the future and rewrite all of them
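The signal-driven approach suggested here could be sketched as follows. This is only an illustration under assumptions: `handleDumpSignals` is a hypothetical helper, and SIGUSR2 is an arbitrary choice for the example, not necessarily the signal CRI-O would use. The sketch is Unix-only.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// handleDumpSignals calls dump once for every signal received on sigs and
// returns when sigs is closed. Errors are reported but do not stop the loop.
func handleDumpSignals(sigs <-chan os.Signal, dump func() error) {
	for sig := range sigs {
		if err := dump(); err != nil {
			fmt.Fprintf(os.Stderr, "failed to write dump on %v: %v\n", sig, err)
		}
	}
}

func main() {
	// A buffer of one: signal.Notify never blocks on a full channel, so
	// signals that arrive while a request is already pending are dropped
	// rather than queued without bound.
	sigs := make(chan os.Signal, 1)
	// SIGUSR2 is an assumption made for this sketch.
	signal.Notify(sigs, syscall.SIGUSR2)

	handled := make(chan struct{}, 1)
	go handleDumpSignals(sigs, func() error {
		fmt.Println("dump requested")
		handled <- struct{}{}
		return nil
	})

	// Simulate an operator running `kill -USR2 <pid>`.
	if err := syscall.Kill(os.Getpid(), syscall.SIGUSR2); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	<-handled

	signal.Stop(sigs) // no more deliveries after Stop returns
	close(sigs)
}
```

Taking the channel and the dump function as parameters keeps the loop independent of the signal package, so it can be exercised without sending real signals.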
@hernandanielg, we've had some internal conversations about how to move forward with this feature.
There is, of course, still an opportunity to complete this work here, so please be sure to bear with us a little bit. 😄
That said, I would like to bring everyone's attention to the following discussion:
Ok looking forward ;)
hey any news about this?
@hernandanielg, sorry for the delay!
A decision on how to proceed has stalled, so it seems. Apologies!
|
@hernandanielg are you still working on this? if not, @shipra101 could take it over |
|
Yes, I'm sorry for the delay, please gimme until tomorrow to update this PR |
Force-pushed from a3895ca to 0b9b296.
Signed-off-by: Hernan Garcia <[email protected]>
Force-pushed from 0b9b296 to 4a618b6.
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7328      +/-   ##
==========================================
- Coverage   48.68%   48.67%   -0.02%
==========================================
  Files         145      145
  Lines       15918    15940      +22
==========================================
+ Hits         7749     7758       +9
- Misses       7235     7246      +11
- Partials      934      936       +2
|
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
A friendly reminder that this PR had no activity for 30 days. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This will help with the memory debugging process
Which issue(s) this PR fixes:
Fixes #7307
Special notes for your reviewer:
Does this PR introduce a user-facing change?
This PR will add a --enable-heap-dump flag to enable heap dumps to a specific file