Commit 8a713ef

[ML] Better messaging regarding OOM process termination (#2841)
This PR provides a more detailed message when a process is terminated with SIGKILL. On Linux, the OOM (Out Of Memory) killer terminates processes, chosen by heuristics, when the OS runs low on memory. Our native processes (apart from controller) are configured so that they are chosen first for termination in such a situation. The OOM killer terminates processes with SIGKILL (signal 9). SIGKILL cannot be caught or handled by a process and results in immediate termination, leaving no opportunity to log the situation. However, the parent process - controller - can detect and report the death of its children.

Relates elastic/ml-team#1158
1 parent 51eda36 commit 8a713ef
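
The commit message does not say how the native processes are made preferred OOM victims. On Linux, one common mechanism is for a process to raise its own `oom_score_adj` value under `/proc`. The sketch below assumes that mechanism purely for illustration; `setOomScoreAdj` and its `path` parameter are hypothetical names, not taken from this repository.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the ml-cpp sources): write an OOM score
// adjustment for the current process. Linux accepts values from -1000
// (exempt from the OOM killer) to 1000 (strongly preferred victim).
// The path parameter exists only so the function can be pointed at an
// ordinary file when experimenting.
bool setOomScoreAdj(int value,
                    const std::string& path = "/proc/self/oom_score_adj") {
    if (value < -1000 || value > 1000) {
        return false; // outside the range the kernel accepts
    }
    std::ofstream out(path);
    if (!out.is_open()) {
        return false; // e.g. no /proc filesystem on this platform
    }
    out << value << '\n';
    out.flush();
    return out.good();
}
```

A process that called `setOomScoreAdj(500)` early in startup would become a more likely victim than processes left at the default of 0. Raising the value needs no special privilege, whereas lowering it generally requires `CAP_SYS_RESOURCE`.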

2 files changed: +19 -4 lines changed

docs/CHANGELOG.asciidoc

Lines changed: 6 additions & 0 deletions
@@ -42,6 +42,12 @@
 
 * Update Linux build images to Rocky Linux 8 with gcc 13.3. (See {ml-pull}2773[#2773].)
 
+== {es} version 8.19.0
+
+=== Enhancements
+
+* Better messaging regarding OOM process termination. (See {ml-pull}2841[#2841].)
+
 == {es} version 8.18.0
 
 === Enhancements

lib/core/CDetachedProcessSpawner.cc

Lines changed: 13 additions & 4 deletions
@@ -185,13 +185,22 @@ class CTrackerThread : public CThread {
         // at a lower level
         LOG_INFO(<< "Child process with PID " << pid
                  << " was terminated by signal " << signal);
-    } else {
+    } else if (signal == SIGKILL) {
         // This should never happen if the system is working
         // normally - possible reasons are the Linux OOM
-        // killer, manual intervention and bugs that cause
-        // access violations
+        // killer or manual intervention. The latter is highly unlikely
+        // if running in the cloud.
+        LOG_ERROR(<< "Child process with PID " << pid << " was terminated by signal 9 (SIGKILL)."
+                  << " This is likely due to the OOM killer."
+                  << " Please check system logs for more details.");
+    } else {
+        // This should never happen if the system is working
+        // normally - possible reasons are bugs that cause
+        // access violations or manual intervention. The latter is highly unlikely
+        // if running in the cloud.
         LOG_ERROR(<< "Child process with PID " << pid
-                  << " was terminated by signal " << signal);
+                  << " was terminated by signal " << signal
+                  << " Please check system logs for more details.");
     }
 } else {
     int exitCode = WEXITSTATUS(status);
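
The branching in this diff can be tried outside the ML codebase with plain POSIX calls. The following is a minimal sketch, not the project's implementation: `describeChildStatus` and `killChildAndDescribe` are hypothetical names, the messages are returned as strings instead of going through `LOG_ERROR`, and the generic branch adds a period before "Please check", which the diff's message does not have.

```cpp
#include <cerrno>
#include <csignal>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: turn a waitpid() status into a message, giving
// SIGKILL its own branch as the diff above does.
std::string describeChildStatus(pid_t pid, int status) {
    const std::string prefix = "Child process with PID " + std::to_string(pid);
    if (WIFSIGNALED(status)) {
        const int sig = WTERMSIG(status);
        if (sig == SIGKILL) {
            return prefix + " was terminated by signal 9 (SIGKILL)."
                            " This is likely due to the OOM killer."
                            " Please check system logs for more details.";
        }
        return prefix + " was terminated by signal " + std::to_string(sig) +
               ". Please check system logs for more details.";
    }
    if (WIFEXITED(status)) {
        return prefix + " exited with exit code " +
               std::to_string(WEXITSTATUS(status));
    }
    return prefix + " changed state without terminating";
}

// Hypothetical demo: fork a child that only waits for signals, send it
// SIGKILL from the parent, reap it and describe what happened.
std::string killChildAndDescribe() {
    const pid_t pid = ::fork();
    if (pid < 0) {
        return "fork failed";
    }
    if (pid == 0) {
        for (;;) {
            ::pause(); // child: sleep until a signal arrives
        }
    }
    ::kill(pid, SIGKILL); // the child cannot catch, block or log this
    int status = 0;
    while (::waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR) {
            return "waitpid failed";
        }
    }
    return describeChildStatus(pid, status);
}
```

Only the parent can produce this message, because the killed child gets no chance to run any code. The parent also cannot tell from the wait status alone whether the OOM killer or an operator sent the signal, which is why the message says "likely" and points to the system logs.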
