[LLDB] Ptrace seize dead processes on Linux #137041
50df550 to 8de35b8
We already have one piece of "status" parsing code in
I don't know the answer to that, but I can say that I don't think this feature needs to be (or should be) specific to this use case. One of the things that I would like to be able to do is to stop a process right before it exits (regardless of whether that's through the exit syscall, or a fatal signal, etc.). I think the tricky part is that (in both cases) the user might legitimately want to let the process exit, and "continue" is the normal way to do that, so I don't think we'd want to just error out of the continue command (or from the |
I think disallowing any non explicit continues/disconnect is a good user experience as long as we display an appropriate message. The workflow I imagine is when halted in this state any explicit
Will refactor, I looked for something for status and it seems I missed something |
3b10fcd to f1574f3
I see this is still a draft, but to avoid surprises, I want to say that I think this should be two or three patches in its final form: one for the PTRACE_SEIZE thingy, one for the "mechanism to prevent a process from resuming", and maybe (depending on how involved it gets) one for refactoring the /proc/status parser.
I'm okay with that. I'm still in the 'experiment and see what happens' phase when it comes to preventing continue. How does this proposal sound:
For #3, I think the scope is still loose on whether it should replace proc stat or be in addition to it. The biggest complexity here is that we're adding information to qProcessInfo that isn't exclusively about the process but about how we're interacting with the process. So I think tackling that as its own step makes sense.
lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
Also, if/when you commit parts of this, please include a version of the example gist in one of the commit messages. It might be useful in future.
Can you make the same thing happen without using a coredumper? I feel like the answer is a solid no but I'm not sure why. Another way we can do it is to write a test that checks that if the remote says it's in a non-resumable state, we act in a certain way. Only half the story but it's something.
@@ -1304,6 +1304,9 @@ void GDBRemoteCommunicationServerCommon::
if (!abi.empty())
  response.Printf("elf_abi:%s;", abi.c_str());
response.Printf("ptrsize:%d;", proc_arch.GetAddressByteSize());
std::optional<bool> non_resumable = proc_info.IsNonResumable();
if (non_resumable)
  response.Printf("non_resumable:%d", *non_resumable);
This is part of qProcessInfo (https://lldb.llvm.org/resources/lldbgdbremote.html#qprocessinfo) which is, I presume, only requested once because all the information is constant for the process lifetime.
At least for the situation at hand, non-resumable is also constant. The process had to get into that state somehow, but if you had been debugging it before the non-resumable point, it wouldn't have got into the non-resumable state anyway, so it makes no difference.
So unless anyone can think of a situation where non-resumable could change, this packet is probably fine.
Your intuition is correct. For now this works, but in the future (if we want to support O_TRACEEXIT), we would need to update this. Currently I can get away with attach returning a process info that we can't resume.
This is still very WIP, as I'm trying to sort out the gotchas with Greg. I will break this patch up into pieces soon :)
Cool, so something would get added to the remote protocol to make this work but exactly what we can decide later.
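To make the packet discussion above concrete, here is a minimal sketch of how a client might read the proposed key out of a qProcessInfo-style reply. It assumes the reply is a flat list of key:value; pairs, as in the Printf calls quoted above. FindValue and ParseNonResumable are hypothetical names used for illustration only, not LLDB's actual parser.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not LLDB's parser): scan a "key:value;" list for
// `key` and return its value, or nullopt if the key is absent.
std::optional<std::string> FindValue(std::string_view response,
                                     std::string_view key) {
  while (!response.empty()) {
    // Take one "key:value" field, up to the next ';' or the end of input.
    std::size_t end = response.find(';');
    std::string_view field = response.substr(0, end);
    response = (end == std::string_view::npos) ? std::string_view()
                                               : response.substr(end + 1);
    std::size_t colon = field.find(':');
    if (colon != std::string_view::npos && field.substr(0, colon) == key)
      return std::string(field.substr(colon + 1));
  }
  return std::nullopt;
}

// Three-state result: an absent key means "the server did not say".
std::optional<bool> ParseNonResumable(std::string_view response) {
  std::optional<std::string> value = FindValue(response, "non_resumable");
  if (!value)
    return std::nullopt;
  return *value == "1";
}
```

Keeping the result as an optional lets a client tell "resumable" apart from "server predates this key", which matters if the key is only ever sent once per process.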
I'd put this first (in which case it wouldn't be called "move" but "extend" or "refactor"), for two reasons:
|
8f6bb44 to 5c25880
@llvm/pr-subscribers-lldb
Author: Jacob Lalonde (Jlalond)
Changes
This is the actual PR for my SEIZE RFC. This is currently the bare bones of seizing a dead process and being able to attach and introspect with LLDB.
Additionally, right now I only check proc status before the seize, and we should double check after the seize that the process has not changed. Worth noting: once you seize a coredumping process (and it hits trace stop), Coredumping in status will report 0.
This is pretty complicated to test because it requires integration with the kernel. Thankfully, the setup only involves some very simple toy programs, which I have outlined with instructions in this gist.
Full diff: https://github.com/llvm/llvm-project/pull/137041.diff
2 Files Affected:
diff --git a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
index 7f2aba0e4eb2c..141e49d8a0b7e 100644
--- a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
+++ b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
@@ -312,10 +312,26 @@ NativeProcessLinux::Manager::Attach(
Log *log = GetLog(POSIXLog::Process);
LLDB_LOG(log, "pid = {0:x}", pid);
- auto tids_or = NativeProcessLinux::Attach(pid);
- if (!tids_or)
- return tids_or.takeError();
- ArrayRef<::pid_t> tids = *tids_or;
+ // This safety check lets us decide if we should
+ // seize or attach.
+ ProcessInstanceInfo process_info;
+ if (!Host::GetProcessInfo(pid, process_info))
+ return llvm::make_error<StringError>("Unable to read process info",
+ llvm::inconvertibleErrorCode());
+
+ std::vector<::pid_t> tids;
+ if (process_info.IsCoreDumping()) {
+ auto attached_or = NativeProcessLinux::Seize(pid);
+ if (!attached_or)
+ return attached_or.takeError();
+ tids = std::move(*attached_or);
+ } else {
+ auto attached_or = NativeProcessLinux::Attach(pid);
+ if (!attached_or)
+ return attached_or.takeError();
+ tids = std::move(*attached_or);
+ }
+
llvm::Expected<ArchSpec> arch_or =
NativeRegisterContextLinux::DetermineArchitecture(tids[0]);
if (!arch_or)
@@ -444,6 +460,88 @@ NativeProcessLinux::NativeProcessLinux(::pid_t pid, int terminal_fd,
SetState(StateType::eStateStopped, false);
}
+llvm::Expected<std::vector<::pid_t>> NativeProcessLinux::Seize(::pid_t pid) {
+ Log *log = GetLog(POSIXLog::Process);
+
+ uint64_t options = GetDefaultPtraceOpts();
+ Status status;
+ // Use a map to keep track of the threads which we have attached/need to
+ // attach.
+ Host::TidMap tids_to_attach;
+ while (Host::FindProcessThreads(pid, tids_to_attach)) {
+ for (Host::TidMap::iterator it = tids_to_attach.begin();
+ it != tids_to_attach.end();) {
+ if (it->second == true) {
+ continue;
+ }
+ lldb::tid_t tid = it->first;
+ if ((status = PtraceWrapper(PTRACE_SEIZE, tid, nullptr, (void *)options))
+ .Fail()) {
+ // No such thread. The thread may have exited. More error handling
+ // may be needed.
+ if (status.GetError() == ESRCH) {
+ it = tids_to_attach.erase(it);
+ continue;
+ }
+ if (status.GetError() == EPERM) {
+ // Depending on the value of ptrace_scope, we can return a
+ // different error that suggests how to fix it.
+ return AddPtraceScopeNote(status.ToError());
+ }
+ return status.ToError();
+ }
+
+ if ((status = PtraceWrapper(PTRACE_INTERRUPT, tid)).Fail()) {
+ // No such thread. The thread may have exited. More error handling
+ // may be needed.
+ if (status.GetError() == ESRCH) {
+ it = tids_to_attach.erase(it);
+ continue;
+ }
+ if (status.GetError() == EPERM) {
+ // Depending on the value of ptrace_scope, we can return a
+ // different error that suggests how to fix it.
+ return AddPtraceScopeNote(status.ToError());
+ }
+ return status.ToError();
+ }
+
+ int wpid =
+ llvm::sys::RetryAfterSignal(-1, ::waitpid, tid, nullptr, __WALL);
+ // Need to use __WALL otherwise we receive an error with errno=ECHLD At
+ // this point we should have a thread stopped if waitpid succeeds.
+ if (wpid < 0) {
+ // No such thread. The thread may have exited. More error handling
+ // may be needed.
+ if (errno == ESRCH) {
+ it = tids_to_attach.erase(it);
+ continue;
+ }
+ return llvm::errorCodeToError(
+ std::error_code(errno, std::generic_category()));
+ }
+
+ LLDB_LOG(log, "adding tid = {0}", tid);
+ it->second = true;
+
+ // move the loop forward
+ ++it;
+ }
+ }
+
+ size_t tid_count = tids_to_attach.size();
+ if (tid_count == 0)
+ return llvm::make_error<StringError>("No such process",
+ llvm::inconvertibleErrorCode());
+
+ std::vector<::pid_t> tids;
+ tids.reserve(tid_count);
+ for (const auto &p : tids_to_attach)
+ tids.push_back(p.first);
+
+ return std::move(tids);
+}
+
llvm::Expected<std::vector<::pid_t>> NativeProcessLinux::Attach(::pid_t pid) {
Log *log = GetLog(POSIXLog::Process);
@@ -513,8 +611,8 @@ llvm::Expected<std::vector<::pid_t>> NativeProcessLinux::Attach(::pid_t pid) {
return std::move(tids);
}
-Status NativeProcessLinux::SetDefaultPtraceOpts(lldb::pid_t pid) {
- long ptrace_opts = 0;
+uint64_t NativeProcessLinux::GetDefaultPtraceOpts() {
+ uint64_t ptrace_opts = 0;
// Have the child raise an event on exit. This is used to keep the child in
// limbo until it is destroyed.
@@ -537,6 +635,11 @@ Status NativeProcessLinux::SetDefaultPtraceOpts(lldb::pid_t pid) {
// the child finishes sharing memory.
ptrace_opts |= PTRACE_O_TRACEVFORKDONE;
+ return ptrace_opts;
+}
+
+Status NativeProcessLinux::SetDefaultPtraceOpts(lldb::pid_t pid) {
+ uint64_t ptrace_opts = GetDefaultPtraceOpts();
return PtraceWrapper(PTRACE_SETOPTIONS, pid, nullptr, (void *)ptrace_opts);
}
diff --git a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h
index d345f165a75d8..9ae4e57e74add 100644
--- a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h
+++ b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h
@@ -175,7 +175,6 @@ class NativeProcessLinux : public NativeProcessELF,
private:
Manager &m_manager;
ArchSpec m_arch;
-
LazyBool m_supports_mem_region = eLazyBoolCalculate;
std::vector<std::pair<MemoryRegionInfo, FileSpec>> m_mem_region_cache;
@@ -191,9 +190,13 @@ class NativeProcessLinux : public NativeProcessELF,
// Returns a list of process threads that we have attached to.
static llvm::Expected<std::vector<::pid_t>> Attach(::pid_t pid);
+ // Returns a list of process threads that we have seized and interrupted.
+ static llvm::Expected<std::vector<::pid_t>> Seize(::pid_t pid);
static Status SetDefaultPtraceOpts(const lldb::pid_t);
+ static uint64_t GetDefaultPtraceOpts();
+
bool TryHandleWaitStatus(lldb::pid_t pid, WaitStatus status);
void MonitorCallback(NativeThreadLinux &thread, WaitStatus status);
|
@labath @DavidSpickett Thanks for the patience. I've broken down my prototype, and this is now patch 2, where we implement the SEIZE functionality. I will follow up with making it so we can't resume the process when we're in this state. @DavidSpickett you mentioned you wanted me to include my gist; to my knowledge GitHub will include my summary by default. Did you want me to check in the toy program as an example?
You can include your gist content in the PR summary, I've certainly seen longer commit messages than that in llvm :) The question I want to be able to answer from the final commit message is "what is a dead process and how do I make one?", so that in future if I need to investigate this code, I know where to start. You can do that by linking to well established documentation on the subject, or writing it in your own words, and/or including that example. If we are able to construct a test case, that would serve the same purpose.
Also FYI this has CI failures, https://buildkite.com/llvm-project/github-pull-requests/builds/177176#0196b180-e7c0-49e9-99ac-65dcc8a3c1a9, not sure if you are aware. LLDB isn't known for being super stable in pre-commit CI, but these failures are on the same topic as this PR.
5c25880 to 68077c2
@DavidSpickett fixed the test, I gotcha'd myself checking an optional of a bool. I added my gist to the description, let me know what you think.
This part looks good, that'll be enough to test this / explain why it exists.
I leave the discussion of race conditions, or the lack of them, to @labath.
auto attached_or = NativeProcessLinux::Attach(pid);
if (!attached_or)
  return attached_or.takeError();
tids = std::move(*attached_or);
Can you refactor this to have only one set of if (!... ... tids = lines? Maybe not because the auto would resolve to different things.
Maybe you can do:
auto attached_or = is_core_dumping ? NativeProcessLinux::Seize(pid) : NativeProcessLinux::Attach(pid);
uint64_t options = GetDefaultPtraceOpts();
Status status;
// Use a map to keep track of the threads which we have attached/need to
// attach.
seized / need to seize
In fact, why does this say "attach" so much? Are we attaching by seizing, or are seize and attach two completely different things?
return status.ToError();
}

if ((status = PtraceWrapper(PTRACE_INTERRUPT, tid)).Fail()) {
Shame that we can't do these two PtraceWrapper calls like:
if (status = PtraceWrapper(PTRACE_SEIZE).Fail() || status = PtraceWrapper(PTRACE_INTERRUPT).Fail())
  <the error handling bit>
Maybe something with , could do it, I'm not too familiar with that.
Another way to do it is:
status = PtraceWrapper(PTRACE_SEIZE);
if (!status.Fail())
  status = PtraceWrapper(PTRACE_INTERRUPT);
if (status.Fail())
  <common error handling>
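A minimal, self-contained sketch of that second shape follows. Status here is a stand-in struct rather than lldb_private::Status, and the two callbacks stand in for the PtraceWrapper(PTRACE_SEIZE, ...) and PtraceWrapper(PTRACE_INTERRUPT, ...) calls; SeizeResult and SeizeAndInterrupt are hypothetical names, so treat this as an illustration of the control flow only.

```cpp
#include <cerrno>
#include <functional>

// Stand-in for lldb_private::Status; only what the sketch needs.
struct Status {
  int error = 0;
  bool Fail() const { return error != 0; }
};

enum class SeizeResult { Seized, ThreadGone, Error };

// `seize` and `interrupt` are placeholders for the two ptrace requests.
SeizeResult SeizeAndInterrupt(const std::function<Status()> &seize,
                              const std::function<Status()> &interrupt) {
  Status status = seize();
  if (!status.Fail())
    status = interrupt(); // Only attempted once the seize succeeded.
  if (!status.Fail())
    return SeizeResult::Seized;
  // Common error handling shared by both requests.
  if (status.error == ESRCH)
    return SeizeResult::ThreadGone; // Thread exited; caller drops the tid.
  return SeizeResult::Error;
}
```

The caller then needs a single switch on the result instead of two copies of the ESRCH/EPERM branches.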

int wpid =
    llvm::sys::RetryAfterSignal(-1, ::waitpid, tid, nullptr, __WALL);
// Need to use __WALL otherwise we receive an error with errno=ECHLD At
Move this comment to before wpid =. Usual practice is to comment before doing something.
    llvm::sys::RetryAfterSignal(-1, ::waitpid, tid, nullptr, __WALL);
// Need to use __WALL otherwise we receive an error with errno=ECHLD At
// this point we should have a thread stopped if waitpid succeeds.
if (wpid < 0) {
This could be if (llvm::sys.....__WALL) < 0), since wpid is not used outside of this check.
  it = tids_to_attach.erase(it);
  continue;
}
return llvm::errorCodeToError(
If you are about to return, is there any need to update tids_to_attach?
Same applies to the code above. If tids_to_attach is a copy local to this function, we don't need to clean it up before returning.
it->second = true;

// move the loop forward
++it;
This loop is erasing map entries as it goes, is this safe?
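For reference, a small sketch of the erase-while-iterating pattern, assuming the tid map behaves like a std::map (TidMap below is a local alias, and SeizePending and try_seize are hypothetical names). Since C++11, std::map::erase(iterator) returns the iterator following the removed element and invalidates only iterators to that element, so the pattern is safe as long as every path through the loop body either erases or advances.

```cpp
#include <functional>
#include <map>

using TidMap = std::map<int, bool>; // tid -> "already handled" flag

// Marks each pending tid as handled, or erases it when `try_seize`
// reports that the thread is gone.
void SeizePending(TidMap &tids, const std::function<bool(int)> &try_seize) {
  for (auto it = tids.begin(); it != tids.end();) {
    if (it->second) {
      ++it; // Already handled: still has to advance before continuing.
      continue;
    }
    if (!try_seize(it->first)) {
      it = tids.erase(it); // erase() hands back the next valid iterator.
      continue;
    }
    it->second = true;
    ++it;
  }
}
```

A bare continue on an already-handled entry, without advancing the iterator, would revisit the same element forever, so that branch is worth a second look whenever the outer loop can run more than once.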
[I'm sorry, this is going to be long]
Most of David's comments are there because you copied the existing attach implementation. Even if we go down the path of different attach methods, I don't think it's necessary to copy all of that. Actual attaching is a small part of that function. The rest of the stuff, which deals with looping, gathering thread ids, and error messages, could be handled by common code.
I say if because it's not clear to me that we've given up on using the common code path we discussed on the rfc thread. I think it could look something like this:
ptrace(PTRACE_SEIZE, pid, 0, (void*)(PTRACE_O_TRACEEXIT));
syscall(SYS_tgkill, pid, pid, SIGSTOP);
ptrace(PTRACE_INTERRUPT, pid, 0, 0);
int status;
waitpid(pid, &status, __WALL);
if (status>>8 == (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
ptrace(PTRACE_CONT, pid, 0, 0);
waitpid(pid, &status, __WALL);
}
If you run this on a "regular" process you get this sequence of events:
ptrace(PTRACE_SEIZE, 31169, NULL, PTRACE_O_TRACEEXIT) = 0
tgkill(31169, 31169, SIGSTOP) = 0
ptrace(PTRACE_INTERRUPT, 31169) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=31169, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], __WALL, NULL) = 31169
.. or this one:
ptrace(PTRACE_SEIZE, 31169, NULL, PTRACE_O_TRACEEXIT) = 0
tgkill(31169, 31169, SIGSTOP) = 0
ptrace(PTRACE_INTERRUPT, 31169) = 0
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}|PTRACE_EVENT_STOP<<16], __WALL, NULL) = 31169
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=31169, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
ptrace(PTRACE_CONT, 31169, NULL, 0) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=31169, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], __WALL, NULL) = 31169
The difference is in whether the kernel is fast enough to notice the SIGSTOP before it PTRACE_INTERRUPTS the process (the fact that this results in nondeterministic behavior might be worth filing a bug report for the kernel). With this simple code, I usually get the second version, but this is really a matter of microseconds. Even inserting usleep(0) between the SIGSTOP and PTRACE_INTERRUPT calls is enough to make the process immediately stop with SIGSTOP.
However, that doesn't matter. After this code is done, the process is stopped with a SIGSTOP, just like it was in the PTRACE_ATTACH case. For a "dead" process, the sequence is slightly different:
ptrace(PTRACE_SEIZE, 31169, NULL, PTRACE_O_TRACEEXIT) = 0
tgkill(31169, 31169, SIGSTOP) = 0
ptrace(PTRACE_INTERRUPT, 31169) = 0
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}|PTRACE_EVENT_EXIT<<16], __WALL, NULL) = 31169
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=31169, si_uid=1000, si_status=SIGTRAP, si_utime=0, si_stime=0} ---
In this case you always get a PTRACE_EVENT_EXIT stop.
The nicest part about this is that it could be used to distinguish between the two process states, without the core dumping flag, the proc filesystem, or anything else. I think it's theoretically possible to catch a live thread while it's exiting and get a PTRACE_EVENT_EXIT, but getting that for all threads in the process is very unlikely.
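As a rough illustration of how the wait status alone could drive that distinction, here is a sketch of a classifier. It relies on the layout described in ptrace(2), where the event code sits above the stop signal (status >> 16). The event values are copied from <linux/ptrace.h> so the sketch does not depend on a particular ptrace header, and StopKind and ClassifyWaitStatus are hypothetical names, not existing LLDB code.

```cpp
#include <csignal>
#include <sys/wait.h>

// Event codes as defined in <linux/ptrace.h>, repeated here on purpose.
constexpr int kPtraceEventExit = 6;   // PTRACE_EVENT_EXIT
constexpr int kPtraceEventStop = 128; // PTRACE_EVENT_STOP

enum class StopKind { NotStopped, ExitEvent, EventStop, SignalStop };

// Classify a status returned by waitpid(..., __WALL) for a seized thread.
StopKind ClassifyWaitStatus(int status) {
  if (!WIFSTOPPED(status))
    return StopKind::NotStopped;
  int event = status >> 16; // ptrace event code, 0 for a plain signal stop
  if (WSTOPSIG(status) == SIGTRAP && event == kPtraceEventExit)
    return StopKind::ExitEvent; // The "dead"/exiting case from the trace above.
  if (event == kPtraceEventStop)
    return StopKind::EventStop; // Result of PTRACE_INTERRUPT on a live thread.
  return StopKind::SignalStop;  // e.g. the SIGSTOP delivered via tgkill.
}
```

A process whose threads all report ExitEvent would be the "dead" case; anything else would follow the regular attach path.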
Or, instead of making this a process-wide property, we could make this a property of a specific thread. We could say that a thread that is in an "exiting" state is not allowed to run expressions. And this dead process would just be a special case of a process which happens to have all threads in the exiting state.
And the nicest part about that is that this is extremely close to being testable. We no longer need a core-dumping process; we just need a process that's stopped in a PTRACE_EVENT_EXIT. Right now, we don't have a way to do that, but we're really close to that. One way to do that would be to let the user request stopping a thread at exit. If we implement all of the rest, this will be mainly a UX question of how we want to expose this to the user. I think we could make some sort of an experimental setting for that, or even just some secret packet that we send with the process plugin packet command.
What do you say to all this? I realize I am sort of asking you to implement my pet feature, but I think that by framing your feature (debugging of dead processes) as a special case of this (debugging exiting processes), we can avoid most of the problematic issues with this PR (divergence of implementation, impossibility of testing).

std::vector<::pid_t> tids;
// IsCoreDumping is an optional, so check for value then true/false.
if (process_info.IsCoreDumping() && *process_info.IsCoreDumping()) {
- if (process_info.IsCoreDumping() && *process_info.IsCoreDumping()) {
+ if (process_info.IsCoreDumping() == true) {
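The suggested form works because comparing a std::optional<bool> against a bool is true only when the optional holds a value and that value matches. A tiny sketch, with ShouldSeize as a hypothetical helper name:

```cpp
#include <optional>

// Hypothetical helper mirroring the check in the suggestion above.
bool ShouldSeize(std::optional<bool> is_core_dumping) {
  // True only when a value is present AND that value is true.
  // Note: `if (is_core_dumping)` alone would test presence, not the value.
  return is_core_dumping == true;
}
```

An empty optional therefore takes the regular attach path, the same as an explicit false.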
I'm in! I always thought this was a great segue for |
This is the actual PR for my SEIZE RFC. This is currently the bare bones of seizing a dead process and being able to attach and introspect with LLDB.
Additionally, right now I only check proc status before the seize, and we should double check after the seize that the process has not changed. Worth noting: once you seize a coredumping process (and it hits trace stop), Coredumping in status will report 0.
This is pretty complicated to test because it requires integration with the kernel. Thankfully, the setup only involves some very simple toy programs.
We will want to create a program to be our 'coredumper', which then holds the pipe until we send a CONT.
Then, once we compile/build our test coredumper, we will need to configure the kernel to pipe to our core holder.
This sets the kernel to pipe to our program, with %p getting replaced with the pid.
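The exact command from the gist is not reproduced here. As a hedged sketch of the mechanism: per core(5), a core_pattern value that starts with '|' names a program the kernel pipes the core dump to, and %p in that value expands to the pid of the dumping process. MakeCorePattern and the handler path used in the note below are hypothetical, chosen only for illustration.

```cpp
#include <string>

// Builds a core_pattern value that pipes core dumps to `handler_path`
// (which should be an absolute path) and passes the crashing pid as %p.
std::string MakeCorePattern(const std::string &handler_path) {
  return "|" + handler_path + " %p";
}
```

Writing the resulting string, for example the value built from a hypothetical /tmp/coredumper, to /proc/sys/kernel/core_pattern as root would install it until the pattern is changed again or the machine reboots.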
Lastly, we need a program to throw a signal.
Then we can attach lldb to this core-dumping program!