Conversation

@Jlalond (Contributor) commented Apr 23, 2025

This is the actual PR for my SEIZE RFC. It is currently the bare bones of seizing a dead process and being able to attach and introspect with LLDB.

Additionally, right now I only check proc status before the seize, and we should double-check after the seize that the process has not changed. Worth noting: once you seize a coredumping process (and it hits trace-stop), the CoreDumping field in status will report 0.

This is pretty complicated to test because it requires integration with the kernel; thankfully, the setup only involves some very simple toy programs.

We will want to create a program to be our 'coredumper' that holds the kernel's core pipe open until we send it a SIGCONT.

ptrace.c
// We use this coredumper to hold the kernel-provided core pipe open
// so we can test whether lldb can attach to the dead process.
// To kill this program, run:
//   sudo pkill -CONT ptrace
#include <signal.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  raise(SIGSTOP);      // hold the core pipe open until we receive SIGCONT
  close(STDIN_FILENO); // release the pipe the kernel handed us on stdin
  raise(SIGSTOP);      // stop again so the process lingers instead of exiting
  return 0;
}

Once we compile our test coredumper, we need to configure the kernel to pipe core dumps to it:

set_up_ptrace.sh
# One word of warning: to my knowledge this path needs to be < 128 bytes
# (the kernel's core_pattern buffer, CORENAME_MAX_SIZE, is 128 bytes).
echo "|{path_to_our_toy_coredumper} %p" | sudo tee /proc/sys/kernel/core_pattern > /dev/null

This tells the kernel to pipe core dumps to our program: the core image arrives on the program's stdin, and %p is replaced with the pid of the crashing process.

Lastly, we need a program that raises a fatal signal.

sigabrt.c
// Our mini crashing program, I call it sigabrt
int main() {
  volatile int x = 42; // the meaning of life
  volatile int y = 0;  // volatile keeps the compiler from folding the division
  int result = x / y;  // integer division by zero raises SIGFPE, which core dumps
  (void)result;
  return 0;
}

Then we can attach lldb to this core-dumping program!

@labath (Collaborator) commented Apr 24, 2025

We already have one piece of "status" parsing code in source/Host/linux/Host.cpp. I think it'd be better to reuse that one. I'm slightly torn as to whether to reuse Host::GetProcessInfo for this (and add a new field to ProcessInstanceInfo -- or possibly expand on IsZombie), or whether to create a new linux-specific entry point which will return this data.

One caveat that I need to address before we publish this PR is how to prevent LLDB from running any expressions, or really anything that tries to SIGCONT, because that will immediately terminate the process. I would like this behavior to mimic how we inform the user that post-mortem processes can't run expressions.

I don't know the answer to that, but I can say that I don't think this feature needs to be (or should be) specific to this use case. One of the things that I would like to be able to do is to stop a process right before it exits (regardless of whether that's through the exit syscall, or a fatal signal, etc.). PTRACE_O_TRACEEXIT lets you do that, but it means the process will end up in the same "almost a zombie" state, where any attempt to resume it will cause it to disappear. If we had a mechanism to prevent this, we could use it in this case as well. (and this case, unlike this "dead" state, is actually testable).

I think the tricky part is that (in both cases) the user might legitimately want to let the process exit, and "continue" is the normal way to do that, so I don't think we'd want to just error out of the continue command (or from the vCont packet). I think what we'd want is to make sure that the process doesn't accidentally exit while running an expression (possibly from within a data formatter), and for that I guess we'd need to let lldb know that running expressions is "dangerous". We already have Thread::SafeToCallFunctions, even though it's used for a slightly different purpose, but maybe it could be extended to handle this as well?
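
A rough standalone sketch of that idea (all names here are hypothetical except SafeToCallFunctions, which is the existing Thread method):

// Hypothetical sketch: a thread stopped at PTRACE_EVENT_EXIT, or in a seized
// "dead" process, reports itself as unsafe for function calls, so expression
// evaluation is refused before it can resume -- and thereby kill -- the process.
struct ThreadStopState {
  bool stopped_at_exit = false; // hypothetical flag, set on the pre-exit stop

  bool SafeToCallFunctions() const {
    if (stopped_at_exit)
      return false; // resuming this thread would let the process die
    return true;    // otherwise defer to the existing safety checks
  }
};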

@Jlalond (Contributor, Author) commented Apr 24, 2025

I think the tricky part is that (in both cases) the user might legitimately want to let the process exit, and "continue" is the normal way to do that, so I don't think we'd want to just error out of the continue command (or from the vCont packet). I think what we'd want is to make sure that the process doesn't accidentally exit while running an expression (possibly from within a data formatter), and for that I guess we'd need to let lldb know that running expressions is "dangerous". We already have Thread::SafeToCallFunctions, even though it's used for a slightly different purpose, but maybe it could be extended to handle this as well?

I think disallowing any non-explicit continues/disconnects is a good user experience as long as we display an appropriate message. The workflow I imagine: when halted in this state, any explicit continue or disconnect should just kill the process, but something like p MyVar.Size() should not.
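
A minimal standalone sketch of that workflow (all names hypothetical):

#include <string>

// Hypothetical sketch: an explicit continue or detach may let the process
// die -- the user asked for that -- but the implicit resume that expression
// evaluation needs is refused while we are in the non-resumable stop.
enum class Request { ExplicitContinue, Detach, ExpressionResume };

struct ResumeGate {
  bool non_resumable = true; // hypothetical: set when we seize a dead process

  std::string Handle(Request req) const {
    if (req == Request::ExpressionResume && non_resumable)
      return "error: process is in a non-resumable stop"; // blocks p MyVar.Size()
    return "ok"; // an explicit continue/detach just lets the process exit
  }
};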

We already have one piece of "status" parsing code in source/Host/linux/Host.cpp. I think it'd be better to reuse that one. I'm slightly torn as to whether reuse Host::GetProcessInfo for this (and add a new field to ProcessInstanceInfo -- or possibly expand on IsZombie), or whether to create a new linux-specific entry point which will return this data.

Will refactor; I looked for existing status parsing and it seems I missed it.

@Jlalond Jlalond changed the title from "Ptrace seize dead process" to "[LLDB] Ptrace seize dead process" on Apr 24, 2025
@Jlalond Jlalond force-pushed the ptrace-seize-dead-process branch 2 times, most recently from 3b10fcd to f1574f3 on April 28, 2025
@labath (Collaborator) commented Apr 29, 2025

I see this is still a draft, but to avoid surprises, I want to say that I think this should be two or three patches in its final form. One for the PTRACE_SEIZE thingy, one for the "mechanism to prevent a process from resuming", and maybe (depending on how involved it gets) one for refactoring the /proc/status parser.

@Jlalond (Contributor, Author) commented Apr 29, 2025

I see this is still a draft, but to avoid surprises, I want to say that I think this should be two or three patches in its final form. One for the PTRACE_SEIZE thingy, one for the "mechanism to prevent a process from resuming", and maybe (depending on how involved it gets) one for refactoring the /proc/status parser.

I'm okay with that; I'm still in the 'experiment and see what happens' phase when it comes to preventing continue.

How does this proposal sound:

  1. SEIZE + Parsing Proc Status
  2. GDB Server changes to prevent resumption
  3. Move the Proc Status (not stat) code to the HOST class

For #3, I think there's some loose scope around whether it should replace proc stat or be in addition to it. The biggest complexity here is that we're adding information into qProcessInfo that isn't exclusively about the process but about how we're interacting with the process. So I think tackling that as its own step makes sense.

@dmpots dmpots self-requested a review April 29, 2025 20:32
@DavidSpickett (Collaborator) commented:

Also if/when you commit parts of this, please include a version of the example gist in one of the commit messages; it might be useful in future.

@DavidSpickett (Collaborator) commented:

This is pretty complicated to test because it requires integration with the kernel

Can you make the same thing happen without using a coredumper? I feel like the answer is a solid no but I'm not sure why.

Another way we can do it is to write a test that checks that if the remote says it's in a non-resumable state, we act in a certain way. Only half the story but it's something.

response.Printf("ptrsize:%d;", proc_arch.GetAddressByteSize());
std::optional<bool> non_resumable = proc_info.IsNonResumable();
if (non_resumable)
  response.Printf("non_resumable:%d", *non_resumable);
Collaborator:

This is part of qProcessInfo (https://lldb.llvm.org/resources/lldbgdbremote.html#qprocessinfo) which is, I presume, only requested once because all the information is constant for the process lifetime.

At least for the situation at hand, non-resumable is also constant. The process had to get into that state somehow, but if you were debugging it before the non-resumable point, it wouldn't have got into the non-resumable state anyway, so it makes no difference.

So unless anyone can think of a situation where non-resumable could change, this packet is probably fine.
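
For illustration, the tail of a qProcessInfo reply with the proposed key appended might look like this (ptrsize comes from the snippet above; the values are invented):

ptrsize:8;non_resumable:1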

Contributor (Author):

Your intuition is correct: this works for now, but in the future (if we want to support O_TRACEEXIT) we would need to update this. Currently I can get away with attach returning a process info that we can't resume.

This is still very WIP, as I'm trying to sort out the gotchas with Greg. I will break this patch up into pieces soon :)

Collaborator:

Cool, so something would get added to the remote protocol to make this work, but exactly what we can decide later.

@labath (Collaborator) commented May 5, 2025

Move the Proc Status (not stat) code to the HOST class

I'd put this first (in which case it wouldn't be called "move" but "extend" or "refactor"), for two reasons:

  • it reduces the chance of ending up with two parsers
  • I'm not very happy with the implementation you have here. I think using structured data is overkill and makes using it more complicated. Since this is an internal API, and we don't have to worry about stability, I think a struct with a bool field (or optional<bool> if you need to treat "not present" differently) would be better. (That's also more-or-less what the existing implementation does)
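
A minimal sketch of that struct-with-optional shape (ParseProcStatus and all other names here are hypothetical; the CoreDumping field has existed in /proc/<pid>/status since Linux 4.15):

#include <fstream>
#include <optional>
#include <string>

// Hypothetical sketch: scan /proc/<pid>/status and surface the CoreDumping
// field, keeping "field absent" (older kernels) distinct from "not dumping".
struct ProcStatusInfo {
  std::optional<bool> core_dumping; // unset if the kernel has no such field
};

ProcStatusInfo ParseProcStatus(int pid) {
  ProcStatusInfo info;
  std::ifstream status("/proc/" + std::to_string(pid) + "/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("CoreDumping:", 0) == 0) // line looks like "CoreDumping:\t1"
      info.core_dumping = line.find('1') != std::string::npos;
  }
  return info;
}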

@Jlalond Jlalond force-pushed the ptrace-seize-dead-process branch 4 times, most recently from 8f6bb44 to 5c25880 on May 8, 2025
@Jlalond Jlalond marked this pull request as ready for review May 8, 2025 20:06
@llvmbot llvmbot added the lldb label May 8, 2025
@llvmbot (Member) commented May 8, 2025

@llvm/pr-subscribers-lldb

Author: Jacob Lalonde (Jlalond)

Changes

This is the actual PR for my SEIZE RFC. It is currently the bare bones of seizing a dead process and being able to attach and introspect with LLDB.

Additionally, right now I only check proc status before the seize, and we should double-check after the seize that the process has not changed. Worth noting: once you seize a coredumping process (and it hits trace-stop), the CoreDumping field in status will report 0.

This is pretty complicated to test because it requires integration with the kernel; thankfully, the setup only involves some very simple toy programs, which I have outlined with instructions in this gist.


Full diff: https://github.com/llvm/llvm-project/pull/137041.diff

2 Files Affected:

  • (modified) lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp (+109-6)
  • (modified) lldb/source/Plugins/Process/Linux/NativeProcessLinux.h (+4-1)
diff --git a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
index 7f2aba0e4eb2c..141e49d8a0b7e 100644
--- a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
+++ b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.cpp
@@ -312,10 +312,26 @@ NativeProcessLinux::Manager::Attach(
   Log *log = GetLog(POSIXLog::Process);
   LLDB_LOG(log, "pid = {0:x}", pid);
 
-  auto tids_or = NativeProcessLinux::Attach(pid);
-  if (!tids_or)
-    return tids_or.takeError();
-  ArrayRef<::pid_t> tids = *tids_or;
+  // This safety check lets us decide if we should
+  // seize or attach.
+  ProcessInstanceInfo process_info;
+  if (!Host::GetProcessInfo(pid, process_info))
+    return llvm::make_error<StringError>("Unable to read process info",
+                                         llvm::inconvertibleErrorCode());
+
+  std::vector<::pid_t> tids;
+  if (process_info.IsCoreDumping()) {
+    auto attached_or = NativeProcessLinux::Seize(pid);
+    if (!attached_or)
+      return attached_or.takeError();
+    tids = std::move(*attached_or);
+  } else {
+    auto attached_or = NativeProcessLinux::Attach(pid);
+    if (!attached_or)
+      return attached_or.takeError();
+    tids = std::move(*attached_or);
+  }
+
   llvm::Expected<ArchSpec> arch_or =
       NativeRegisterContextLinux::DetermineArchitecture(tids[0]);
   if (!arch_or)
@@ -444,6 +460,88 @@ NativeProcessLinux::NativeProcessLinux(::pid_t pid, int terminal_fd,
   SetState(StateType::eStateStopped, false);
 }
 
+llvm::Expected<std::vector<::pid_t>> NativeProcessLinux::Seize(::pid_t pid) {
+  Log *log = GetLog(POSIXLog::Process);
+
+  uint64_t options = GetDefaultPtraceOpts();
+  Status status;
+  // Use a map to keep track of the threads which we have attached/need to
+  // attach.
+  Host::TidMap tids_to_attach;
+  while (Host::FindProcessThreads(pid, tids_to_attach)) {
+    for (Host::TidMap::iterator it = tids_to_attach.begin();
+         it != tids_to_attach.end();) {
+      if (it->second == true) {
+        continue;
+      }
+      lldb::tid_t tid = it->first;
+      if ((status = PtraceWrapper(PTRACE_SEIZE, tid, nullptr, (void *)options))
+              .Fail()) {
+        // No such thread. The thread may have exited. More error handling
+        // may be needed.
+        if (status.GetError() == ESRCH) {
+          it = tids_to_attach.erase(it);
+          continue;
+        }
+        if (status.GetError() == EPERM) {
+          // Depending on the value of ptrace_scope, we can return a
+          // different error that suggests how to fix it.
+          return AddPtraceScopeNote(status.ToError());
+        }
+        return status.ToError();
+      }
+
+      if ((status = PtraceWrapper(PTRACE_INTERRUPT, tid)).Fail()) {
+        // No such thread. The thread may have exited. More error handling
+        // may be needed.
+        if (status.GetError() == ESRCH) {
+          it = tids_to_attach.erase(it);
+          continue;
+        }
+        if (status.GetError() == EPERM) {
+          // Depending on the value of ptrace_scope, we can return a
+          // different error that suggests how to fix it.
+          return AddPtraceScopeNote(status.ToError());
+        }
+        return status.ToError();
+      }
+
+      int wpid =
+          llvm::sys::RetryAfterSignal(-1, ::waitpid, tid, nullptr, __WALL);
+      // Need to use __WALL otherwise we receive an error with errno=ECHILD. At
+      // this point we should have a thread stopped if waitpid succeeds.
+      if (wpid < 0) {
+        // No such thread. The thread may have exited. More error handling
+        // may be needed.
+        if (errno == ESRCH) {
+          it = tids_to_attach.erase(it);
+          continue;
+        }
+        return llvm::errorCodeToError(
+            std::error_code(errno, std::generic_category()));
+      }
+
+      LLDB_LOG(log, "adding tid = {0}", tid);
+      it->second = true;
+
+      // move the loop forward
+      ++it;
+    }
+  }
+
+  size_t tid_count = tids_to_attach.size();
+  if (tid_count == 0)
+    return llvm::make_error<StringError>("No such process",
+                                         llvm::inconvertibleErrorCode());
+
+  std::vector<::pid_t> tids;
+  tids.reserve(tid_count);
+  for (const auto &p : tids_to_attach)
+    tids.push_back(p.first);
+
+  return std::move(tids);
+}
+
 llvm::Expected<std::vector<::pid_t>> NativeProcessLinux::Attach(::pid_t pid) {
   Log *log = GetLog(POSIXLog::Process);
 
@@ -513,8 +611,8 @@ llvm::Expected<std::vector<::pid_t>> NativeProcessLinux::Attach(::pid_t pid) {
   return std::move(tids);
 }
 
-Status NativeProcessLinux::SetDefaultPtraceOpts(lldb::pid_t pid) {
-  long ptrace_opts = 0;
+uint64_t NativeProcessLinux::GetDefaultPtraceOpts() {
+  uint64_t ptrace_opts = 0;
 
   // Have the child raise an event on exit.  This is used to keep the child in
   // limbo until it is destroyed.
@@ -537,6 +635,11 @@ Status NativeProcessLinux::SetDefaultPtraceOpts(lldb::pid_t pid) {
   // the child finishes sharing memory.
   ptrace_opts |= PTRACE_O_TRACEVFORKDONE;
 
+  return ptrace_opts;
+}
+
+Status NativeProcessLinux::SetDefaultPtraceOpts(lldb::pid_t pid) {
+  uint64_t ptrace_opts = GetDefaultPtraceOpts();
   return PtraceWrapper(PTRACE_SETOPTIONS, pid, nullptr, (void *)ptrace_opts);
 }
 
diff --git a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h
index d345f165a75d8..9ae4e57e74add 100644
--- a/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h
+++ b/lldb/source/Plugins/Process/Linux/NativeProcessLinux.h
@@ -175,7 +175,6 @@ class NativeProcessLinux : public NativeProcessELF,
 private:
   Manager &m_manager;
   ArchSpec m_arch;
-
   LazyBool m_supports_mem_region = eLazyBoolCalculate;
   std::vector<std::pair<MemoryRegionInfo, FileSpec>> m_mem_region_cache;
 
@@ -191,9 +190,13 @@ class NativeProcessLinux : public NativeProcessELF,
 
   // Returns a list of process threads that we have attached to.
   static llvm::Expected<std::vector<::pid_t>> Attach(::pid_t pid);
+  // Returns a list of process threads that we have seized and interrupted.
+  static llvm::Expected<std::vector<::pid_t>> Seize(::pid_t pid);
 
   static Status SetDefaultPtraceOpts(const lldb::pid_t);
 
+  static uint64_t GetDefaultPtraceOpts();
+
   bool TryHandleWaitStatus(lldb::pid_t pid, WaitStatus status);
 
   void MonitorCallback(NativeThreadLinux &thread, WaitStatus status);

@Jlalond (Contributor, Author) commented May 8, 2025

@labath @DavidSpickett Thanks for the patience.

I've broken down my prototype, and this is now patch 2, where we implement the SEIZE functionality. I will follow up with making it so we can't resume the process when we're in this state.

@DavidSpickett you mentioned you wanted me to include my gist; to my knowledge github will include my summary by default. Did you want me to check in the toy program as an example?

@DavidSpickett (Collaborator) commented:

You can include your gist content in the PR summary; I've certainly seen longer commit messages than that in llvm :)

The question I want to be able to answer from the final commit message is "what is a dead process and how do I make one?", so that in future, if I need to investigate this code, I know where to start. You can do that by linking to well-established documentation on the subject, or writing it in your own words, and/or including that example.

If we are able to construct a test case, that would serve the same purpose.

@DavidSpickett (Collaborator) commented:

Also FYI this has CI failures, https://buildkite.com/llvm-project/github-pull-requests/builds/177176#0196b180-e7c0-49e9-99ac-65dcc8a3c1a9, not sure if you are aware.

LLDB isn't known for being super stable in pre-commit CI, but these failures are on the same topic as this change.

@Jlalond Jlalond force-pushed the ptrace-seize-dead-process branch from 5c25880 to 68077c2 on May 20, 2025
@Jlalond (Contributor, Author) commented May 20, 2025

@DavidSpickett fixed the test; I gotcha'd myself checking an optional of a bool. I added my gist to the description, let me know what you think.

@DavidSpickett DavidSpickett changed the title from "[LLDB] Ptrace seize dead process" to "[LLDB] Ptrace seize dead processes on Linux" on May 21, 2025
@DavidSpickett (Collaborator) commented:

I added my gist to the description, let me know what you think

This part looks good, that'll be enough to test this / explain why it exists.

@DavidSpickett (Collaborator) left a comment:

I leave the discussion of race conditions or lack of to @labath .

Collaborator:

Can you refactor this to have only one set of if (!... ... tids = lines? Maybe not because the auto would resolve to different things.

Maybe you can do:

auto attached_or = is_core_dumping ? NativeProcessLinux::Seize(pid) : NativeProcessLinux::Attach(pid);

Collaborator:

seized / need to seize

Collaborator:

In fact, why does this say attach a lot? Are we attaching by seizing, or are seize and attach two completely different things?

Collaborator:

Shame that we can't do these two PtraceWrapper calls like:

if (status = PtraceWrapper(PTRACE_SEIZE).Fail() || status = PTraceWrapper(PTRACE_INTERRUPT).Fail())
  <the error handling bit>

Maybe something with the comma operator could do it; I'm not too familiar with that.

Another way to do it is:

status = PtraceWrapper(PTRACE_SEIZE);
if (!status.Fail())
  status = PtraceWrapper(PTRACE_INTERRUPT);
if (status.Fail())
  <common error handling>

Collaborator:

Move this comment to before wpid =. Usual practice is to comment before doing something.

Collaborator:

This could be if (llvm::sys.....__WALL) < 0), since wpid is not used outside of this check.

Collaborator:

If you are about to return, is there any need to update tids_to_attach?

Same applies to the code above. If tids_to_attach is a copy local to this function, we don't need to clean it up before returning.

Collaborator:

This loop is erasing map entries as it goes, is this safe?
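
For reference, the idiom this loop needs in order to be safe, as a minimal standalone sketch (assuming the map behaves like a std::map, as Host::TidMap does): erase() invalidates only the erased iterator and returns the one after it, so erasing during iteration is fine as long as that return value is used.

#include <map>

// Safe erase-while-iterating over a std::map.
void PruneFailed(std::map<int, bool> &tids) {
  for (auto it = tids.begin(); it != tids.end();) {
    if (!it->second)
      it = tids.erase(it); // use the returned iterator, never the old one
    else
      ++it;
  }
}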

@labath (Collaborator) left a comment:

[I'm sorry, this is going to be long]

Most of David's comments are there because you copied the existing attach implementation. Even if we go down the path of different attach methods, I don't think it's necessary to copy all of that. Actual attaching is a small part of that function. The rest of the stuff, which deals with looping, gathering thread ids, and error messages, could be handled by common code.

I say "if" because it's not clear to me that we've given up on using the common code path we discussed on the RFC thread. I think it could look something like this:

  ptrace(PTRACE_SEIZE, pid, 0, (void*)(PTRACE_O_TRACEEXIT));
  syscall(SYS_tgkill, pid, pid, SIGSTOP);
  ptrace(PTRACE_INTERRUPT, pid, 0, 0);
  int status;
  waitpid(pid, &status, __WALL);
  if (status>>8 == (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
    ptrace(PTRACE_CONT, pid, 0, 0);
    waitpid(pid, &status, __WALL);
  }

If you run this on a "regular" process you get this sequence of events:

ptrace(PTRACE_SEIZE, 31169, NULL, PTRACE_O_TRACEEXIT) = 0
tgkill(31169, 31169, SIGSTOP)           = 0
ptrace(PTRACE_INTERRUPT, 31169)         = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=31169, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], __WALL, NULL) = 31169

.. or this one:

ptrace(PTRACE_SEIZE, 31169, NULL, PTRACE_O_TRACEEXIT) = 0
tgkill(31169, 31169, SIGSTOP)           = 0
ptrace(PTRACE_INTERRUPT, 31169)         = 0
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}|PTRACE_EVENT_STOP<<16], __WALL, NULL) = 31169
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=31169, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
ptrace(PTRACE_CONT, 31169, NULL, 0)     = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=31169, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], __WALL, NULL) = 31169

The difference is in whether the kernel is fast enough to notice the SIGSTOP before it PTRACE_INTERRUPTS the process (the fact that this results in nondeterministic behavior might be worth filing a bug report for the kernel). With this simple code, I usually get the second version, but this is really a matter of microseconds. Even inserting usleep(0) between the SIGSTOP and PTRACE_INTERRUPT calls is enough to make the process immediately stop with SIGSTOP.

However, that doesn't matter. After this code is done, the process is stopped with a SIGSTOP, just like it was in the PTRACE_ATTACH case. For a "dead" process, the sequence is slightly different:

ptrace(PTRACE_SEIZE, 31169, NULL, PTRACE_O_TRACEEXIT) = 0
tgkill(31169, 31169, SIGSTOP)           = 0
ptrace(PTRACE_INTERRUPT, 31169)         = 0
wait4(31169, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}|PTRACE_EVENT_EXIT<<16], __WALL, NULL) = 31169
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=31169, si_uid=1000, si_status=SIGTRAP, si_utime=0, si_stime=0} ---

In this case you always get a PTRACE_EVENT_EXIT stop.

The nicest part about this is that it could be used to distinguish between the two process states -- without the core dumping flag, the proc filesystem, or anything else. I think it's theoretically possible to catch a live thread while it's exiting and get a PTRACE_EVENT_EXIT, but getting that for all threads in the process is very unlikely.

Or, instead of making this a process wide-property, we could make this a property of a specific thread. We could say that a thread that is in an "exiting" state is not allowed to run expressions. And this dead process would just be a special case of a process which happens to have all threads in the exiting state.

And the nicest part about that is that this is extremely close to being testable. We no longer need a core-dumping process -- we just need a process that's stopped in a PTRACE_EVENT_EXIT. Right now, we don't have a way to do that, but we're really close to that. One way to do that would be to let the user request stopping a thread at exit. If we implement all of the rest, this will be mainly a UX question of how we want to expose this to the user. I think we could make some sort of an experimental setting for that, or even just some secret packet that we send with the process plugin packet command.

What do you say to all this? I realize I am sort of asking you to build my pet feature, but I think that by framing your feature (debugging of dead processes) as a special case of this one (debugging exiting processes), we can avoid most of the problematic issues with this PR (divergence of implementation, impossibility of testing).

Collaborator:

Suggested change:
- if (process_info.IsCoreDumping() && *process_info.IsCoreDumping()) {
+ if (process_info.IsCoreDumping() == true) {
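
The suggestion works because comparing a std::optional<bool> against a value compares the contained value when the optional is engaged, and is simply false when it is empty:

#include <cassert>
#include <optional>

int main() {
  std::optional<bool> dumping; // e.g. the field was absent on an old kernel
  assert(!(dumping == true));  // an empty optional never compares equal
  dumping = false;
  assert(!(dumping == true));  // engaged, but the value is false
  dumping = true;
  assert(dumping == true);     // engaged and true
  return 0;
}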

@Jlalond (Contributor, Author) commented May 28, 2025

What do you say to all this? I realize I am sort of asking you to build my pet feature, but I think that by framing your feature (debugging of dead processes) as a special case of this one (debugging exiting processes), we can avoid most of the problematic issues with this PR (divergence of implementation, impossibility of testing).

I'm in! I always thought this was a great segue for PTRACE_O_TRACEEXIT as a feature. We recently had someone else leave our team and I'm diverting to finish up some of the ELF Core work (hence you've seen my patches), but once I get that wrapped up I'll be back working on this.

@Jlalond Jlalond force-pushed the ptrace-seize-dead-process branch 2 times, most recently from 5840671 to 39d40d1 on July 15, 2025
github-actions bot commented Jul 15, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@Jlalond Jlalond force-pushed the ptrace-seize-dead-process branch from 39d40d1 to c7f50e6 on July 15, 2025
@Jlalond (Contributor, Author) commented Jul 16, 2025

@labath I'm back to working on this!

My only questions are about aligning on implementation. Today we effectively ignore the stops from trace_exit (despite setting the option), leveraging the event mostly to clean up the program.

I think we have two approaches, I'm open to either but I may be ignorant to some details about the gdb remote protocol.

  1. We return a process info whenever we attach; we could return the coredumping flag here and let the GDB remote client know that this process is not safe to resume, only to be exited. We can also set CanJit to false to prevent trying to JIT in the zombie state. In the future we could extend this to send the trace_exit state and do the same, dependent on some setting.

  2. We have the gdb server send a packet indicating whether it's safe to resume. I'm pretty ignorant about this, but it lets us keep the process logic in gdb-server, which I do appreciate. My concern here is that we'd need to propagate some user setting, and I'm unsure how to do that; code pointers would be appreciated.

Assuming we have an extant system to send settings, I prefer option 2; option 1 is similar to how I landed my internal patch when I first attempted this. @DavidSpickett, I'd also appreciate your insight.

@labath (Collaborator) commented Jul 24, 2025

Sorry about the delay. I'm not sure I understand the question, but the way I'm imagining this is to let the server communicate the "is it safe to resume" bit to the client. After that, the client can do whatever it wants/needs (e.g. disable jitting/expression evaluation).

I think this is closest to your option one, except that I would make this a per-thread property, for example via a new field in the qThreadStopInfo packet. Then, the core-dumping process would just be a special case of a process whose threads are all exiting.

I'm not sure I understand the second option, but we don't really have a way for the server to send unsolicited packets to the client -- but I also don't see why we would need to send them.

As for the "in the future" part, I think it would be easier to implement most of that feature now, because:

  • that feature (unlike core dumping) is testable
  • it requires solving most of the same problems, like how do we prevent from resuming the application unintentionally, and how do we let the user know that the application is in this non-resumable state

The only extra thing we'd need to solve for that is how does the user actually request that the application stops before exiting -- and I'd be fine if we postponed the creation of the UI for that. However, that doesn't mean we can't create some sort of a back door for testing. For example, we can add a new packet to lldb-server (QEnableExitStops?), but don't hook it up to any user-facing commands. The tests (and power users) can still access it via the "process plugin packet" command. The test for handling of this state would consist of sending this packet, resuming the application and then waiting for it to stop in the "pre-exit" state. A core dumping process would just be a special case which is in the pre-exit state immediately after attaching.

imyixinw pushed a commit to imyixinw/llvm-project that referenced this pull request Sep 5, 2025
… prototype for ptrace seize

Summary:
This is an internal version of my upstream PR llvm#137041 where I'm hacking in PTRACE_SEIZE support for coredumping processes, and then preventing them from being resumed. We only allow exiting; I've explained this in further detail in my test plan.

This is landing internally first, to give the upstream PR and RFC more time to mature and identify problems or new capabilities we could build on top of this workflow.

Test Plan:
The setup to test this is quite convoluted, and I explain it in greater detail in my upstream PR, but to summarize here:

I compile this program to hold the kernel-provided pipe:
```
# Set up the core pipe to our program
echo "|/data/users/jalalonde/sand_test_code/ptrace.out %p" | sudo tee /proc/sys/kernel/core_pattern > /dev/null
```

Then invoke the following:
```
int main() {
  return 42 / 0;
}
```

This will create a process that is coredumping; I've named the program sigabrt. We then use pgrep to find our pid.
```
[[email protected] /data/users/jalalonde/sand_test_code]$ pgrep "sigabrt"
3009565
```

And walk through attaching to the coredumping proc, and then trying to continue
```
[[email protected] /data/users/jalalonde/llvm-sand/dbg]$ ./bin/lldb
attach 3009565
Process 3009565 stopped
* thread #1, name = 'sigabrt.out', stop reason = signal SIGSTOP
    frame #0: 0x00005618d35b114d sigabrt.out`main at sigabrt.cpp:6:18
   3    }
Executable binary set to "/data/users/jalalonde/sand_test_code/sigabrt.out".
Architecture set to: x86_64-unknown-linux-gnu.
warning: sigabrt.cpp: source file checksum mismatch between line table (69220de0d0b840cdf9c5b92a82466d5e) and file on disk (f76c8c2b73dc057688650908e7e17f34)
(lldb) bt
* thread #1, name = 'sigabrt.out', stop reason = signal SIGSTOP
  * frame #0: 0x00005618d35b114d sigabrt.out`main at sigabrt.cpp:6:18
    frame #1: 0x00007f0233a295d0 libc.so.6`__libc_start_call_main + 128
    frame #2: 0x00007f0233a29680 libc.so.6`__libc_start_main@@GLIBC_2.34 + 128
    frame #3: 0x00005618d35b1065 sigabrt.out`_start + 37
(lldb) continue
error: Failed to resume process: Process is in a non-resumable stop. Only detach or exit are supported..
(lldb) detach
Process 3009565 detached
```

So we succeeded in attaching, moving the process from 'S' to trace-stop 't', and then prevented the resumption that would kill the proc. This is crucial so that data formatters or other code don't accidentally resume and kill the process.

Test Matrix
| Case | LLDB behavior before patch | LLDB behavior with patch |
|------|----------------------------|--------------------------|
| Attach to root process | Waitpid hang until ctrl-c | Waitpid hang until ctrl-c |
| Attach to root process coredumping | Waitpid hang until ctrl-c | Waitpid hang until ctrl-c |
| Run expression | N/A, can't attach | Runs without jitting, no error |
| Call function | N/A | Fails gracefully with error |
Reviewers: gclayton, wanyi, jeffreytan, peix

Reviewed By: gclayton

Subscribers: davidayoung, peix, #lldb_team

Differential Revision: https://phabricator.intern.facebook.com/D73806555