
[lldb-dap] Fix raciness in launch and attach tests #137920


Open · wants to merge 2 commits into main

Conversation

JDevlieghere
Member

We've gotten multiple reports of the launch and attach test being flaky, both in CI and locally when running the tests. I believe the flakiness is due to a race between the main thread and the event handler thread.

Launching and attaching is done in synchronous mode, so the corresponding requests return only after the respective operation has completed. In synchronous mode, no stop event is emitted. When the user provides launch or attach commands, however, we cannot use synchronous mode. To hide the resulting stop events, the default event handler thread ignored stop events until the "configurationDone" request had been answered. The problem with that is that there's no guarantee we have handled the stop event before we have handled the configurationDone request.
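For reference, "synchronous mode" here is the SB API debugger mode. A minimal sketch of the mode switch this paragraph describes (SetAsync and Attach are the real SB API; attach_info and pid are illustrative):

dap.debugger.SetAsync(false); // synchronous: Attach blocks until the
                              // operation completes
lldb::SBError error;
lldb::SBAttachInfo attach_info(pid); // pid: illustrative
dap.target.Attach(attach_info, error); // returns only once attached;
                                       // no stop event is emitted
dap.debugger.SetAsync(true); // re-enable asynchronous events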

Looking at the logs, you can see that we're still in the process of sending module events (which I recently added) when we receive, and respond to, the "configurationDone" request, before the event handler sees the launch or attach stop event. At that point dap.configuration_done_sent is true, so the stop event is sent to the client, which the test doesn't expect.

This PR fixes the raciness by using an atomic flag to signal that the next stop event should be ignored. An alternative approach could be to stop trying to hide the initial stop event, and instead report it to the client unconditionally. Instead of ignoring the stop for the asynchronous case, we could send a stop event after we're done handling the synchronous case.

Fixes #137660

@llvmbot
Member

llvmbot commented Apr 30, 2025

@llvm/pr-subscribers-lldb

Author: Jonas Devlieghere (JDevlieghere)

Changes: same as the PR description above.


Full diff: https://github.com/llvm/llvm-project/pull/137920.diff

8 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py (+1-1)
  • (modified) lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py (-1)
  • (modified) lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py (-1)
  • (modified) lldb/tools/lldb-dap/DAP.cpp (+2-1)
  • (modified) lldb/tools/lldb-dap/DAP.h (+1)
  • (modified) lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp (+4)
  • (modified) lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp (+1-1)
  • (modified) lldb/tools/lldb-dap/Handler/RequestHandler.cpp (+1)
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
index ee5272850b9a8..0d81b7d80102f 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
@@ -103,7 +103,7 @@ def verify_breakpoint_hit(self, breakpoint_ids):
                     match_desc = "breakpoint %s." % (breakpoint_id)
                     if match_desc in description:
                         return
-        self.assertTrue(False, "breakpoint not hit")
+        self.assertTrue(False, f"breakpoint not hit: stopped_events={stopped_events}")
 
     def verify_stop_exception_info(self, expected_description, timeout=timeoutval):
         """Wait for the process we are debugging to stop, and verify the stop
diff --git a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
index 6f70316821c8c..dcdfada2ff4c2 100644
--- a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
+++ b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
@@ -25,7 +25,6 @@ def spawn_and_wait(program, delay):
     process.wait()
 
 
-@skipIf
 class TestDAP_attach(lldbdap_testcase.DAPTestCaseBase):
     def set_and_hit_breakpoint(self, continueToExit=True):
         source = "main.c"
diff --git a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
index 51f62b79f3f4f..152e504af6d14 100644
--- a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
+++ b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
@@ -19,7 +19,6 @@
 import socket
 
 
-@skip
 class TestDAP_attachByPortNum(lldbdap_testcase.DAPTestCaseBase):
     default_timeout = 20
 
diff --git a/lldb/tools/lldb-dap/DAP.cpp b/lldb/tools/lldb-dap/DAP.cpp
index b593353110787..e191d8f2d3745 100644
--- a/lldb/tools/lldb-dap/DAP.cpp
+++ b/lldb/tools/lldb-dap/DAP.cpp
@@ -85,7 +85,8 @@ DAP::DAP(Log *log, const ReplMode default_repl_mode,
       exception_breakpoints(), focus_tid(LLDB_INVALID_THREAD_ID),
       stop_at_entry(false), is_attach(false),
       restarting_process_id(LLDB_INVALID_PROCESS_ID),
-      configuration_done_sent(false), waiting_for_run_in_terminal(false),
+      configuration_done_sent(false), ignore_next_stop(false),
+      waiting_for_run_in_terminal(false),
       progress_event_reporter(
           [&](const ProgressEvent &event) { SendJSON(event.ToJSON()); }),
       reverse_request_seq(0), repl_mode(default_repl_mode) {
diff --git a/lldb/tools/lldb-dap/DAP.h b/lldb/tools/lldb-dap/DAP.h
index 88eedb0860cf1..40d6400765a8d 100644
--- a/lldb/tools/lldb-dap/DAP.h
+++ b/lldb/tools/lldb-dap/DAP.h
@@ -189,6 +189,7 @@ struct DAP {
   // the old process here so we can detect this case and keep running.
   lldb::pid_t restarting_process_id;
   bool configuration_done_sent;
+  std::atomic<bool> ignore_next_stop;
   llvm::StringMap<std::unique_ptr<BaseRequestHandler>> request_handlers;
   bool waiting_for_run_in_terminal;
   ProgressEventReporter progress_event_reporter;
diff --git a/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
index 3ef87cbef873c..5778ae53c9b0b 100644
--- a/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
@@ -155,6 +155,9 @@ void AttachRequestHandler::operator()(const llvm::json::Object &request) const {
         std::string connect_url =
             llvm::formatv("connect://{0}:", gdb_remote_hostname);
         connect_url += std::to_string(gdb_remote_port);
+        // Connect remote will generate a stopped event even in synchronous
+        // mode.
+        dap.ignore_next_stop = true;
         dap.target.ConnectRemote(listener, connect_url.c_str(), "gdb-remote",
                                  error);
       } else {
@@ -166,6 +169,7 @@ void AttachRequestHandler::operator()(const llvm::json::Object &request) const {
     // Reenable async events
     dap.debugger.SetAsync(true);
   } else {
+    dap.ignore_next_stop = true;
     // We have "attachCommands" that are a set of commands that are expected
     // to execute the commands after which a process should be created. If there
     // is no valid process after running these commands, we have failed.
diff --git a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
index ce34c52bcc334..c53d7d5e2febf 100644
--- a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
@@ -167,7 +167,7 @@ static void EventThreadFunction(DAP &dap) {
             // stop events which we do not want to send an event for. We will
             // manually send a stopped event in request_configurationDone(...)
             // so don't send any before then.
-            if (dap.configuration_done_sent) {
+            if (!dap.ignore_next_stop.exchange(false)) {
               // Only report a stopped event if the process was not
               // automatically restarted.
               if (!lldb::SBProcess::GetRestartedFromEvent(event)) {
diff --git a/lldb/tools/lldb-dap/Handler/RequestHandler.cpp b/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
index b7d3c8ced69f1..e7601a929aa6c 100644
--- a/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
@@ -250,6 +250,7 @@ llvm::Error BaseRequestHandler::LaunchProcess(
     if (error.Fail())
       return llvm::make_error<DAPError>(error.GetCString());
   } else {
+    dap.ignore_next_stop = true;
     // Set the launch info so that run commands can access the configured
     // launch details.
     dap.target.SetLaunchInfo(launch_info);

@labath (Collaborator) left a comment

Someone once told me: "Whatever problem you're trying to solve, if you're using atomics, now you've got two problems."

I can't say I've always followed that advice, but I do think that atomics are rarely the right solution to a problem. And I think this is a good example of that.

What you want is not to suppress an event (whatever it might be). You want to suppress a very specific event (the one that is supposed to be generated by the action you're about to perform). That means that something needs to ensure that the other thread reads this flag after you've generated the event. If that's true, then you already have a happens-before relationship and the atomic is not necessary. If it isn't (there is no happens-before), then the atomic will not help.
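To illustrate the concern, a minimal sketch with hypothetical names (not the actual lldb-dap code):

std::atomic<bool> ignore_next_stop{false};

void LaunchTheProcess(); // hypothetical
void SendStoppedEvent(); // hypothetical

// Request thread:
void Launch() {
  ignore_next_stop = true; // (1) set the flag
  LaunchTheProcess();      // (2) eventually produces a stop event
}

// Event thread:
void HandleStopEvent() {
  // (3) exchange() consumes the flag atomically, but nothing pairs it
  // with the stop produced by (2): a stop that was already queued
  // before (1) would consume the flag instead, and the stop from (2)
  // would then be forwarded to the client.
  if (!ignore_next_stop.exchange(false))
    SendStoppedEvent();
}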

That said, why is ConnectRemote generating a stopped event in synchronous mode? Maybe that's the bug we should fix?

@ashgti
Contributor

ashgti commented Apr 30, 2025

Maybe we could block the DAP queue until we get the stop event, before we send the initialized event.

void LaunchRequestHandler::PostRun() const {
  if (dap.target.GetProcess().IsValid()) {
    // Attach happens when launching with runInTerminal.
    SendProcessEvent(dap, dap.is_attach ? Attach : Launch);
  }
  dap.SendJSON(CreateEventObject("initialized"));
}
This is where I'm thinking we could wait for the stopped event. That may also fix this by preventing us from handling the next DAP request until we get the 'stopped' event, which I think would arrive after the 'module' events are done processing.

@JDevlieghere
Member Author

Someone once told me: "Whatever problem you're trying to solve, if you're using atomics, now you've got two problems."

I can't say I've always followed that advice, but I do think that atomics are rarely the right solution to a problem. And I think this is a good example of that.

What you want is not to suppress an event (whatever it might be). You want to suppress a very specific event (the one that is supposed to be generated by the action you're about to perform). That means that something needs to ensure that the other thread reads this flag after you've generated the event. If that's true, then you already have a happens-before relationship and the atomic is not necessary. If it isn't (there is no happens-before), then the atomic will not help.

You're right. I had an implementation that used a mutex, but it made the same mistake as the atomic here. Luckily that's easily solvable by putting both the code that generates the stop and the code that consumes it inside the critical section.

That said, why is ConnectRemote generating a stopped event in synchronous mode? Maybe that's the bug we should fix?

That's a long-standing bug that comes up about once or twice a year. I spent some time looking into it at some point, but it's so long ago that I don't remember what the blocker was. That said, solving that isn't sufficient for the problem at hand, since we need to solve the asynchronous case as well.

@JDevlieghere
Member Author

I changed the implementation to use a mutex to address Pavel's concern. I'm even more unhappy with the current state of the patch. The configurationDone request (*) now blocks until the stop event has been processed. Unfortunately, that leads to a potential deadlock between the mutex protecting the data member representing the first stop and the API mutex: some events (like module and breakpoint events) require the target API mutex. The patch temporarily unlocks the API mutex to avoid this.

(*) RE @ashgti's point: we could move this logic into LaunchRequestHandler::PostRun() and AttachRequestHandler::PostRun() but the latter doesn't exist yet so for the current patch I kept it in the configurationDone request, although I agree that the launch and attach request would be a better place to do this.
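For context, the synchronization referenced below roughly has this shape (a sketch based on the snippet that follows; the PendingStopEvent value and the event-thread side are assumptions):

struct DAP {
  enum class FirstStopState { PendingStopEvent, IgnoredStopEvent, NoStopEvent };
  std::mutex first_stop_mutex;
  std::condition_variable first_stop_cv;
  FirstStopState first_stop_state = FirstStopState::NoStopEvent;
  // ...
};

// Event thread, when it swallows the first stop (assumed):
void OnFirstStop(DAP &dap) {
  {
    std::lock_guard<std::mutex> guard(dap.first_stop_mutex);
    dap.first_stop_state = DAP::FirstStopState::IgnoredStopEvent;
  }
  // Wake the configurationDone handler blocked in first_stop_cv.wait().
  dap.first_stop_cv.notify_all();
}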

Comment on lines 54 to 71

{
  // Temporarily unlock the API mutex to avoid a deadlock between the API
  // mutex and the first stop mutex.
  lock.unlock();

  // Block until we have either ignored the first stop event or we didn't
  // generate one because we attached or launched in synchronous mode.
  std::unique_lock<std::mutex> stop_lock(dap.first_stop_mutex);
  dap.first_stop_cv.wait(stop_lock, [&] {
    return dap.first_stop_state == DAP::FirstStopState::NoStopEvent ||
           dap.first_stop_state == DAP::FirstStopState::IgnoredStopEvent;
  });

  // Relock the API mutex.
  lock.lock();
}

Contributor

I think we should move this above sending the response to the configurationDone.

The initialized event triggers the rest of the adapter configs (e.g. sending all the breakpoints) and the configurationDone request should be the last in that chain.

In theory, we could even move the dap.WaitForProcessToStop(arguments.timeout); call out of the attach/launch requests and have that here instead, and fully expect the attach/launch flow to be async.

if (!lldb::SBProcess::GetRestartedFromEvent(event)) {
  SendStdOutStdErr(dap, process);
  SendThreadStoppedEvent(dap);
  {
Contributor

What if we didn't start the event thread until the configurationDone request was called?

Do we miss any events that would have happened before this point?

Or maybe we don't subscribe to process events until we get the configurationDone request? Could that alleviate the need for this lock?

Member Author

I think that's not an option. Every process must have an event handler and IIUC (@jimingham should be able to confirm) if you don't handle the event, the process will never get out of the "launching" or "attaching" state.

Collaborator

That is true, but it should only matter if someone actually cares about the process in the meantime. Provided we don't need to inspect the state, a sequence like this should be fine:

  1. cause a process event to be emitted
  2. start the listener thread
  3. listener thread receives the event

But I don't know if that's the case here...

Another potential option: If we are sure the event is going to be emitted (and the listener thread is not running), maybe we can wait for the event (synchronously) on the thread which is processing the request.
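A sketch of that last option with the public SB API (the listener variable and the timeout are assumptions; the SBListener/SBProcess calls are real API):

// Request thread, right after triggering the launch/attach, while the
// event thread is not yet consuming events.
lldb::SBEvent event;
if (listener.WaitForEventForBroadcasterWithType(
        /*num_seconds=*/5, dap.target.GetProcess().GetBroadcaster(),
        lldb::SBProcess::eBroadcastBitStateChanged, event)) {
  if (lldb::SBProcess::GetStateFromEvent(event) == lldb::eStateStopped) {
    // Swallow the initial stop here instead of forwarding it.
  }
}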

@ashgti (Contributor) left a comment

LGTM

@walter-erquinigo
Member

Just throwing out some ideas that might simplify this.
Is it possible to do the launching and attaching in asynchronous mode so that the stop events are always emitted?
Also, the configuration done event can be emitted at any time during initialization. It could even be emitted before the actual launching and attaching happen.

Would any of this help?

@ashgti
Copy link
Contributor

ashgti commented May 1, 2025

To clarify the DAP flow a little, see 'Launch Sequencing' in https://microsoft.github.io/debug-adapter-protocol/overview

Once the DAP server sends the response to the initialize request, the following happen in parallel:

  • The initialized event triggers the client sending setBreakpoints, etc. with the configurationDone request at the end of the chain.
  • launch or attach requests are sent and we send back a process event.

The flow from VSCode at least often ends up with:

  • initialize seq=1
  • launch seq=2
  • setBreakpoints seq=3
  • configurationDone seq=4

But these do not strictly have to be in that order.

The DAP server may respond to these requests asynchronously, but today lldb-dap is single-threaded and handles each of these requests in order.

Looking at the failing test https://lab.llvm.org/buildbot/#/builders/18/builds/15174/steps/6/logs/stdio, we had the following sequence:

{"command":"initialize", ...}
{"command":"attach", ...}
{"command":"setBreakpoints", ...}
{"command":"configurationDone", ...}
{"event":"stopped", ...}

But that stopped event is wrong. This attach did not set "stopOnEntry": true, so the process should have continued at that point.

@kusmour
Contributor

kusmour commented May 1, 2025

Also, the configuration done event can be emitted at any time during initialization. It could even be emitted before the actual launching and attaching happen.

No, this is something we can control. The configurationDone request is only sent after the initialized event, and lldb-dap only sends that event after the launch/attach response. (This is also not compliant with the DAP spec, see below.)

The DAP server can respond to these events asynchronously but today the lldb-dap is single threaded for handling each of these responses.

It's important to point out that in the DAP specification:

After the response to configurationDone is sent, the debug adapter may respond to the launch or attach request, and then the debug session has started.

Technically the response to launch/attach should be the end of the chain, but because of the single-threaded handling we can't do this. I don't think this is impacting the client side (at least for VS Code, it doesn't seem to do anything special upon a launch/attach response), but it does suggest that lldb-dap should be able to handle requests and events asynchronously.

@ashgti
Contributor

ashgti commented May 1, 2025

Technically the response to launch/attach should be the end of the chain.

That's not how it's implemented in VS Code, at least.

@kusmour
Contributor

kusmour commented May 1, 2025

Technically the response to launch/attach should be the end of the chain.

That's not how it's implemented in VS Code, at least.

Yes, but that code doesn't define how the adapter should respond to those requests; the specification does. What you showed above doesn't conflict with the response being the end of the launch sequence and the mark that "the debug session has started". They don't need to coordinate with the initialized event, and they shouldn't. The adapter should use initialized to signal that it can take breakpoint requests.

@JDevlieghere
Member Author

JDevlieghere commented May 1, 2025

Is it possible to do the launching and attaching in asynchronous mode so that the stop events are always emitted?

Yes, that's the alternative I mentioned in the PR description:

An alternative approach could be to stop trying to hide the initial stop event, and instead report it to the client unconditionally. Instead of ignoring the stop for the asynchronous case, we could send a stop event after we're done handling the synchronous case.

Though I need to confirm that this would be compliant with the spec. I also think that, from a user experience point of view, you probably don't want to see the stop (similar to how LLDB doesn't promote every private stop to a public stop). When attaching, we always stop. For launching, we hide the initial stop unless you requested stop-at-entry.

@JDevlieghere
Member Author

After the response to configurationDone is sent, the debug adapter may respond to the launch or attach request, and then the debug session has started.

Technically the response to launch/attach should be the end of the chain, but because of the single-threaded handling we can't do this. I don't think this is impacting the client side (at least for VS Code, it doesn't seem to do anything special upon a launch/attach response), but it does suggest that lldb-dap should be able to handle requests and events asynchronously.

This is really interesting, thanks for pointing that out @kusmour. While it's true that the single-threaded nature of how we handle requests makes it tricky to delay the response, we could delay handling the request altogether until configurationDone has been sent. For example, with the current queue approach, we could park the launch and attach requests off to the side until the configuration is done, as in the sketch below.
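A hypothetical sketch of that parking idea (none of these names are the actual lldb-dap queue types; Dispatch stands in for the real request handling):

std::vector<llvm::json::Object> deferred_requests;
bool configuration_done = false;

void OnRequest(const llvm::json::Object &request) {
  llvm::StringRef command = request.getString("command").value_or("");
  if (!configuration_done && (command == "launch" || command == "attach")) {
    deferred_requests.push_back(request); // park a copy for later
    return;
  }
  Dispatch(request); // hypothetical dispatch to the request handlers
  if (command == "configurationDone") {
    configuration_done = true;
    for (llvm::json::Object &parked : deferred_requests)
      Dispatch(parked); // now safe to handle launch/attach
    deferred_requests.clear();
  }
}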

JDevlieghere added a commit to JDevlieghere/llvm-project that referenced this pull request May 1, 2025
This PR changes how we treat the launch sequence in lldb-dap.

 - Send the initialized event after we finish handling the initialize
   request, rather than after we finish attaching or launching.
 - Delay handling the launch and attach request until we have received
   the configurationDone request. The latter is now largely a NO-OP and
   only exists to signal lldb-dap that it can handle the launch and
   attach requests.
 - Add synchronization for ignoring the first stop event. I originally
   wanted to do all the launching and attaching in asynchronous mode,
   but that's a little bit trickier than expected. On macOS, I'm getting
   an additional stop when we hit the dyld breakpoint. This might be
   something I come back to, but for now I'm keeping the old behavior
   with the proposed synchronization from llvm#137920.

Background: https://discourse.llvm.org/t/reliability-of-the-lldb-dap-tests/86125
@labath
Collaborator

labath commented May 2, 2025

I'm even more unhappy with the current state of the patch.

I take it you're referring to the uncritical sections. What if we don't lock the mutex automatically for each handler (but leave it up to them to figure out when/if to lock it)? Or we could have the handler declare in some way whether it wants to run with the lock held?

Linked issue: [LLDB] Flakey DAP tests on builders