
[lldb-dap] Fix raciness in launch and attach tests #137920


Open · wants to merge 2 commits into main

Conversation

JDevlieghere
Member

We've gotten multiple reports of the launch and attach test being flaky, both in CI and locally when running the tests. I believe the flakiness is due to a race between the main thread and the event handler thread.

Launching and attaching is done in synchronous mode, so the corresponding requests return only after the respective operation has completed. In synchronous mode, no stop event is emitted. When the user provides launch or attach commands, however, we cannot use synchronous mode. To hide the resulting stop events, the default event handler thread ignored stop events until the "configurationDone" request had been answered. The problem with that is that there's no guarantee we have handled the stop event before we have handled the configurationDone request.
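For reference, "synchronous mode" here is the SB API debugger mode. A minimal sketch of the mode switch this paragraph describes (SetAsync and Attach are the real SB API; attach_info and pid are illustrative):

dap.debugger.SetAsync(false); // synchronous: Attach blocks until the
                              // operation completes
lldb::SBError error;
lldb::SBAttachInfo attach_info(pid); // pid: illustrative
dap.target.Attach(attach_info, error); // returns only once attached;
                                       // no stop event is emitted
dap.debugger.SetAsync(true); // re-enable asynchronous events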

Looking at the logs, you can see that we're still in the process of sending module events (which I recently added) when we receive, and respond to, the "configurationDone" request, before the event handler sees the launch or attach stop event. At that point dap.configuration_done_sent is true, so the stop event is sent to the client, which the test doesn't expect.

This PR fixes the raciness by using an atomic flag to signal that the next stop event should be ignored. An alternative approach could be to stop trying to hide the initial stop event, and instead report it to the client unconditionally. Instead of ignoring the stop for the asynchronous case, we could send a stop event after we're done handling the synchronous case.

Fixes #137660

@llvmbot
Member

llvmbot commented Apr 30, 2025

@llvm/pr-subscribers-lldb

Author: Jonas Devlieghere (JDevlieghere)

Changes: same as the PR description above.


Full diff: https://github.com/llvm/llvm-project/pull/137920.diff

8 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py (+1-1)
  • (modified) lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py (-1)
  • (modified) lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py (-1)
  • (modified) lldb/tools/lldb-dap/DAP.cpp (+2-1)
  • (modified) lldb/tools/lldb-dap/DAP.h (+1)
  • (modified) lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp (+4)
  • (modified) lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp (+1-1)
  • (modified) lldb/tools/lldb-dap/Handler/RequestHandler.cpp (+1)
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
index ee5272850b9a8..0d81b7d80102f 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
@@ -103,7 +103,7 @@ def verify_breakpoint_hit(self, breakpoint_ids):
                     match_desc = "breakpoint %s." % (breakpoint_id)
                     if match_desc in description:
                         return
-        self.assertTrue(False, "breakpoint not hit")
+        self.assertTrue(False, f"breakpoint not hit: stopped_events={stopped_events}")
 
     def verify_stop_exception_info(self, expected_description, timeout=timeoutval):
         """Wait for the process we are debugging to stop, and verify the stop
diff --git a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
index 6f70316821c8c..dcdfada2ff4c2 100644
--- a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
+++ b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
@@ -25,7 +25,6 @@ def spawn_and_wait(program, delay):
     process.wait()
 
 
-@skipIf
 class TestDAP_attach(lldbdap_testcase.DAPTestCaseBase):
     def set_and_hit_breakpoint(self, continueToExit=True):
         source = "main.c"
diff --git a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
index 51f62b79f3f4f..152e504af6d14 100644
--- a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
+++ b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
@@ -19,7 +19,6 @@
 import socket
 
 
-@skip
 class TestDAP_attachByPortNum(lldbdap_testcase.DAPTestCaseBase):
     default_timeout = 20
 
diff --git a/lldb/tools/lldb-dap/DAP.cpp b/lldb/tools/lldb-dap/DAP.cpp
index b593353110787..e191d8f2d3745 100644
--- a/lldb/tools/lldb-dap/DAP.cpp
+++ b/lldb/tools/lldb-dap/DAP.cpp
@@ -85,7 +85,8 @@ DAP::DAP(Log *log, const ReplMode default_repl_mode,
       exception_breakpoints(), focus_tid(LLDB_INVALID_THREAD_ID),
       stop_at_entry(false), is_attach(false),
       restarting_process_id(LLDB_INVALID_PROCESS_ID),
-      configuration_done_sent(false), waiting_for_run_in_terminal(false),
+      configuration_done_sent(false), ignore_next_stop(false),
+      waiting_for_run_in_terminal(false),
       progress_event_reporter(
           [&](const ProgressEvent &event) { SendJSON(event.ToJSON()); }),
       reverse_request_seq(0), repl_mode(default_repl_mode) {
diff --git a/lldb/tools/lldb-dap/DAP.h b/lldb/tools/lldb-dap/DAP.h
index 88eedb0860cf1..40d6400765a8d 100644
--- a/lldb/tools/lldb-dap/DAP.h
+++ b/lldb/tools/lldb-dap/DAP.h
@@ -189,6 +189,7 @@ struct DAP {
   // the old process here so we can detect this case and keep running.
   lldb::pid_t restarting_process_id;
   bool configuration_done_sent;
+  std::atomic<bool> ignore_next_stop;
   llvm::StringMap<std::unique_ptr<BaseRequestHandler>> request_handlers;
   bool waiting_for_run_in_terminal;
   ProgressEventReporter progress_event_reporter;
diff --git a/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
index 3ef87cbef873c..5778ae53c9b0b 100644
--- a/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
@@ -155,6 +155,9 @@ void AttachRequestHandler::operator()(const llvm::json::Object &request) const {
         std::string connect_url =
             llvm::formatv("connect://{0}:", gdb_remote_hostname);
         connect_url += std::to_string(gdb_remote_port);
+        // Connect remote will generate a stopped event even in synchronous
+        // mode.
+        dap.ignore_next_stop = true;
         dap.target.ConnectRemote(listener, connect_url.c_str(), "gdb-remote",
                                  error);
       } else {
@@ -166,6 +169,7 @@ void AttachRequestHandler::operator()(const llvm::json::Object &request) const {
     // Reenable async events
     dap.debugger.SetAsync(true);
   } else {
+    dap.ignore_next_stop = true;
     // We have "attachCommands" that are a set of commands that are expected
     // to execute the commands after which a process should be created. If there
     // is no valid process after running these commands, we have failed.
diff --git a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
index ce34c52bcc334..c53d7d5e2febf 100644
--- a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
@@ -167,7 +167,7 @@ static void EventThreadFunction(DAP &dap) {
             // stop events which we do not want to send an event for. We will
             // manually send a stopped event in request_configurationDone(...)
             // so don't send any before then.
-            if (dap.configuration_done_sent) {
+            if (!dap.ignore_next_stop.exchange(false)) {
               // Only report a stopped event if the process was not
               // automatically restarted.
               if (!lldb::SBProcess::GetRestartedFromEvent(event)) {
diff --git a/lldb/tools/lldb-dap/Handler/RequestHandler.cpp b/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
index b7d3c8ced69f1..e7601a929aa6c 100644
--- a/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
@@ -250,6 +250,7 @@ llvm::Error BaseRequestHandler::LaunchProcess(
     if (error.Fail())
       return llvm::make_error<DAPError>(error.GetCString());
   } else {
+    dap.ignore_next_stop = true;
     // Set the launch info so that run commands can access the configured
     // launch details.
     dap.target.SetLaunchInfo(launch_info);

@labath (Collaborator) left a comment

Someone once told me: "Whatever problem you're trying to solve, if you're using atomics, now you've got two problems."

I can't say I've always followed that advice, but I do think that atomics are rarely the right solution to a problem. And I think this is a good example of that.

What you want is not to suppress an event (whatever it might be). You want to suppress a very specific event (the one that is supposed to be generated by the action you're about to perform). That means that something needs to ensure that the other thread reads this flag after you've generated the event. If that's true, then you already have a happens-before relationship and the atomic is not necessary. If it isn't (there is no happens-before), then the atomic will not help.
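To illustrate the concern, a minimal sketch with hypothetical names (not the actual lldb-dap code):

std::atomic<bool> ignore_next_stop{false};

void LaunchTheProcess(); // hypothetical
void SendStoppedEvent(); // hypothetical

// Request thread:
void Launch() {
  ignore_next_stop = true; // (1) set the flag
  LaunchTheProcess();      // (2) eventually produces a stop event
}

// Event thread:
void HandleStopEvent() {
  // (3) exchange() consumes the flag atomically, but nothing pairs it
  // with the stop produced by (2): a stop that was already queued
  // before (1) would consume the flag instead, and the stop from (2)
  // would then be forwarded to the client.
  if (!ignore_next_stop.exchange(false))
    SendStoppedEvent();
}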

That said, why is ConnectRemote generating a stopped event in synchronous mode? Maybe that's the bug we should fix?

@ashgti
Contributor

ashgti commented Apr 30, 2025

Maybe we could block the DAP queue until we get the stop event, before we send the initialized event.

void LaunchRequestHandler::PostRun() const {
  if (dap.target.GetProcess().IsValid()) {
    // Attach happens when launching with runInTerminal.
    SendProcessEvent(dap, dap.is_attach ? Attach : Launch);
  }
  dap.SendJSON(CreateEventObject("initialized"));
}
This is where I'm thinking we could wait for the stopped event. That may also fix this by preventing us from handling the next DAP request until we get the 'stopped' event, which I think would arrive after the 'module' events are done processing.

@JDevlieghere
Member Author

Someone once told me: "Whatever problem you're trying to solve, if you're using atomics, now you've got two problems."

I can't say I've always followed that advice, but I do think that atomics are rarely the right solution to a problem. And I think this is a good example of that.

What you want is not to suppress an event (whatever it might be). You want to suppress a very specific event (the one that is supposed to be generated by the action you're about to perform). That means that something needs to ensure that the other thread reads this flag after you've generated the event. If that's true, then you already have a happens-before relationship and the atomic is not necessary. If it isn't (there is no happens-before), then the atomic will not help.

You're right. I had an implementation that used a mutex, but it made the same mistake as the atomic here. Luckily that's easily solvable by putting both the code that generates the stop and the code that consumes it inside the critical section.

That said, why is ConnectRemote generating a stopped event in synchronous mode? Maybe that's the bug we should fix?

That's a long-standing bug that comes up about once or twice a year. I spent some time looking into it at some point, but it's so long ago that I don't remember what the blocker was. That said, solving that isn't sufficient for the problem at hand, since we need to solve the asynchronous case as well.

@JDevlieghere
Member Author

I changed the implementation to use a mutex to address Pavel's concern. I'm even more unhappy with the current state of the patch. The configurationDone request (*) now blocks until the stop event has been processed. Unfortunately, that leads to a potential deadlock between the mutex protecting the data member representing the first stop and the API mutex: some events (like module and breakpoint events) require the target API mutex. The patch temporarily unlocks the API mutex to avoid this.

(*) RE @ashgti's point: we could move this logic into LaunchRequestHandler::PostRun() and AttachRequestHandler::PostRun() but the latter doesn't exist yet so for the current patch I kept it in the configurationDone request, although I agree that the launch and attach request would be a better place to do this.
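For context, the synchronization referenced below roughly has this shape (a sketch based on the snippet that follows; the PendingStopEvent value and the event-thread side are assumptions):

struct DAP {
  enum class FirstStopState { PendingStopEvent, IgnoredStopEvent, NoStopEvent };
  std::mutex first_stop_mutex;
  std::condition_variable first_stop_cv;
  FirstStopState first_stop_state = FirstStopState::NoStopEvent;
  // ...
};

// Event thread, when it swallows the first stop (assumed):
void OnFirstStop(DAP &dap) {
  {
    std::lock_guard<std::mutex> guard(dap.first_stop_mutex);
    dap.first_stop_state = DAP::FirstStopState::IgnoredStopEvent;
  }
  // Wake the configurationDone handler blocked in first_stop_cv.wait().
  dap.first_stop_cv.notify_all();
}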

Comment on lines 54 to 71

{
  // Temporarily unlock the API mutex to avoid a deadlock between the API
  // mutex and the first stop mutex.
  lock.unlock();

  // Block until we have either ignored the first stop event or we didn't
  // generate one because we attached or launched in synchronous mode.
  std::unique_lock<std::mutex> stop_lock(dap.first_stop_mutex);
  dap.first_stop_cv.wait(stop_lock, [&] {
    return dap.first_stop_state == DAP::FirstStopState::NoStopEvent ||
           dap.first_stop_state == DAP::FirstStopState::IgnoredStopEvent;
  });

  // Relock the API mutex.
  lock.lock();
}

Contributor

I think we should move this above sending the response to the configurationDone.

The initialized event triggers the rest of the adapter configs (e.g. sending all the breakpoints) and the configurationDone request should be the last in that chain.

In theory, we could even move the dap.WaitForProcessToStop(arguments.timeout); call out of the attach/launch requests and have that here instead, and fully expect the attach/launch flow to be async.

if (!lldb::SBProcess::GetRestartedFromEvent(event)) {
  SendStdOutStdErr(dap, process);
  SendThreadStoppedEvent(dap);
  {
Contributor

What if we didn't start the event thread until the configurationDone request was called?

Do we miss any events that would have happened before this point?

Or maybe we don't subscribe to process events until we get the configurationDone request? Could that alleviate the need for this lock?

Member Author

I think that's not an option. Every process must have an event handler and IIUC (@jimingham should be able to confirm) if you don't handle the event, the process will never get out of the "launching" or "attaching" state.

Collaborator

That is true, but it should only matter if someone actually cares about the process in the meantime. Provided we don't need to inspect the state, a sequence like this should be fine:

  1. cause a process event to be emitted
  2. start the listener thread
  3. listener thread receives the event

But I don't know if that's the case here...

Another potential option: If we are sure the event is going to be emitted (and the listener thread is not running), maybe we can wait for the event (synchronously) on the thread which is processing the request.
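A sketch of that last option with the public SB API (the listener variable and the timeout are assumptions; the SBListener/SBProcess calls are real API):

// Request thread, right after triggering the launch/attach, while the
// event thread is not yet consuming events.
lldb::SBEvent event;
if (listener.WaitForEventForBroadcasterWithType(
        /*num_seconds=*/5, dap.target.GetProcess().GetBroadcaster(),
        lldb::SBProcess::eBroadcastBitStateChanged, event)) {
  if (lldb::SBProcess::GetStateFromEvent(event) == lldb::eStateStopped) {
    // Swallow the initial stop here instead of forwarding it.
  }
}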

@ashgti (Contributor) left a comment

LGTM

@walter-erquinigo
Member

Just throwing out some ideas that might simplify this.
Is it possible to do the launching and attaching in asynchronous mode so that the stop events are always emitted?
Also, the configuration done event can be emitted at any time during initialization. It could even be emitted before the actual launching and attaching happen.

Would any of this help?

@ashgti
Copy link
Contributor

ashgti commented May 1, 2025

To clarify the DAP flow a little, see 'Launch Sequencing' in https://microsoft.github.io/debug-adapter-protocol/overview

Once the DAP server sends the response to the initialize request, the following happen in parallel:

  • The initialized event triggers the client sending setBreakpoints, etc. with the configurationDone request at the end of the chain.
  • launch or attach requests are sent and we send back a process event.

The flow from VSCode at least often ends up with:

  • initialize seq=1
  • launch seq=2
  • setBreakpoints seq=3
  • configurationDone seq=4

But these do not strictly have to be in that order.

The DAP server may respond to these requests asynchronously, but today lldb-dap is single-threaded and handles each of these requests in order.

Looking at the failing test https://lab.llvm.org/buildbot/#/builders/18/builds/15174/steps/6/logs/stdio, we had the following sequence:

{"command":"initialize", ...}
{"command":"attach", ...}
{"command":"setBreakpoints", ...}
{"command":"configurationDone", ...}
{"event":"stopped", ...}

But that stopped event is wrong. This attach did not set "stopOnEntry": true, so the process should have continued at that point.

@kusmour
Contributor

kusmour commented May 1, 2025

Also, the configuration done event can be emitted at any time during initialization. It could even be emitted before the actual launching and attaching happen.

No, this is something we can control. The configurationDone request is only sent after the initialized event, and lldb-dap only sends that event after the launch/attach response. (This is also not compliant with the DAP spec, see below.)

The DAP server can respond to these events asynchronously but today the lldb-dap is single threaded for handling each of these responses.

It's important to point out that in the DAP specification:

After the response to configurationDone is sent, the debug adapter may respond to the launch or attach request, and then the debug session has started.

Technically the response to launch/attach should be the end of the chain, but because of the single-threaded handling we can't do this. I don't think this is impacting the client side (at least for VS Code, it doesn't seem to do anything special upon a launch/attach response), but it does suggest that lldb-dap should be able to handle requests and events asynchronously.

@ashgti
Contributor

ashgti commented May 1, 2025

Technically the response to launch/attach should be the end of the chain.

That's not how it's implemented in VS Code, at least.

@kusmour
Contributor

kusmour commented May 1, 2025

Technically the response to launch/attach should be the end of the chain.

That's not how it's implemented in VS Code, at least.

Yes, but that code doesn't define how the adapter should respond to those requests; the specification does. What you showed above doesn't conflict with the response being the end of the launch sequence and the mark that "the debug session has started". They don't need to coordinate with the initialized event, and they shouldn't. The adapter should use initialized to signal that it can take breakpoint requests.

@JDevlieghere
Member Author

JDevlieghere commented May 1, 2025

Is it possible to do the launching and attaching in asynchronous mode so that the stop events are always emitted?

Yes, that's the alternative I mentioned in the PR description:

An alternative approach could be to stop trying to hide the initial stop event, and instead report it to the client unconditionally. Instead of ignoring the stop for the asynchronous case, we could send a stop event after we're done handling the synchronous case.

Though I need to confirm that this would be compliant with the spec. I also think that, from a user experience point of view, you probably don't want to see the stop (similar to how LLDB doesn't promote every private stop to a public stop). When attaching, we always stop. For launching, we hide the initial stop unless you requested stop-at-entry.

@JDevlieghere
Member Author

After the response to configurationDone is sent, the debug adapter may respond to the launch or attach request, and then the debug session has started.

Technically the response to launch/attach should be the end of the chain, but because of the single-threaded handling we can't do this. I don't think this is impacting the client side (at least for VS Code, it doesn't seem to do anything special upon a launch/attach response), but it does suggest that lldb-dap should be able to handle requests and events asynchronously.

This is really interesting, thanks for pointing that out @kusmour. While it's true that the single-threaded nature of how we handle requests makes it tricky to delay the response, we could delay handling the request altogether until configurationDone has been sent. For example, with the current queue approach, we could park the launch and attach requests off to the side until the configuration is done, as in the sketch below.
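A hypothetical sketch of that parking idea (none of these names are the actual lldb-dap queue types; Dispatch stands in for the real request handling):

std::vector<llvm::json::Object> deferred_requests;
bool configuration_done = false;

void OnRequest(const llvm::json::Object &request) {
  llvm::StringRef command = request.getString("command").value_or("");
  if (!configuration_done && (command == "launch" || command == "attach")) {
    deferred_requests.push_back(request); // park a copy for later
    return;
  }
  Dispatch(request); // hypothetical dispatch to the request handlers
  if (command == "configurationDone") {
    configuration_done = true;
    for (llvm::json::Object &parked : deferred_requests)
      Dispatch(parked); // now safe to handle launch/attach
    deferred_requests.clear();
  }
}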

JDevlieghere added a commit to JDevlieghere/llvm-project that referenced this pull request May 1, 2025
This PR changes how we treat the launch sequence in lldb-dap.

 - Send the initialized event after we finish handling the initialize
   request, rather than after we finish attaching or launching.
 - Delay handling the launch and attach request until we have received
   the configurationDone request. The latter is now largely a NO-OP and
   only exists to signal lldb-dap that it can handle the launch and
   attach requests.
 - Add synchronization for ignoring the first stop event. I originally
   wanted to do all the launching and attaching in asynchronous mode,
   but that's a little bit trickier than expected. On macOS, I'm getting
   an additional stop when we hit the dyld breakpoint. This might be
   something I come back to, but for now I'm keeping the old behavior
   with the proposed synchronization from llvm#137920.

Background: https://discourse.llvm.org/t/reliability-of-the-lldb-dap-tests/86125
@labath
Collaborator

labath commented May 2, 2025

I'm even more unhappy with the current state of the patch.

I take it you're referring to the uncritical sections. What if we don't lock the mutex automatically for each handler (but leave it up to them to figure out when/if to lock it)? Or we could have the handler declare in some way whether it wants to run with the lock held?

Linked issue: [LLDB] Flakey DAP tests on builders