[lldb-dap] Fix raciness in launch and attach tests #137920
Conversation
@llvm/pr-subscribers-lldb

Author: Jonas Devlieghere (JDevlieghere)

Changes

We've gotten multiple reports of the launch and attach tests being flaky, both in CI and locally when running the tests. I believe the flakiness is due to a race between the main thread and the event handler thread.

Launching and attaching is done in synchronous mode, so the corresponding requests return only after the respective operation has completed. In synchronous mode, no stop event is emitted. When we have launch or attach commands, we cannot use synchronous mode. To hide the stop events in this case, the default event handler thread was ignoring stop events before the "configuration done" request was answered. The problem with that is that there's no guarantee that we have handled the stop event before we have handled the configuration done request.

Looking at the logs, you can see that we're still in the process of sending module events (which I recently added) when we receive, and respond to, the "configuration done" request, before seeing the launch or attach stop event. At that point dap.configuration_done_sent is true and the event is sent, which the test doesn't expect.

This PR fixes the raciness by using an atomic flag to signal that the next stop event should be ignored. An alternative approach would be to stop trying to hide the initial stop event and instead report it to the client unconditionally: rather than ignoring the stop in the asynchronous case, we could send a stop event after we're done handling the synchronous case.

Fixes #137660

Full diff: https://github.com/llvm/llvm-project/pull/137920.diff

8 Files Affected:
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
index ee5272850b9a8..0d81b7d80102f 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/lldbdap_testcase.py
@@ -103,7 +103,7 @@ def verify_breakpoint_hit(self, breakpoint_ids):
match_desc = "breakpoint %s." % (breakpoint_id)
if match_desc in description:
return
- self.assertTrue(False, "breakpoint not hit")
+ self.assertTrue(False, f"breakpoint not hit: stopped_events={stopped_events}")
def verify_stop_exception_info(self, expected_description, timeout=timeoutval):
"""Wait for the process we are debugging to stop, and verify the stop
diff --git a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
index 6f70316821c8c..dcdfada2ff4c2 100644
--- a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
+++ b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attach.py
@@ -25,7 +25,6 @@ def spawn_and_wait(program, delay):
process.wait()
-@skipIf
class TestDAP_attach(lldbdap_testcase.DAPTestCaseBase):
def set_and_hit_breakpoint(self, continueToExit=True):
source = "main.c"
diff --git a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
index 51f62b79f3f4f..152e504af6d14 100644
--- a/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
+++ b/lldb/test/API/tools/lldb-dap/attach/TestDAP_attachByPortNum.py
@@ -19,7 +19,6 @@
import socket
-@skip
class TestDAP_attachByPortNum(lldbdap_testcase.DAPTestCaseBase):
default_timeout = 20
diff --git a/lldb/tools/lldb-dap/DAP.cpp b/lldb/tools/lldb-dap/DAP.cpp
index b593353110787..e191d8f2d3745 100644
--- a/lldb/tools/lldb-dap/DAP.cpp
+++ b/lldb/tools/lldb-dap/DAP.cpp
@@ -85,7 +85,8 @@ DAP::DAP(Log *log, const ReplMode default_repl_mode,
exception_breakpoints(), focus_tid(LLDB_INVALID_THREAD_ID),
stop_at_entry(false), is_attach(false),
restarting_process_id(LLDB_INVALID_PROCESS_ID),
- configuration_done_sent(false), waiting_for_run_in_terminal(false),
+ configuration_done_sent(false), ignore_next_stop(false),
+ waiting_for_run_in_terminal(false),
progress_event_reporter(
[&](const ProgressEvent &event) { SendJSON(event.ToJSON()); }),
reverse_request_seq(0), repl_mode(default_repl_mode) {
diff --git a/lldb/tools/lldb-dap/DAP.h b/lldb/tools/lldb-dap/DAP.h
index 88eedb0860cf1..40d6400765a8d 100644
--- a/lldb/tools/lldb-dap/DAP.h
+++ b/lldb/tools/lldb-dap/DAP.h
@@ -189,6 +189,7 @@ struct DAP {
// the old process here so we can detect this case and keep running.
lldb::pid_t restarting_process_id;
bool configuration_done_sent;
+ std::atomic<bool> ignore_next_stop;
llvm::StringMap<std::unique_ptr<BaseRequestHandler>> request_handlers;
bool waiting_for_run_in_terminal;
ProgressEventReporter progress_event_reporter;
diff --git a/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
index 3ef87cbef873c..5778ae53c9b0b 100644
--- a/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp
@@ -155,6 +155,9 @@ void AttachRequestHandler::operator()(const llvm::json::Object &request) const {
std::string connect_url =
llvm::formatv("connect://{0}:", gdb_remote_hostname);
connect_url += std::to_string(gdb_remote_port);
+ // Connect remote will generate a stopped event even in synchronous
+ // mode.
+ dap.ignore_next_stop = true;
dap.target.ConnectRemote(listener, connect_url.c_str(), "gdb-remote",
error);
} else {
@@ -166,6 +169,7 @@ void AttachRequestHandler::operator()(const llvm::json::Object &request) const {
// Reenable async events
dap.debugger.SetAsync(true);
} else {
+ dap.ignore_next_stop = true;
// We have "attachCommands" that are a set of commands that are expected
// to execute the commands after which a process should be created. If there
// is no valid process after running these commands, we have failed.
diff --git a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
index ce34c52bcc334..c53d7d5e2febf 100644
--- a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
@@ -167,7 +167,7 @@ static void EventThreadFunction(DAP &dap) {
// stop events which we do not want to send an event for. We will
// manually send a stopped event in request_configurationDone(...)
// so don't send any before then.
- if (dap.configuration_done_sent) {
+ if (!dap.ignore_next_stop.exchange(false)) {
// Only report a stopped event if the process was not
// automatically restarted.
if (!lldb::SBProcess::GetRestartedFromEvent(event)) {
diff --git a/lldb/tools/lldb-dap/Handler/RequestHandler.cpp b/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
index b7d3c8ced69f1..e7601a929aa6c 100644
--- a/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/RequestHandler.cpp
@@ -250,6 +250,7 @@ llvm::Error BaseRequestHandler::LaunchProcess(
if (error.Fail())
return llvm::make_error<DAPError>(error.GetCString());
} else {
+ dap.ignore_next_stop = true;
// Set the launch info so that run commands can access the configured
// launch details.
dap.target.SetLaunchInfo(launch_info);
Someone once told me: "Whatever problem you're trying to solve, if you're using atomics, now you've got two problems."
I can't say I've always followed that advice, but I do think that atomics are rarely the right solution to a problem. And I think this is a good example of that.
What you want is not to suppress a event (whatever it might be). You want to suppress a very specific event (that is supposed to be generated by the action you're about to perform). That means that something needs to ensure that the other thread reads this flag after you've generated the event. If that's true, then you already have a happens-before relationship and the atomic is not necessary. If it isn't (there is no happens-before), then the atomic will not help.
That said, why is ConnectRemote generating a stopped event in synchronous mode? Maybe that's the bug we should fix?
We could block the DAP queue until we get the stop event before we send the response (see lldb/tools/lldb-dap/Handler/LaunchRequestHandler.cpp, lines 70 to 77 in d7f096e).
You're right. I have an implementation that used a mutex but it made the same mistake I have here with the atomic. Luckily that's easily solvable by putting the code that generates the stop and consumes the stop in the critical section.
That's a long-standing bug that comes up about once or twice a year. I spent some time looking into it at some point, but it's so long ago that I don't remember what the blocker was. That said, solving that isn't sufficient for the problem at hand, since we need to solve the asynchronous case as well.
Force-pushed bedc08b to 1bdae98.
I changed the implementation to use a mutex and address Pavel's concern. I'm even more unhappy with the current state of the patch. RE @ashgti's point: we could move this logic into
{
  // Temporarily unlock the API mutex to avoid a deadlock between the API
  // mutex and the first stop mutex.
  lock.unlock();

  // Block until we have either ignored the first stop event or we didn't
  // generate one because we attached or launched in synchronous mode.
  std::unique_lock<std::mutex> stop_lock(dap.first_stop_mutex);
  dap.first_stop_cv.wait(stop_lock, [&] {
    return dap.first_stop_state == DAP::FirstStopState::NoStopEvent ||
           dap.first_stop_state == DAP::FirstStopState::IgnoredStopEvent;
  });

  // Relock the API mutex.
  lock.lock();
}
I think we should move this above sending the response to configurationDone. The initialized event triggers the rest of the adapter configuration (e.g. sending all the breakpoints) and the configurationDone request should be the last in that chain.

In theory, you could even move the dap.WaitForProcessToStop(arguments.timeout) call out of the attach/launch requests and have that here instead, and fully expect the attach/launch flow to be async.
if (!lldb::SBProcess::GetRestartedFromEvent(event)) {
  SendStdOutStdErr(dap, process);
  SendThreadStoppedEvent(dap);
What if we didn't start the event thread until the configurationDone request was called? Do we miss any events that would have happened before this point?

Or maybe we don't subscribe to process events until we get the configurationDone request? Could that alleviate the need for this lock?
I think that's not an option. Every process must have an event handler and IIUC (@jimingham should be able to confirm) if you don't handle the event, the process will never get out of the "launching" or "attaching" state.
That is true, but it should only matter if someone actually cares about the process in the mean time. Provided we don't need to inspect the state, a sequence like this should be fine:
- cause a process event to be emitted
- start the listener thread
- listener thread receives the event
But I don't know if that's the case here...
Another potential option: If we are sure the event is going to be emitted (and the listener thread is not running), maybe we can wait for the event (synchronously) on the thread which is processing the request.
LGTM
Just throwing out some ideas that might simplify this. Would any of this help?
To clarify the DAP flow a little, see 'Launch Sequencing' in https://microsoft.github.io/debug-adapter-protocol/overview. Once the DAP server sends the response to the

The flow from VSCode at least often ends up with:

But these do not strictly have to be in that order. The DAP server can respond to these events asynchronously, but today lldb-dap is single-threaded in handling each of these responses.

Looking at the failing test https://lab.llvm.org/buildbot/#/builders/18/builds/15174/steps/6/logs/stdio, we had the following sequence:
But that stopped event here is wrong. This attach did not set
No, this is something we can control.
It's important to point out that in the DAP specification:

Technically the response to launch/attach should be the end of the chain. But because of the single-threaded handling we can't do this. I don't think this is impacting the client side (at least for VSCode, it doesn't seem to do anything special upon a launch/attach response). But this suggests that lldb-dap should be able to handle the requests/events asynchronously.
That's not how it's implemented in VS Code, at least:
Yes, but that code doesn't define how an adapter should respond to those requests; the specification does. What you showed above doesn't conflict with having the response be the end of the launch sequence and the mark of "the debug session has started". They don't need to coordinate with
Yes, that's the alternative I mentioned in the PR description:

Though I need to confirm that this would be compliant with the spec. I also think that, from a user experience point of view, you probably don't want to see the stop (similar to how LLDB doesn't promote every private stop to a public stop). When attaching, we always stop. For launching, we hide the initial stop unless you requested stop-at-entry.
This is really interesting, thanks for pointing that out @kusmour. While it's true that the single-threaded nature of how we handle requests makes it tricky to delay the response, we could delay handling the request altogether until configurationDone has been received. For example, with the current queue approach, we could park the launch and attach requests off to the side until the configuration is done.
This PR changes how we treat the launch sequence in lldb-dap:

- Send the initialized event after we finish handling the initialize request, rather than after we finish attaching or launching.
- Delay handling the launch and attach requests until we have received the configurationDone request. The latter is now largely a NO-OP and only exists to signal lldb-dap that it can handle the launch and attach requests.
- Add synchronization for ignoring the first stop event.

I originally wanted to do all the launching and attaching in asynchronous mode, but that's a little bit trickier than expected. On macOS, I'm getting an additional stop when we hit the dyld breakpoint. This might be something I come back to, but for now I'm keeping the old behavior with the proposed synchronization from llvm#137920.

Background: https://discourse.llvm.org/t/reliability-of-the-lldb-dap-tests/86125
I take it you're referring to the uncritical sections. What if we don't lock the mutex automatically for each handler (but leave it up to them to figure out when/if to lock it)? Or we could have the handler declare in some way whether it wants to run with the lock held?