gdb: add "catch hiperr" to break on HIP API errors#129
Conversation
|
@amd-shahab If this shouldn't be tested yet and isn't mergeable until the dependent PR, feel free to label it "ci:skip". |
lancesix
left a comment
There was a problem hiding this comment.
First quick first pass of comments.
One question regarding the test: what kind of guarantee do we have that the error code and error string are going to be stable over time? Let's imagine that the hip runtime decides to localize the error string (not the name). Couldn't that be a legitimate thing to do, but it would break the test, would it?
823bf4f to
745f0bc
Compare
|
745f0bc to
1da282b
Compare
|
1da282b to
d616325
Compare
|
d616325 to
4b40f98
Compare
|
4b40f98 to
5005bc1
Compare
|
81df767 to
5a4629e
Compare
b2e7ebe to
3ff9b7d
Compare
257e870 to
66a737d
Compare
|
If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check? |
I have not tested, but am highly confident that this will trigger the catch point. |
I also think the catchpoint will be triggered but just in case I wanted make sure GDB can handle the infcall interruption correctly. It would be great to test with and without |
As Lancelot predicted: With |
5c918d6 to
0ac52b6
Compare
|
0ac52b6 to
61d1b84
Compare
|
61d1b84 to
11fde27
Compare
palves
left a comment
There was a problem hiding this comment.
In the commit log:
#1 At least the GDB example output is stale.
#2 "at this point, only 1 is supported" should mention backward compatibility model.
#3 The "The code is devised in such a way that" paragraph sounds like something that should live as a comment in the sources, too.
#4 Typos:
"The whole mechanism work by setting" => "works"
"There's a architect dependent section" => "there is an architecture-dependent section" (also "a" => "an")
"kenel<<<...>>>()" => "kernel<<<...>>>()"
#5 Pick one term and use it consistently in commit log, code messages, and docs.
Currently:
- break-catch-hiperr.c says "error number"
- commit log says "error code"
- docs say "the numerical error code" and "holds the error number"
f3b2ea4 to
9d43e62
Compare
|
9d43e62 to
4a5f4d2
Compare
|
4a5f4d2 to
93caa8d
Compare
|
7dcd60c to
0ef26a8
Compare
|
0ef26a8 to
311285b
Compare
|
palves
left a comment
There was a problem hiding this comment.
Some comments. Ran out of time before I reached the bottomm.
| and have the error parameters recorded. GDB can use these recorded | ||
| values as references to verify the catchpoints outputs. */ | ||
|
|
||
| /* Dispatch a kernel, implicitly, with an oversized dimension. */ |
There was a problem hiding this comment.
What does implicitly in "Dispatch a kernel, implicitly," mean?
There was a problem hiding this comment.
kernel<<<...>>> () becomes hipLaunchKernel (... more args ...) under the hood.
What
----
This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:
(gdb) catch hiperr
(gdb) run
Thread 1 "prog" hit Catchpoint 1 (hip error)
HIP API call failed with error hipErrorInvalidDevice (101):
invalid device ordinal
Moreover, a $_hiperr convenience variable is added that holds
the error number. For further details, please see the updated
documentation in this change.
Design
------
Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.
The whole mechanism works by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.
There's an architecture-dependent section that is responsible
for extracting the error parameters (error number, error name,
error string) from a single argument passed to the __hipOnError()
function as a structure pointer. The expected format of this
struct is:
uint32_t version; // for now version is 1; future ones will
// be backward compatible by means of
// extending this structure
uint32_t err_no; // the error number
const char *err_name; // name of the error
const char *err_str; // a phrase about the error
The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.
Test
----
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:
1. (Pending) catchpoints work.
2. Multiple catchpoints in one execution are reported accordingly.
3. Errors of indirect HIP calls like "kernel<<<...>>>()" are caught.
4. The convenience variable can be used for conditional situations.
5. Conditional and temporary catchpoints work as expected.
Change-Id: Iff0808350cf7856da17ff07be8ed10c76f9370f8
311285b to
e749d12
Compare
This change relies on "Add __hipOnError debugger hook for HIP API failures" PR to be functional. Although, without it, ROCgdb won't be broken. Please read the commit message for the description of this change itself.
--
What
This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:
(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (hip error)
hipErrorInvalidDevice (101): invalid device ordinal
Moreover, a $_hiperr convenience variable is added that holds
the error code. For further details, please see the updated
documentation in this change.
Design
Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.
The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.
There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer. The expected format of this struct is:
uint32_t version; // at this point, only 1 is supported
uint32_t code; // the numerical error code
const char *name; // name of the error
const char *desc; // a phrase about the error
The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.
Test
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:
ROCM-1548