Thanks to visit codestin.com
Credit goes to github.com

Skip to content

gdb: add "catch hiperr" to break on HIP API errors#129

Open
amd-shahab wants to merge 1 commit into
amd-stagingfrom
users/shvahedi/hip-catch-errs
Open

gdb: add "catch hiperr" to break on HIP API errors#129
amd-shahab wants to merge 1 commit into
amd-stagingfrom
users/shvahedi/hip-catch-errs

Conversation

@amd-shahab
Copy link
Copy Markdown
Contributor

@amd-shahab amd-shahab commented May 15, 2026

This change relies on "Add __hipOnError debugger hook for HIP API failures" PR to be functional. Although, without it, ROCgdb won't be broken. Please read the commit message for the description of this change itself.

--

What

This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:

(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (hip error)
hipErrorInvalidDevice (101): invalid device ordinal

Moreover, a $_hiperr convenience variable is added that holds
the error code. For further details, please see the updated
documentation in this change.

Design

Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.

The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.

There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer. The expected format of this struct is:

uint32_t version; // at this point, only 1 is supported
uint32_t code; // the numerical error code
const char *name; // name of the error
const char *desc; // a phrase about the error

The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.

Test

A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:

  1. (Pending) catchpoints work.
  2. Multiple catchpoints in one execution are reported accordingly.
  3. Errors of indirect HIP calls like "kenel<<<...>>>()" are caught.
  4. The convenience variable can be used for conditional situations.
  5. Conditional and temporary catchpoints work as expected.

ROCM-1548

@amd-shahab amd-shahab self-assigned this May 15, 2026
@amd-shahab amd-shahab requested a review from a team as a code owner May 15, 2026 18:55
@lumachad
Copy link
Copy Markdown
Collaborator

@amd-shahab If this shouldn't be tested yet and isn't mergeable until the dependent PR, feel free to label it "ci:skip".

Copy link
Copy Markdown
Collaborator

@lancesix lancesix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First quick first pass of comments.

One question regarding the test: what kind of guarantee do we have that the error code and error string are going to be stable over time? Let's imagine that the hip runtime decides to localize the error string (not the name). Couldn't that be a legitimate thing to do, but it would break the test, would it?

Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
@amd-shahab amd-shahab added the ci:skip Skip all pre-commit / CI jobs while the label is up label May 18, 2026
Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/amd64-tdep.c
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 823bf4f to 745f0bc Compare May 18, 2026 10:43
@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 18, 2026

  • error string not part of API

  • __hipOnError -> @samp{__hipOnError}

  • "int main" -> "int\nmain"

  • gdb_test -> gdb_test_on_output

  • target_read_stack -> target_read_memory

  • drop the "ignored" param name

  • Else -> Otherwise

  • NULL -> nullptr

  • removed dangling "backtrace" command from the documentation

  • !arg -> arg != nullptr

  • binay -> binary

  • relvant -> relevant

  • bool print_one (const bp_location **) -> bool print_one (const bp_location **last_loc)

  • rename variable "x" to "dontcare"

  • comment on true/false return of amd64_fetch_hiperr_parameters()

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 745f0bc to 1da282b Compare May 18, 2026 10:52
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • emit a warning when the "version" is not expected.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 1da282b to d616325 Compare May 18, 2026 12:37
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • make the invocation of the test fully argument dependent
  • add comments for every method in break-catch-hiperr.c

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from d616325 to 4b40f98 Compare May 18, 2026 14:19
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • check for the existence of __hipOnError symbol. If not found, skip the test.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 4b40f98 to 5005bc1 Compare May 18, 2026 19:21
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • check for the right frame (__hipOnError)

Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 5 times, most recently from 81df767 to 5a4629e Compare May 19, 2026 14:34
@amd-shahab amd-shahab removed the ci:skip Skip all pre-commit / CI jobs while the label is up label May 19, 2026
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 3 times, most recently from b2e7ebe to 3ff9b7d Compare May 20, 2026 14:14
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 257e870 to 66a737d Compare May 21, 2026 12:47
Comment thread gdb/gdbarch_components.py Outdated
Comment thread gdb/gdbarch_components.py Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/amd64-tdep.h Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/doc/gdb.texinfo Outdated
@aktemur
Copy link
Copy Markdown
Contributor

aktemur commented May 26, 2026

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

@lancesix
Copy link
Copy Markdown
Collaborator

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

I have not tested, but am highly confident that this will trigger the catch point.

@aktemur
Copy link
Copy Markdown
Contributor

aktemur commented May 26, 2026

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

I have not tested, but am highly confident that this will trigger the catch point.

I also think the catchpoint will be triggered but just in case I wanted make sure GDB can handle the infcall interruption correctly. It would be great to test with and without set unwind-on-signal on. (I cannot test because I don't have the hipOnError symbol.

@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 27, 2026

If I set the feature with "catch hiperr" and then do an infcall with "print hipSetDevice(98765)", what's GDB's behavior? Can you please check?

As Lancelot predicted:

(gdb) catch hiperr
Catchpoint 2 (HIP error)
(gdb) p hipSetDevice(98765)
[New Thread 0x7ffff4fff6c0 (LWP 8965) (id 2)]
[New Thread 0x7ffff47fe6c0 (LWP 8966) (id 3)]
[New Thread 0x7ffff3cff6c0 (LWP 8967) (id 4)]
[Thread 0x7ffff3cff6c0 (LWP 8967) (id 4) exited]

Thread 1 "hip-catch-error" hit Catchpoint 2 (HIP error)
HIP API call failed with error hipErrorInvalidDevice (101): invalid device ordinal

The $_hiperr holds the error number.
A "backtrace" will show the call stack.
The program being debugged stopped while in a function called from GDB.
Evaluation of the expression containing the function
(hipSetDevice(int)) will be abandoned.
When the function is done executing, GDB will silently stop.


(gdb) bt
#0  __hipOnError (err_info=0x7fffffffd7a0) at /src/clr/hipamd/src/hip_error.cpp:11
#1  0x00007ffff62863e9 in hip::hipSetDevice (device=98765) at /src/clr/hipamd/src/hip_device_runtime.cpp:784
#2  0x00007ffff668440f in hipSetDevice (deviceId=98765) at /src/clr/hipamd/src/hip_table_interface.cpp:2133
#3  <function called from gdb>
#4  main (argc=1, argv=0x7fffffffe108) at /src/rocgdb/gdb/testsuite/gdb.rocm/hip-catch-errors.cpp:26

With set unwind-on-signal on (default is off), we get the same output. "Signal"s have nothing to do with these errors.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 2 times, most recently from 5c918d6 to 0ac52b6 Compare May 27, 2026 14:42
@amd-shahab
Copy link
Copy Markdown
Contributor Author

amd-shahab commented May 27, 2026

  • In catchpoint output, refer to "$_hiperr" as a "convenience variable"
  • In catchpoint output, change the "backtrace" section to the suggested sentence
  • gdb.texinfo updated to reflect the latest output.
  • Use "$::hip_" instead of "global hip_"
  • Use ATTRIBUTE_UNUSED
  • rename the fetch_hiperr parameter names to err_{no,name,str}
  • a few typos are fixed

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 0ac52b6 to 61d1b84 Compare May 28, 2026 12:48
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • test: Add an implicit and faulty kernel<<<1, big_size>>> call to HIP API as test case
  • test: Make the routine for getting reference values and causing the errors the same

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 61d1b84 to 11fde27 Compare May 28, 2026 12:57
Copy link
Copy Markdown
Collaborator

@palves palves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit log:

#1 At least the GDB example output is stale.

#2 "at this point, only 1 is supported" should mention backward compatibility model.

#3 The "The code is devised in such a way that" paragraph sounds like something that should live as a comment in the sources, too.

#4 Typos:
"The whole mechanism work by setting" => "works"
"There's a architect dependent section" => "there is an architecture-dependent section" (also "a" => "an")
"kenel<<<...>>>()" => "kernel<<<...>>>()"

#5 Pick one term and use it consistently in commit log, code messages, and docs.

Currently:

  • break-catch-hiperr.c says "error number"
  • commit log says "error code"
  • docs say "the numerical error code" and "holds the error number"

Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/amd64-linux-tdep.c Outdated
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 2 times, most recently from f3b2ea4 to 9d43e62 Compare June 1, 2026 20:27
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • catch code: Add space before : in struct hiperr_catchpoint: public code_breakpoint
  • catch code: get rid of unnecessary line break when adding the "catch hiperr" command
  • catch code: Remove mention of backtrace when the catchpoint is hit
  • catch code: add a comment that unsucessful fetching of parameters is not a deal breaker
  • amd64-tdep.c: "not recognized" -> "not supported"
  • amd64-tdep.c: fix bad indentation on line starting with pulongest
  • test: anchor the $str in regex HIP API call failed with error $name \\($code\\): $str.*
  • test: use lassign instead of repeated lindexs
  • test: "same codes are executed" -> "same code is executed"
  • test: remove the space in function calls like ... hipGetErrorName ($err)
  • doc: use @value{GDBP} instead of (gdb) in example outputs
  • doc: turn hip errors in example outputs to HIP errors
  • doc: The @samp{$_hiperr} can -> The @samp{$_hiperr} convenience variable can
  • commit message:
    . fix typos
    . update the stale outputs
    . use error number, name, and string (same in production and test code)
    . mention backward compatibility of the versions

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 9d43e62 to 4a5f4d2 Compare June 2, 2026 15:33
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • test: combine temporary and non-temporary procedures
  • test: generate the reference values in the test program itself and read them (instead of inferior calls)

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 4a5f4d2 to 93caa8d Compare June 2, 2026 20:19
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • code: fetch_hiperr_parameters() returns the parameters in an optional struct.
  • code: fetch_hiperr_parameters() gets the frame as input
  • code: amd64-tde.c reads the structure in one go.
  • code: ATTRIBUTE_UNUSED was replaced with commented out variables

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 2 times, most recently from 7dcd60c to 0ef26a8 Compare June 3, 2026 11:05
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • Add intro comment for every function/method/procedures in break-catch-hiperr.c and hip-catch-errors.exp
  • Use current_program_space instead of this->pspace when calling the update_breakpoint_locations ()

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 0ef26a8 to 311285b Compare June 3, 2026 11:26
@amd-shahab
Copy link
Copy Markdown
Contributor Author

  • Use internationalization for the string_printf () string
  • Comment more about the version in amd64-tdep.c

@amd-shahab amd-shahab requested a review from palves June 3, 2026 11:47
Copy link
Copy Markdown
Collaborator

@palves palves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. Ran out of time before I reached the bottomm.

and have the error parameters recorded. GDB can use these recorded
values as references to verify the catchpoints outputs. */

/* Dispatch a kernel, implicitly, with an oversized dimension. */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does implicitly in "Dispatch a kernel, implicitly," mean?

Copy link
Copy Markdown
Contributor Author

@amd-shahab amd-shahab Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kernel<<<...>>> () becomes hipLaunchKernel (... more args ...) under the hood.

Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/amd64-tdep.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/defs.h Outdated
What
----
This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned.  An abridged sample of execution looks
like this:

(gdb) catch hiperr
(gdb) run
Thread 1 "prog" hit Catchpoint 1 (hip error)
HIP API call failed with error hipErrorInvalidDevice (101):
invalid device ordinal

Moreover, a $_hiperr convenience variable is added that holds
the error number.  For further details, please see the updated
documentation in this change.

Design
------
Most of the implementation is in the newly added module called
break-catch-hiperr.c.  The naming and its context follows the
break-catch-throw.c module.

The whole mechanism works by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library.  Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.

There's an architecture-dependent section that is responsible
for extracting the error parameters (error number, error name,
error string) from a single argument passed to the __hipOnError()
function as a structure pointer.  The expected format of this
struct is:

  uint32_t    version;   // for now version is 1; future ones will
                         // be backward compatible by means of
                         // extending this structure
  uint32_t    err_no;    // the error number
  const char *err_name;  // name of the error
  const char *err_str;   // a phrase about the error

The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered.  It will
affect the reported data of the catchpoint though.

Test
----
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:

1. (Pending) catchpoints work.
2. Multiple catchpoints in one execution are reported accordingly.
3. Errors of indirect HIP calls like "kernel<<<...>>>()" are caught.
4. The convenience variable can be used for conditional situations.
5. Conditional and temporary catchpoints work as expected.

Change-Id: Iff0808350cf7856da17ff07be8ed10c76f9370f8
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 311285b to e749d12 Compare June 3, 2026 19:25
@amd-shahab amd-shahab requested a review from palves June 4, 2026 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants