inlineasm: Add inline assembler support for RV32. #15714

Merged: dpgeorge merged 6 commits into micropython:master from agatti:inlineasm-riscv32 on Jan 2, 2025


Contributor

@agatti agatti commented Aug 25, 2024

Summary

This commit adds support for writing inline assembler functions when targeting a RV32IMC processor.

An almost complete test suite is also present, covering 95% of the opcodes (leaving out things like breakpoint triggers and the like).

Testing

This was tested both on QEMU-RISCV and on an ESP32-C3 after manually enabling the inline assembler emitter in mpconfigport.h.

Due to the "interesting" limitations on immediate sizes and register sets, which vary from opcode to opcode, the code has a lot of checks for invalid input, along with (hopefully) helpful error messages as part of the exceptions it raises. However, unless I'm doing something really wrong, I cannot seem to trap those exceptions in tests. Being able to do so would help with further validation of the range and register checks.

Trade-offs and Alternatives

As mentioned in the commit message, this bit of code takes up a fair amount of .rodata space (last time I checked it was something like 4.5k), so by default it is enabled only on targets that are not microcontrollers. Since the Pico2 port has Arm inline assembler support enabled, this is also enabled for said port to maintain feature parity across MCU architectures.

Syntax-wise, there are a few things that may need a bit of explanation

  • Compressed opcodes start with the c. prefix, which I replaced with c_ - so for example c.srai turns into c_srai.
  • When defining load/store opcodes the immediate is in the last operand, so lw a0, 10(a1) is defined as lw(a0, a1, 10). This is to simplify syntax parsing, as I don't think the original syntax can be parsed as such, and should be easier to grasp for folks coming from Arm assembler.
  • Since writing RISC-V assembly gets painful quickly if you have to deal with inline constants, two meta opcodes have been implemented (following GAS), namely li Rd, Imm32 and la Rd, Label. The first loads the given immediate into the given register using as few bytes as possible, the second loads the address of the given label into the given register.
  • c_mov(Rd, Rs) has been aliased to mov(Rd, Rs).
  • To keep code size down (most of the code in this PR is error checking anyway), there is no support for opcodes with multiple argument types (eg. add(a0, a1) becoming C.ADD A0, A1 or ADD(A0, A0, A1) behind the scenes, or add(a0, 100) turning into ADDI A0, A0, 100).
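To illustrate what "as few bytes as possible" can mean for li, here is a minimal C sketch of one possible size-selection strategy. It is not the code from this PR: the names li_split and li_size_bytes are hypothetical, and the sketch ignores shorter forms such as c.lui or lui followed by c.addi. It relies only on the standard RV32 encodings, where c.li takes a 6-bit signed immediate, addi takes a 12-bit signed immediate, and lui sets the upper 20 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the LUI and ADDI operands that rebuild a constant. */
typedef struct {
    uint32_t upper20; /* operand for LUI (bits 31..12 of the result) */
    int32_t lower12;  /* operand for ADDI, always in [-2048, 2047] */
} li_split_t;

/* Split imm so that (upper20 << 12) + lower12 == imm modulo 2^32. */
static li_split_t li_split(uint32_t imm) {
    li_split_t s;
    /* ADDI sign-extends its immediate, so round the upper part up
     * whenever bit 11 of imm is set. */
    s.upper20 = ((imm + 0x800u) >> 12) & 0xFFFFFu;
    s.lower12 = (int32_t)(imm & 0xFFFu);
    if (s.lower12 >= 0x800) {
        s.lower12 -= 0x1000;
    }
    assert(s.lower12 >= -2048 && s.lower12 <= 2047);
    return s;
}

/* Hypothetical size estimate, in bytes, for "li rd, imm" (rd != zero). */
static unsigned li_size_bytes(int32_t imm) {
    if (imm >= -32 && imm <= 31) {
        return 2; /* c.li rd, imm */
    }
    if (imm >= -2048 && imm <= 2047) {
        return 4; /* addi rd, zero, imm */
    }
    if (li_split((uint32_t)imm).lower12 == 0) {
        return 4; /* lui rd, upper20 */
    }
    return 8; /* lui rd, upper20 followed by addi rd, rd, lower12 */
}
```

For example, li_split(0x12345FFF) yields upper20 = 0x12346 and lower12 = -1, since 0x12346000 - 1 = 0x12345FFF.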


codecov bot commented Aug 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.59%. Comparing base (c732041) to head (931a768).
Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15714   +/-   ##
=======================================
  Coverage   98.59%   98.59%           
=======================================
  Files         167      167           
  Lines       21617    21617           
=======================================
  Hits        21313    21313           
  Misses        304      304           



github-actions bot commented Aug 25, 2024

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32: +32280 +8.046% VIRT_RV32[incl +20480(bss)]

@dpgeorge dpgeorge added the py-core and board-definition labels Aug 26, 2024
@agatti
Contributor Author

agatti commented Sep 3, 2024

@dpgeorge I've resolved the conflict blocking this PR, but now the issue with Arm tests has gotten worse. Before the qemu-arm refactoring it'd take longer than 30 seconds, but would still end. Now it's been stuck for a couple hours already.

Am I missing something in e2b5402 (#15714) when moving files around? I can't see what's wrong, unless there's some extra logic elsewhere I haven't updated.

@dpgeorge
Member

dpgeorge commented Sep 5, 2024

but now the issue with Arm tests has gotten worse. Before the qemu-arm refactoring it'd take longer than 30 seconds, but would still end. Now it's been stuck for a couple hours already.

It looks like this issue with CI is resolved now? The qemu-arm CI job now passes without issue, it seems.

@agatti
Contributor Author

agatti commented Sep 5, 2024

Yes, that issue was indeed resolved - however the inlineasm tests for RV32 won't be picked up until #15784 is merged (and until this PR gets rebased once more).

@agatti agatti force-pushed the inlineasm-riscv32 branch 2 times, most recently from 3bb290e to afb0102 Compare September 6, 2024 11:11
@agatti
Contributor Author

agatti commented Sep 6, 2024

Here's the rebased version taking the merged qemu port into account. I've run the tests locally in a VM being as close as possible to the CI environment, and on an ESP32C3 board.

Running tests for the MPS2_AN385, SABRELITE, and VIRT_RV32 qemu boards works fine, but the MICROBIT and NETDUINO2 targets do not work at all (even without this PR). Are those boards meant to be compile-only?

MICROBIT test run:

dev@mpy-dev-rv32:/micropython/ports/qemu$ make BOARD=MICROBIT test V=1
python3 ../../py/makeversionhdr.py build-MICROBIT/genhdr/mpversion.h
test -e "../../lib/micropython-lib/README.md" || (echo -e "\033[1;31mError: micropython-lib submodule is not initialized.\033[0m Run 'make submodules'"; false)
python3 ../../tools/makemanifest.py -o build-MICROBIT/frozen_content.c -v "PORT_DIR=/micropython/ports/qemu" -v "MPY_DIR=../.." -v "BOARD_DIR=" -v "MPY_LIB_DIR=../../lib/micropython-lib" -b "build-MICROBIT" -f"-march=armv7m" --mpy-tool-flags="" "freeze('test-frzmpy', ('frozen_asm_thumb.py', 'frozen_const.py', 'frozen_viper.py', 'native_frozen_align.py'))"
cd ../../tests && ./run-tests.py --target qemu --device execpty:"qemu-system-arm -global nrf51-soc.flash-size=1048576 -global nrf51-soc.sram-size=262144 -machine microbit -nographic -monitor null -semihosting  -serial pty -kernel ../ports/qemu/build-MICROBIT/firmware.elf" --exclude 'inlineasm/rv32imc' 
# HANGS HERE

NETDUINO2 test run:

dev@mpy-dev-rv32:/micropython/ports/qemu$ make BOARD=NETDUINO2 test V=1
python3 ../../py/makeversionhdr.py build-NETDUINO2/genhdr/mpversion.h
test -e "../../lib/micropython-lib/README.md" || (echo -e "\033[1;31mError: micropython-lib submodule is not initialized.\033[0m Run 'make submodules'"; false)
python3 ../../tools/makemanifest.py -o build-NETDUINO2/frozen_content.c -v "PORT_DIR=/micropython/ports/qemu" -v "MPY_DIR=../.." -v "BOARD_DIR=" -v "MPY_LIB_DIR=../../lib/micropython-lib" -b "build-NETDUINO2" -f"-march=armv7m" --mpy-tool-flags="" "freeze('test-frzmpy', ('frozen_asm_thumb.py', 'frozen_const.py', 'frozen_viper.py', 'native_frozen_align.py'))"
cd ../../tests && ./run-tests.py --target qemu --device execpty:"qemu-system-arm -machine netduino2 -nographic -monitor null -semihosting  -serial pty -kernel ../ports/qemu/build-NETDUINO2/firmware.elf" --exclude 'inlineasm/rv32imc' 
b'MicroPython v1.24.0-preview.297.g34fbf1d55 on 2024-09-06; netduino2 with STM32\r\nType "help()" for more information.\r\n>>> '
Traceback (most recent call last):
  File "/micropython/tests/./run-tests.py", line 1188, in <module>
    main()
  File "/micropython/tests/./run-tests.py", line 1084, in main
    pyb.enter_raw_repl()
  File "/micropython/tools/pyboard.py", line 365, in enter_raw_repl
    raise PyboardError("could not enter raw repl")
pyboard.PyboardError: could not enter raw repl
make: *** [Makefile:169: test] Error 1

@dpgeorge
Member

dpgeorge commented Sep 6, 2024

but MICROBIT and NETDUINO2 targets do not work at all (even without this PR). Are those boards meant to be compile-only?

You are right, these boards don't run the tests on master.

Looking into it, it seems the problems are:

  • NETDUINO2: requires a time.sleep(0.1) in tools/pyboard.py, ProcessPtyToTerminal::__init__, just before creating the serial.Serial(pty, ...) instance. That gives qemu enough time to start up before trying to access the serial port.
  • MICROBIT: probably needs the same patch as NETDUINO2, but then also locks up running the tests/feature_check/native_check.py. Solution would be to either debug why it crashes on that and fix it, or just disable the native emitter on this board.

This is unrelated to this PR and so out of scope. But feel free to fix it if you like 😄

@agatti
Contributor Author

agatti commented Sep 6, 2024

@dpgeorge fyi, MICROBIT crashes when accessing the function table, which doesn't make much sense.

Edit: unless the ROM cannot be read from RAM or some other protection mechanism like that? Adding (xr) to ROM and (xrw) to RAM in the nrf51 linkerscript didn't help anyway.

Edit2: yep, attaching a debugger to an emulator that's debugging a binary tells it all. nRF51 needs some MMU magic beforehand.

...
(gdb) 
0x20000f74 in gc_heap ()
=> 0x20000f74 <gc_heap+3192>:   00 a8   add     r0, sp, #0
(gdb) 
0x20000f76 in gc_heap ()
=> 0x20000f76 <gc_heap+3194>:   d7 f8 b4 40     ldr.w   r4, [r7, #180]  ; 0xb4
(gdb) print/a $r7
$3 = 0x28fe8 <mp_fun_table>
(gdb) ni
0x0001ff8a in HardFault_Handler ()
=> 0x0001ff8a <HardFault_Handler+0>:    03 20   movs    r0, #3
(gdb) x/4a mp_fun_table+180
0x2909c <mp_fun_table+180>:     0x1b689 <mp_setup_code_state_native>    0x1bb9b <mp_small_int_floor_divide>     0x1bb7d <mp_small_int_modulo>   0xe23d <mp_native_yield_from>

@agatti
Contributor Author

agatti commented Nov 20, 2024

This should now work with the changes to the test runner made after 1.24.0.

This PR also makes the test runner able to detect if inline assembler support is enabled and if so for which architecture. If so, the tests/inlineasm/$INLINEASM_ARCH/ directory is automatically added to the test suite. On esp32, qemu, and rp2 it will pick up the correct test files according to what arch is currently in use (allowing for future inlineasm additions like RV64 and other architectures/variants).

@dpgeorge dpgeorge removed the board-definition label Nov 20, 2024
@dpgeorge dpgeorge added this to the release-1.25.0 milestone Nov 20, 2024
@agatti agatti force-pushed the inlineasm-riscv32 branch from 66564ad to 4aefb36 Compare December 2, 2024 20:37
@agatti
Contributor Author

agatti commented Dec 11, 2024

I'm wondering what the policy is for new test files after the introduction of unittest.

Should the test files contained in this PR be moved to that framework (as in, is it a blocking issue) or is it something that can be done at a later stage (not necessarily by myself, either)?

@dpgeorge
Member

I'm wondering what the policy is for new test files after the introduction of unittest.

Using unittest is only for convenience, if it makes writing the test easier. So you don't need to change anything in this PR.

assert(offset_node != NULL && "Offset node pointer is NULL.");
assert(negative != NULL && "Negative pointer is NULL.");

if (!MP_PARSE_NODE_IS_STRUCT_KIND(node, PN_atom_expr_normal) && !MP_PARSE_NODE_IS_STRUCT_KIND(node, PN_factor_2)) {
Member

This validation logic can be simplified, in pseudo code:

if MP_PARSE_NODE_IS_STRUCT_KIND(node, PN_factor_2):
    if MP_PARSE_NODE_IS_TOKEN_KIND(node_struct->nodes[0], MP_TOKEN_OP_MINUS):
        *negative = true;
    elif MP_PARSE_NODE_IS_TOKEN_KIND(node_struct->nodes[0], MP_TOKEN_OP_PLUS):
        // pass
    else:
        goto invalid;
    node = node_struct->nodes[1]

if !MP_PARSE_NODE_IS_STRUCT_KIND(node, PN_atom_expr_normal):
    goto invalid

Note that PN_factor_2 and PN_atom_expr_normal always have 2 elements so there's no need to check that.
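Rendered as self-contained C, the pseudo code above could look like the sketch below. The node_t type and the KIND_* constants are hypothetical stand-ins for MicroPython's parse-node structures and the MP_PARSE_NODE_IS_* macros, used here so that the control flow can be compiled and checked in isolation. It is not the code that was merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the parser's node and token kinds. */
typedef enum {
    KIND_FACTOR_2,         /* unary operator applied to an expression */
    KIND_ATOM_EXPR_NORMAL, /* e.g. 10(a0) */
    KIND_TOKEN_MINUS,
    KIND_TOKEN_PLUS,
    KIND_OTHER
} kind_t;

typedef struct node {
    kind_t kind;
    const struct node *nodes[2]; /* both kinds of interest have 2 children */
} node_t;

/* Returns true and fills *offset_node / *negative when node has the shape
 * [+|-]offset(register); returns false for any other shape. */
static bool parse_offset_operand(const node_t *node,
                                 const node_t **offset_node,
                                 bool *negative) {
    assert(offset_node != NULL && "Offset node pointer is NULL.");
    assert(negative != NULL && "Negative pointer is NULL.");

    *negative = false;
    if (node->kind == KIND_FACTOR_2) {
        kind_t sign = node->nodes[0]->kind;
        if (sign == KIND_TOKEN_MINUS) {
            *negative = true;
        } else if (sign != KIND_TOKEN_PLUS) {
            return false; /* some other unary operator */
        }
        node = node->nodes[1]; /* continue with the operand of the sign */
    }
    if (node->kind != KIND_ATOM_EXPR_NORMAL) {
        return false;
    }
    *offset_node = node;
    return true;
}
```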

Contributor Author

@agatti agatti Dec 28, 2024

Thanks, I ended up doing something similar, but I left the elements count check. I'll take that out.

Right now I'm cleaning up the range checks, as negative offsets are still seen as positive due to the offset node being without its minus sign and you can imagine the rest. Right now there are lots of repeated code chunks for the special cases and I'm merging them together.

@dpgeorge
Member

When an rv32 inline-asm function is entered into, what registers are automatically saved? Is there anything else done in the prelude?

With Thumb inline-asm, a few registers are saved/restored on function entry/exit. This was arguably not a good choice, it might have been a better choice to have a very minimal prelude for inline-asm function.

@agatti
Contributor Author

agatti commented Dec 28, 2024

When an rv32 inline-asm function is entered into, what registers are automatically saved? Is there anything else done in the prelude?

The prelude saves the function table pointer and the three local registers. The reason behind that is purely because they're stored in S-registers and those registers must be restored on exit (in other words, A0-A7 and T0-T6 can be trashed as much as you want, but S0-S11 must be preserved).

In theory T-registers or even A4 to A7 could have been used for those, but then half of the compressed opcodes wouldn't be able to access them.

In fact, the emitter just calls asm_rv32_entry/asm_rv32_exit and one of the changes to py/asmrv32.c present in this PR is to save an internal temporary S-register only whenever there's an emitted function call (that's required to shorten most function calls to just 2 opcodes/4 bytes).

For inline assembly code it's up to the user to keep track of RA, SP, TP, GP, and S0-S11, but that's expected of the user in the first place no matter the assembler used.
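The register classes mentioned above follow the standard RISC-V calling convention and compressed-instruction encoding, which can be summarised in a small C sketch. The function names are hypothetical and are not taken from the PR; registers are identified by their x0..x31 index.

```c
#include <assert.h>
#include <stdbool.h>

/* x8..x9 are s0..s1 and x18..x27 are s2..s11: these must hold their
 * original values again when the function returns. */
static bool rv32_is_s_register(unsigned reg) {
    assert(reg < 32);
    return (reg >= 8 && reg <= 9) || (reg >= 18 && reg <= 27);
}

/* Most compressed opcodes encode a register in 3 bits and can therefore
 * reach only x8..x15, that is s0, s1 and a0..a5. */
static bool rv32_is_compressed_reg(unsigned reg) {
    assert(reg < 32);
    return reg >= 8 && reg <= 15;
}
```

Under these definitions s0 and s1 are the only S-registers that the 3-bit compressed encodings can address, while the T-registers (x5..x7 and x28..x31) fall outside that window entirely.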

@agatti agatti force-pushed the inlineasm-riscv32 branch 2 times, most recently from 89071c8 to d7b61e4 Compare December 28, 2024 12:37
@agatti
Contributor Author

agatti commented Dec 28, 2024

OK, this should be it I guess. I wish the offset-to-register syntax support wasn't bolted on this way, but it works :) That can be cleaned up for v2 anyway.

@dpgeorge
Member

the offset-to-register syntax support wasn't bolted on this way

What do you mean, that you'd prefer it had been designed in from the start?

@agatti
Contributor Author

agatti commented Dec 28, 2024

Indeed, everything else fits in nicely without special cases and exceptions. It's more of a personal preference, really, but re-designing everything to cover that case in a generic way as well would take a bit more time and delay this PR for the 1.25.0 milestone.

Anyway, I'll rework that for v2 when adding dotted opcodes support - in the end it works and the code isn't that bad, but it could be better for sure :)

@dpgeorge
Member

OK, sounds fine to me. I'm glad that the syntax 10(a0) is now supported, I think that's a good improvement.

@dpgeorge
Member

The prelude saves the function table pointer and the three local registers.

Just to be clear, this is necessary for @native/@viper code generation. But for inline-rv32-asm it's not strictly needed, but done just because inline-rv32-asm is using asm_rv32_entry/asm_rv32_exit for convenience. Is that true?

Would it be possible for inline-rv32-asm to use a different entry/exit set of functions and not save any registers automatically? As you say, it's up to the assembler user to keep track of and restore needed registers.

@agatti
Contributor Author

agatti commented Dec 28, 2024

Done. Maybe when more targets support the cm.push/cm.popret opcodes (right now only the pico2 does) it will be feasible to automatically save/restore RA + S0..S11, with an explicit opt-out via an argument to the decorator (that's 2 opcodes/4 bytes vs ~60 bytes/~25 opcodes).

@agatti
Contributor Author

agatti commented Dec 28, 2024

Apologies for the extra push, it's just a fix for a minor issue I found when reviewing the code for the last time (the check for negative values' range didn't take into account that the negative flag was a pointer), plus some commented test code was removed.

Member

@dpgeorge dpgeorge left a comment

This is looking very good now. I particularly like how the test suite can now auto-detect the inline-asm capability and arch, that's nice!

I have a few final comments and then it should be good to go in.

agatti added 2 commits January 1, 2025 10:44
Thumb/Thumb2 tests are now into their own subdirectory, as
RV32IMC-specific tests will be added as part of the RV32 inline
assembler support.

Signed-off-by: Alessandro Gatti <[email protected]>
This makes the existing popcount(uint32_t) implementation found in the
RV32 emitter available to the rest of the codebase.  This version of
popcount will use intrinsic or builtin implementations if they are
available, falling back to a generic implementation if that is not the
case.

Signed-off-by: Alessandro Gatti <[email protected]>
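
The builtin-with-fallback pattern described in the popcount commit message above can be sketched as follows. This is an illustrative version, not the code from the commit: the function names are hypothetical, and the compiler check only covers GCC and Clang.

```c
#include <stdint.h>

/* Portable fallback: classic bit-parallel population count. */
static uint32_t popcount32_generic(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the 4 bytes */
}

/* Use the compiler builtin when one is known to exist, otherwise fall
 * back to the generic implementation. */
static uint32_t popcount32(uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return (uint32_t)__builtin_popcount(x);
#else
    return popcount32_generic(x);
#endif
}
```
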
@agatti agatti force-pushed the inlineasm-riscv32 branch from c3af779 to 5baa894 Compare January 1, 2025 09:50
agatti added 4 commits January 2, 2025 11:49
This commit adds support for writing inline assembler functions when
targeting a RV32IMC processor.

Given that this takes up a bit of rodata space due to its large
instruction decoding table and its extensive error messages, it is
enabled by default only on offline targets such as mpy-cross and the
qemu port.

Signed-off-by: Alessandro Gatti <[email protected]>
In certain circumstances depending on the code size, the
`deflate_decompress` test fails on both ARM and RV32 with a memory
allocation failure error.  The issue is mitigated by having a larger GC
heap, in this case around 20 KBytes more than the original 100 KBytes
default.

This commit makes the GC heap size configurable on a per-arch basis, with
both ARM and RV32 using the enlarged 120 KBytes heap.

Signed-off-by: Alessandro Gatti <[email protected]>
This commit enables by default inline assembly support for the RP2 target
when it is operating in RISC-V mode.  This brings the feature set when in
RISC-V mode to parity with what's available in ARM mode.

Signed-off-by: Alessandro Gatti <[email protected]>
This commit implements a method to detect at runtime if inline assembler
support is enabled, and if so which platform it targets.

This allows clean test runs even on modified versions of ARM-based ports
where inline assembler support is disabled, allows running inline
assembler tests on ports where the feature is disabled by default but has
been manually enabled, and always runs the correct inlineasm tests on
ports that support more than one architecture (esp32, qemu, rp2).

Signed-off-by: Alessandro Gatti <[email protected]>
Member

@dpgeorge dpgeorge left a comment

Tested on RPI_PICO2 in RISC-V mode. Works well.

@dpgeorge dpgeorge merged commit 931a768 into micropython:master Jan 2, 2025
65 checks passed
@dpgeorge
Member

dpgeorge commented Jan 2, 2025

@agatti thanks for your efforts on this, it's a very nice feature to have, especially because RISC-V will become more and more prevalent in MCUs in the future.

BTW, if you want to have a go at reducing the code size, you should be able to cut off at least 1k with the following tricks:

  • use qstr_short_t for the type of REGISTERS_QSTR_TABLE
  • for the argumentX_mask variables of opcode_t, make them indirect enums into a table of masks (because there are only a limited set of possible masks), eg enum { MASK_00000000, MASK_00000FFF, MASK_FFFFFFFF, ...} then access via MASK_TABLE[opcode->argument1_mask].
  • combined with the above, use qstr_short_t for the qstring entry in opcode_t and 5-bit fields for argumentX_mask, which should allow fitting the qstr and the 3 argument masks in 32 bits
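
A minimal C sketch of the layout suggested above, assuming only three distinct masks for brevity. The 16-bit qstring field plays the role of qstr_short_t, and the accessor name is hypothetical. The exact layout of bit-fields is implementation-defined, although common compilers pack these four fields into a single 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the distinct masks; a real table would list
 * every mask the opcode definitions actually use. */
enum { MASK_00000000, MASK_00000FFF, MASK_FFFFFFFF, MASK_COUNT };

static const uint32_t MASK_TABLE[MASK_COUNT] = {
    0x00000000u, 0x00000FFFu, 0xFFFFFFFFu,
};

/* A 16-bit qstr index plus three 5-bit mask indices: 31 bits in total. */
typedef struct {
    unsigned int qstring : 16;
    unsigned int argument1_mask : 5;
    unsigned int argument2_mask : 5;
    unsigned int argument3_mask : 5;
} opcode_t;

/* Resolve the stored index into the full 32-bit mask. */
static uint32_t opcode_argument1_mask(const opcode_t *opcode) {
    assert(opcode->argument1_mask < MASK_COUNT);
    return MASK_TABLE[opcode->argument1_mask];
}
```

The saving comes from storing each distinct 32-bit mask once in MASK_TABLE, while every opcode entry carries only small indices into it.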

@agatti agatti deleted the inlineasm-riscv32 branch January 2, 2025 21:05
@agatti
Contributor Author

agatti commented Jan 2, 2025

* use `qstr_short_t` for the type of `REGISTERS_QSTR_TABLE`

Oh, I don't think I've ever seen qstr_short_t being used before. So I must assume there's a hard limit of 64k qstrs?

* for the `argumentX_mask` variables of `opcode_t`, make them indirect enums into a table of masks [...].

That's nice - wish I'd thought of that :) This will end up as a PR after 1.25.0 is out, so it can be featured in a later version along with eventual fixes for bugs that may be found in the meantime.

All things considered, reducing the footprint by 30% or more would probably make this a candidate for being enabled by default on ESP32Cx targets too. I can probably shorten some error messages' text and coalesce some messages as well to further reduce the space taken up by this feature, I'll see what can be done on that.

Anyway, thanks for the help on this - glad it's been merged!

@dpgeorge
Member

dpgeorge commented Jan 2, 2025

I can probably shorten some error messages' text and coalesce some messages as well to further reduce the space taken up by this feature,

Note that there are configurable error message levels:

// Exception messages are removed
#define MICROPY_ERROR_REPORTING_NONE     (0)                                   
// Exception messages are short static strings                                 
#define MICROPY_ERROR_REPORTING_TERSE    (1)                                 
// Exception messages provide basic error details     
#define MICROPY_ERROR_REPORTING_NORMAL   (2)                                 
// Exception messages provide full info, e.g. object names                       
#define MICROPY_ERROR_REPORTING_DETAILED (3)                                                                                

The default for most ports is NORMAL. It's possible to provide different error messages for the different levels.
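
As an illustration of per-level messages, the sketch below selects a message at compile time. The four level macros are the ones quoted above, and MICROPY_ERROR_REPORTING is the setting that is compared against them. The function name and the message strings are hypothetical, and real code would raise an exception rather than return a string.

```c
#define MICROPY_ERROR_REPORTING_NONE     (0)
#define MICROPY_ERROR_REPORTING_TERSE    (1)
#define MICROPY_ERROR_REPORTING_NORMAL   (2)
#define MICROPY_ERROR_REPORTING_DETAILED (3)

/* Defaults to NORMAL here so that the sketch compiles on its own. */
#ifndef MICROPY_ERROR_REPORTING
#define MICROPY_ERROR_REPORTING (MICROPY_ERROR_REPORTING_NORMAL)
#endif

/* Hypothetical helper: the message text depends on the configured level. */
static const char *immediate_range_error(void) {
#if MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_NONE
    return ""; /* messages removed entirely */
#elif MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_TERSE
    return "bad immediate"; /* short static string */
#else
    return "immediate out of range for opcode"; /* NORMAL and DETAILED */
#endif
}
```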

But actually I think what you have done already is pretty good, it's a much better user experience with more detailed error messages for tricky things like inline assembler. Maybe it could be tweaked to reduce size by a little, eg I just noticed a few messages end in . which should be removed.

Labels
py-core Relates to py/ directory in source
2 participants