mpy_ld.py: Support modules larger than 4KiB on armv6m #12241

Closed
wants to merge 1 commit into from

jonnor
Contributor

@jonnor jonnor commented Aug 15, 2023

Hit a problem with modules larger than 4 KiB when working on dynamic native modules for RP2040. This should fix that for ARM Thumb targets, using a strategy similar to the one used for ARM Thumb 2 targets.

Have tested it on RP2040, and it seems to work: My module gets initialized, and the defined functions can be called. I have never written Thumb assembler before, so please do sanity check it!

@jonnor
Contributor Author

jonnor commented Aug 15, 2023

Besides testing on device, I wrote an automated test. However, I am not sure where it should be placed (or whether you want such tests), as I could not find any existing unit tests for functions in mpy_ld.py.

Here is the code for the test:

from tools.mpy_ld import asm_jump_thumb
import struct

def play_thumb_bl(instruction, PC=0):
    """Compute the effects of "BL" ARM Thumb instruction on the program counter (PC)

    Based on explanations and pseudocode from
    https://stackoverflow.com/a/70756436/
    """

    bl0, bl1 = struct.unpack('<HH', instruction)

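    # check that this is a valid BL instruction pair
    # Encoding of each halfword: 111H HOOO OOOO OOOO (H: 2 bits, O: 11-bit offset)
    code0 = (bl0 & 0xF800) >> 11
    assert code0 == 0b11110 # H=0b10: first half, high offset bits
    code1 = (bl1 & 0xF800) >> 11
    assert code1 == 0b11111 # H=0b11: second half, low offset bits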

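    # In Thumb state the PC reads as the instruction address + 4,
    # so an offset of 0 targets the instruction after the BL pair
    PC += 4

    # effect of first instruction: sign-extended high part of the offset
    offset = bl0 & 0x7FF
    if offset & 0x400:
        offset -= 0x800
    LR = PC + (offset << 12)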

    # effect of second instruction
    offset = bl1 & 0x7FF
    PC = LR + (offset << 1)

    return PC


def test_asm_jump_thumb():

    start = 0x1000 # BL only used for more than 11 bit offset
    for offset in range(start, 0xFFFF, 4):

        instruction = asm_jump_thumb(offset)
        pc_change = play_thumb_bl(instruction)

        #print(instruction)
        assert pc_change == offset

test_asm_jump_thumb()

@codecov

codecov bot commented Aug 15, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.42%. Comparing base (f61fac0) to head (1962597).
Report is 217 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12241   +/-   ##
=======================================
  Coverage   98.42%   98.42%           
=======================================
  Files         161      161           
  Lines       21251    21253    +2     
=======================================
+ Hits        20917    20919    +2     
  Misses        334      334           


@github-actions

github-actions bot commented Aug 15, 2023

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS

@dpgeorge dpgeorge added the tools Relates to tools/ directory in source, or other tooling label Sep 1, 2023
@dpgeorge
Member

dpgeorge commented Sep 1, 2023

Thanks for the contribution.

But I don't think this will work because it's a bl instruction which is branch-and-link. That means it stores the return PC in LR. So I don't see how this code can return to the caller, because LR is overwritten.

I'm not sure how this patch worked in your case?
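
To make the concern about LR concrete, here is a toy Python model of the entry sequence. It is only a sketch: the addresses and the names `RUNTIME_RETURN`, `MODULE_BASE`, `MPY_INIT` and `run_entry` are made up for illustration, and it assumes mpy_init returns to whatever LR held when it was entered.

```python
# Toy model of the module entry point; all addresses are hypothetical.
RUNTIME_RETURN = 0x10001235      # return address in the runtime (Thumb bit set)
MODULE_BASE = 0x20000000         # load address of the module text
MPY_INIT = MODULE_BASE + 0x1808  # address of mpy_init


def run_entry(save_lr):
    """Return the address where execution resumes after mpy_init returns."""
    stack = []
    pc, lr = MODULE_BASE, RUNTIME_RETURN  # state when the runtime calls the module

    if save_lr:
        stack.append(lr)  # push {r0, lr} (r0 is left out of this model)
        pc += 2

    lr = (pc + 4) | 1  # bl: LR <- address after the 4-byte BL, Thumb bit set
    pc = MPY_INIT      # bl: branch to mpy_init

    pc = lr & ~1       # mpy_init returns to whatever LR held on entry

    if save_lr:
        pc = stack.pop() & ~1  # pop {r0, pc}: resume in the runtime
    return pc
```

In this model `run_entry(False)` resumes at `MODULE_BASE + 4`, which is still inside the module, while `run_entry(True)` resumes at the runtime's return address.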

@jonnor
Contributor Author

jonnor commented Sep 1, 2023

Hi @dpgeorge and thank you for the review. I have no idea why it would appear to work... This is my first time doing any ARM assembly. Do you have any tips as to what the correct approach would be?

I don't really know where this code is invoked from, which makes it a bit hard for me to reason about. I have also not found any way to get the disassembly for the .mpy files to look at the code produced. I guess I need to bring out a debugger on an ARM Cortex M0 to see what is going on.

@dpgeorge
Member

dpgeorge commented Sep 2, 2023

Looking at assembly output of the rp2 firmware (generated by gcc), you can find these patterns used to do a long jump:

200076f8 <__mbedtls_rsa_import_veneer>:
200076f8:       b401            push    {r0}
200076fa:       4802            ldr     r0, [pc, #8]
200076fc:       4684            mov     ip, r0
200076fe:       bc01            pop     {r0}
20007700:       4760            bx      ip
20007702:       bf00            nop
20007704:       100516cd        .word   0x100516cd
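
As a rough cross-check of the listing above, the same veneer can be packed in Python using the opcodes shown in the disassembly. This is a sketch, not code from mpy_ld.py: the helper names `veneer_bytes` and `literal_address` are hypothetical. The second helper shows why `ldr r0, [pc, #8]` at 0x200076fa reads the word at 0x20007704.

```python
import struct


def veneer_bytes(target):
    """Pack the gcc-style veneer: six Thumb halfwords plus a 32-bit literal."""
    halfwords = [
        0xB401,  # push {r0}
        0x4802,  # ldr  r0, [pc, #8]
        0x4684,  # mov  ip, r0
        0xBC01,  # pop  {r0}
        0x4760,  # bx   ip
        0xBF00,  # nop (keeps the literal word-aligned)
    ]
    return struct.pack("<6HI", *halfwords, target)


def literal_address(ldr_addr, imm):
    """Address read by a Thumb PC-relative LDR located at ldr_addr."""
    # The base is (instruction address + 4) rounded down to a multiple of 4.
    return ((ldr_addr + 4) & ~3) + imm


veneer = veneer_bytes(0x100516CD)  # bit 0 set: the target is Thumb code
```

Here `len(veneer)` is 16 and `literal_address(0x200076FA, 8)` evaluates to 0x20007704, matching the `.word` line of the listing.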

That's 16 bytes, but I think it's the proper way to do it. You might be able to get away with doing it in fewer bytes using something like:

push {r0, lr}
bl <dest>
pop {r0, pc}

That's only 8 bytes. The downside is that it increases the stack usage by 8 bytes, and may also make it harder to debug code due to the unusual stack frame. But I don't think that's a concern here. So maybe try using this 8 byte version.
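
One possible encoding of this 8-byte sequence is sketched below. It is an illustration under stated assumptions, not the code that ended up in mpy_ld.py: `long_jump_thumb` and `decode_target` are hypothetical names, and `entry` is assumed to be the byte offset of the target from the start of the sequence.

```python
import struct


def long_jump_thumb(entry):
    """Encode push {r0, lr}; bl <entry>; pop {r0, pc} as 8 bytes."""
    # The BL sits 2 bytes into the sequence and Thumb reads PC as its address + 4.
    bl_off = entry - 2 - 4
    assert -(1 << 22) <= bl_off < (1 << 22), bl_off  # reach of this BL encoding
    push = 0xB400 | 0x0100 | 0x0001            # push {r0, lr}
    bl_hi = 0xF000 | ((bl_off >> 12) & 0x7FF)  # BL first half: high offset bits
    bl_lo = 0xF800 | ((bl_off >> 1) & 0x7FF)   # BL second half: low bits (bit 0 dropped)
    pop = 0xBC00 | 0x0100 | 0x0001             # pop {r0, pc}
    return struct.pack("<4H", push, bl_hi, bl_lo, pop)


def decode_target(code):
    """Recover the branch target, relative to the start of the sequence."""
    _push, hi, lo, _pop = struct.unpack("<4H", code)
    off_hi = hi & 0x7FF
    if off_hi & 0x400:  # sign-extend the 11-bit high part
        off_hi -= 0x800
    return 2 + 4 + (off_hi << 12) + ((lo & 0x7FF) << 1)
```

The output is 8 bytes for every `entry` in range, and `decode_target(long_jump_thumb(0x1808))` gives back 0x1808.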

But note that this version should only be used if necessary. That makes it a bit tricky because you need to know the address/offset of mpy_init early on so you know whether the offset to this symbol will fit in a short branch, or whether the above long branch will be needed.

@jonnor
Contributor Author

jonnor commented Sep 4, 2023

Thanks a lot @dpgeorge - I will try something like that, sometime in the next weeks.

Regarding only doing it when necessary, this is already inside a check for how large the offset is. So that should be sufficient?

@dpgeorge
Member

dpgeorge commented Sep 4, 2023

> Regarding only doing it when necessary, this is already inside a check for how large the offset is. So that should be sufficient?

Yes, but the issue is that the asm_jump function is called twice, the first time with a dummy value of 8. This is used to determine the offsets of everything. The second time it's called, it must return a bytes object of the same size as it returned the first time. There are ways around this but it's a bit tricky.
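
The size constraint can be illustrated with a small check. This is a sketch with placeholder bytes rather than real instruction encodings. All function names are hypothetical, and only the range test is taken from the assertion in asm_jump_thumb.

```python
def fits_short(entry):
    """Range test used for the short Thumb branch (from the asm_jump_thumb assert)."""
    b_off = entry - 4
    return b_off >> 11 == 0 or b_off >> 11 == -1


def variable_size_jump(entry):
    """Hypothetical encoder: 2 bytes when the short branch fits, else 8."""
    return bytes(2) if fits_short(entry) else bytes(8)


def fixed_size_jump(entry):
    """Hypothetical encoder that always returns 8 bytes."""
    return bytes(8)


def sizes_agree(asm_jump, entry):
    """Mimic the two calls: a dummy value of 8 first, then the real offset."""
    first = asm_jump(8)
    second = asm_jump(entry)
    return len(first) == len(second)
```

With an offset of 6157 (4 more than the 6153 reported in the AssertionError), `sizes_agree(variable_size_jump, 6157)` is False while `sizes_agree(fixed_size_jump, 6157)` is True, which is why a fixed-size jump avoids having to rework the two-pass layout.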

@jonnor
Contributor Author

jonnor commented Jul 6, 2024

> Yes, but the issue is that the asm_jump function is called twice, the first time with a dummy value of 8. This is used to determine the offsets of everything. The second time it's called, it must return a bytes object of the same size as it returned the first time. There are ways around this but it's a bit tricky.

I see. In that case, I propose always doing the longer jump. It is only a few bytes more than the short jump, and this code exists once per module and is called once (at import time).

@jonnor
Contributor Author

jonnor commented Jul 7, 2024

I tried doing the jump with the pushing and popping (8 byte version). But first attempts just ended up with things hanging. Will order a hardware debugger, so I can look at this properly.

Where is this jump performed when actually loading the .mpy file? I need to find a decent starting point for my debugging...
I have found mp_raw_code_load_file() in py/persistentcode.c, which seems to load the .mpy and store it in (executable) memory. And then do_execute_proto_fun() in py/builtinimport.c seems to be the place where the module is executed?
Is that where mp_make_function_from_proto_fun() and mp_obj_new_fun_native() are used to make a MicroPython function that wraps the executable code? So, to the best of my guesses, fun_data is the actual executable code from the .mpy file (the env.full_text when seen from the perspective of build_mpy() in mpy_ld.py)?

Any guidance here would be much appreciated!

@dpgeorge
Member

dpgeorge commented Jul 8, 2024

> I tried doing the jump with the pushing and popping (8 byte version). But first attempts just ended up with things hanging.

Maybe try doing the 16-byte version to start with, because we know that works.

> Where is this jump performed when actually loading the .mpy file?

When the .mpy is imported (or any .py file is imported) the top-level module code is executed. That's the point where the native code is run, and it just executes the first instruction in the text segment. That first instruction is this jump, which jumps to mpy_init within the native code.

The transfer of control from MicroPython runtime to the native code (the jump) happens via:

do_execute_proto_fun
  mp_call_function_0
    mp_call_function_n_kw
      fun_viper_call
        <jump instruction>
          mpy_init

Using a veneer with the BX instruction
All other architectures seem to support larger modules already

Fixes exception such as:

    File "micropython/tools/mpy_ld.py", line 904, in build_mpy
      jump = env.arch.asm_jump(entry_offset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "micropython/tools/mpy_ld.py", line 92, in asm_jump_thumb
      assert b_off >> 11 == 0 or b_off >> 11 == -1, b_off
                               ^^^^^^^^^^^^^^^^^
    AssertionError: 6153

Signed-off-by: Jon Nordby <[email protected]>
@jonnor jonnor force-pushed the fix-large-thumb-jump branch from f3191bc to 1962597 Compare July 14, 2024 23:38
@jonnor
Contributor Author

jonnor commented Jul 14, 2024

@dpgeorge thanks a lot for the tips. Being able to trace the code in the debugger made it much more doable to figure this out.

After some fiddling, I seem to have gotten the BX/veneer type of jump to work. It is like the example veneer from GCC you showed above - except that we need to add PC to our offset value to achieve a relative jump.
I will do some more testing on various module sizes etc., clean up the code a bit, and probably update the native mod tests for ARMv6m.

Is it acceptable to do this far jump always? Then the change can be localized to asm_jump_thumb(). Otherwise, we will have to complicate the overall logic of mpy_ld to account for the potentially-changing jump size (as you mentioned in a previous comment). That seems like a lot of trouble to save (up to) 12 bytes in module size... But it is up to you as maintainers to decide.

@dpgeorge
Member

> After some fiddling, I seem to have gotten the BX/veneer type of jump to work.

Great! The code you pushed here looks a lot better now.

> Is it acceptable to do this far jump always?

Well... if you could get the following jump version working IMO that would be better (it's half the size):

push {r0, lr}
bl <dest>
pop {r0, pc}

If that works I'd be happy to have that be used in all cases.

@vshymanskyy
Contributor

Looks like this is required for WebAssembly .wasm to .mpy conversion: https://github.com/vshymanskyy/wasm2mpy

@dpgeorge
Member

dpgeorge commented Sep 8, 2024

@jonnor do you want to have a go at reducing the size of the jump, based on my comment #12241 (comment) ?

If not, I can do it (and you could test it).

@jonnor
Contributor Author

jonnor commented Sep 8, 2024

@dpgeorge I would love some help with this. It is quite far outside my area of expertise, and longer stretches of focus time for this kind of work will be rare in the upcoming weeks/months. But I should be able to test it :)

@dpgeorge
Member

dpgeorge commented Sep 9, 2024

See #15812 for an alternative to this PR that emits 8 bytes for the jump.

@jonnor
Contributor Author

jonnor commented Sep 14, 2024

Superseded by the other PR linked by @dpgeorge

@jonnor jonnor closed this Sep 14, 2024