Taken from: https://dmitry.gr/index.php?r=06.%20Thoughts&proj=10.%20RomRam
Using QSPI RAM with RP2040's SSI in read-write mode
Table of Contents
The problem
Towards a solution
Read-only RAM is not much use
Let the nasty hacks begin
The horror
Emulators all the way down
Polishing it to perfection
There are always hardware bugs
Memory protection
Multi-CPU
Performance
Download
The problem
Can you use 8MB of external RAM with RP2040, memory mapped, like real memory? I call this ROMRAM.
RP2040 is a rather versatile chip. One of its most convenient features is support for flash XIP via SSI. SSI is
quite configurable and can support all sorts of flash chips. It is, of course, not entirely bug free (try to
configure it for SPI commands and QPI addresses, for example, see how that goes), but a large memory
with a fast cache is super nice. There is only one issue: RP2040's XIP mode only supports read and
execute accesses, not writes. This makes sense given its purpose and what it was designed for, but who
cares about that? COULD we attach a RAM to it? Well, actually this is not too hard. QSPI SRAM chips
exist, made by ISSI, APMEMORY, and (my favourite) VilsionTech. They talk more or less the same
protocol, and getting SSI to talk to them is trivial. By itself, this is useless... You can indeed manually
issue read and write accesses to the RAM, but it is not memory-mapped. Could it be? Sure! Enabling XIP
and configuring it properly will work: the RAM will support read and execute, but not write. This is still
not all that useful.
Towards a solution
First of all, how would you boot without persistent memory? I solved this by having both a flash and a
RAM onboard. RP2040 only has a single nCS pin for SSI and only a single memory mapped address range,
so we'll not be able to use them both. The idea is to boot from flash, copy flash to the start of RAM, and
continue running from RAM. How do we make all of this work? It does not take much: two OR gates and
a NOT gate will do. In my design I used a tiny SMD dual-OR gate IC and a tiny SMD NAND gate as an
inverter. We'll also need two resistors. The RP2040's SSI nCS output is pulled up and feeds one input
of each OR gate. A GPIO pin called RAM/nROM, pulled down by default, is the other
part of the equation. It goes to the second input of one OR gate (gate A) and to the inverter. The output
of the inverter goes to the second input of the other OR gate (gate B). Gate A's output goes to the flash
chip's nCS input, gate B's output goes to the RAM's nCS.
What does this accomplish? When we boot, the GPIO is floating, so the pulldown provides a logic low.
This means that RP2040's SSI accesses flash (via gate A), and we can boot. The first stage loader can load
a larger second stage loader to internal RAM. That loader can copy the entire application (in my case
2MB) from flash to RAM using almost all the internal memory as temporary space (in my case 256KB). It
can toggle the RAM/nROM pin and reconfigure SSI as needed to access flash and RAM. Then, XIP can be
enabled, and with proper SSI config, the RAM/nROM can be left in the high state, causing all accesses to
go to RAM now.
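The gate arrangement described above can be modelled in a few lines of C. This is only a sketch of the logic, not code from the download: the struct and function names are made up for illustration, and `true` stands for a high level on a wire.

```c
#include <stdbool.h>

/* Levels on the two active-low chip-select outputs (hypothetical names). */
struct cs_lines {
    bool flash_ncs; /* output of OR gate A, goes to the flash chip */
    bool ram_ncs;   /* output of OR gate B, goes to the RAM chip */
};

/* ssi_ncs:  the RP2040 SSI chip select, low while a transfer is active.
 * ram_nrom: the RAM/nROM GPIO, low selects flash, high selects RAM. */
static struct cs_lines route_chip_select(bool ssi_ncs, bool ram_nrom)
{
    struct cs_lines out;
    out.flash_ncs = ssi_ncs || ram_nrom;   /* gate A */
    out.ram_ncs   = ssi_ncs || !ram_nrom;  /* gate B, fed by the inverter */
    return out;
}
```

With RAM/nROM low (the boot state), only the flash chip ever sees nCS go low. With it high, only the RAM does. When the SSI is idle (nCS high), both chips stay deselected regardless of the GPIO.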
This will almost work. If you actually try this, you'll find a fun bug. If you attempt to reset the RP2040
using its RUN pin, you'll note that the manual is wrong, and the GPIO module does NOT get reset, the pin
does NOT go back to floating, and you are still accessing RAM and not flash. Oopsie... Not sure how this
was not noticed. In my case this was not a problem since when I ran out of pins, I moved RAM/nROM to
an i2c io expander, and its nRST input does work. If you plan to use this without an io expander, keep this
annoyance in mind.
Read-only RAM is not much use
OK, so our RP2040 now has a memory-mapped RAM. This is quite useless since we cannot write to it
directly. Oh, sure, we can issue SSI commands, but this is (1) annoying, (2) boring, and (3) will not allow
unmodified software that needs a few megabytes of RAM to run. How do we make this better? With
nasty hacks, of course! The RP2040 has a few features (and misfeatures) that we can glue together to
improve the situation. The XIP cache allows us to flush lines in it, which will be important since the cache
has no idea that the backing store is writeable and can change. There is also an MPU which we can
[ab]use.
Let the nasty hacks begin
By default, a write anywhere to 0x10xxxxxx (normal cached access to XIP) will be treated as a command
to flush a cache line. That means that any write attempt in normal code will be silently ignored. No fun!
Let's use the MPU to write-protect the region. Now a write attempt will trigger a HardFault. Ok, that's
better! Our HardFault handler can now ... quickly interpret the faulting instruction, emulate the write,
flush the cache line, and resume. This sounds easy ... NOT.
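As a rough illustration of the write-protect step, here is how the value for the ARMv6-M MPU_RASR register of a read-only region could be computed. The field positions come from the ARMv6-M architecture. The helper names are hypothetical, and the region size used by the actual code is not stated in this article, so treat this as a sketch rather than the real configuration.

```c
#include <stdint.h>

/* Field positions in the ARMv6-M MPU_RASR register. */
#define RASR_ENABLE      (1u << 0)
#define RASR_SIZE_SHIFT  1      /* region covers 2^(SIZE + 1) bytes */
#define RASR_AP_SHIFT    24
#define RASR_AP_RO       6u     /* AP = 0b110: read-only for all code */

/* log2 of a power-of-two region size, or 0 if the size is not a legal
 * ARMv6-M region size (not a power of two, or below 256 bytes). */
static unsigned region_log2(uint32_t size_bytes)
{
    unsigned bits = 0;

    if (size_bytes < 256u || (size_bytes & (size_bytes - 1u)) != 0)
        return 0;
    while ((size_bytes >> bits) != 1u)
        bits++;
    return bits;
}

/* RASR value for an enabled, read-only, executable region.
 * XN (bit 28) is left clear so code can still run from the region.
 * The matching MPU_RBAR base address must be aligned to the size. */
static uint32_t mpu_rasr_read_only(uint32_t size_bytes)
{
    unsigned log2_size = region_log2(size_bytes);

    if (log2_size == 0)
        return 0;
    return (RASR_AP_RO << RASR_AP_SHIFT)
         | ((uint32_t)(log2_size - 1u) << RASR_SIZE_SHIFT)
         | RASR_ENABLE;
}
```

For example, covering an 8MB RAM mapped at the start of the XIP window would use `mpu_rasr_read_only(8u * 1024u * 1024u)` together with a base address of 0x10000000.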
The horror
Let's consider the concept. Clearly, this HardFault handler cannot itself live in XIP memory, since we do
not want the XIP cache attempting a read while we're trying to issue a write. There will also be some
other limits. We can only emulate accesses we can understand. What other kinds are there? There are
two more sources of writes in the system besides code. One is DMA. The answer here is simple: we're
targeting running unmodified code from elsewhere. Such code would not be relying on RP2040's DMA,
so no issue here. And if you use DMA, be careful to not attempt to DMA to our ROMRAM (reads are OK).
The second source of writes we cannot understand and emulate is the Cortex-M0 CPU itself. The CPU will
push 8 words to the current stack on any interrupt or fault. If the current stack lives in our ROMRAM,
these writes will fail (caught by the MPU) and we'll have lost the info we need to resume the current
code. The answer is, more or less, the same as before. Most likely "existing code" does not directly
manipulate the stack pointer, so this should be avoidable. If you are writing new code and relying on
ROMRAM, keep your stack in an internal memory of some sort. Easy.
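For reference, the 8 words the CPU pushes have a fixed layout, and the EXC_RETURN value in LR on handler entry says which stack they went to. The sketch below uses hypothetical names. The SRAM range is the RP2040's 264KB of internal SRAM. It shows the kind of check a handler could make, not what the downloadable handler actually does.

```c
#include <stdbool.h>
#include <stdint.h>

/* The frame pushed by the core on exception entry, lowest address first. */
struct exception_frame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Bit 2 of EXC_RETURN is set when the frame was pushed to the process
 * stack (PSP) and clear when it was pushed to the main stack (MSP). */
static bool frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & 4u) != 0;
}

/* RP2040 internal SRAM spans 0x20000000 to 0x20041FFF. A frame that lies
 * entirely inside it was stacked without touching ROMRAM. */
static bool frame_in_internal_sram(uint32_t frame_addr)
{
    const uint32_t sram_base = 0x20000000u;
    const uint32_t sram_size = 0x00042000u;
    uint32_t offset = frame_addr - sram_base; /* wraps if below the base */

    return offset < sram_size
        && sram_size - offset >= sizeof(struct exception_frame);
}
```

The stacked `pc` field is what points at the faulting store, and `r0` to `r3` in the frame are the only low registers the handler has to fetch from memory rather than from the live register file.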
Emulators all the way down
How easy is it to write a super fast partial ARMv6M emulator that can properly emulate any write
instruction, including complex ones like STMIA? It is actually not too hard, especially if you throw some
RAM at the problem. The simplest way to dispatch on the instruction type is to use the top 7 bits of it.
That implies a table of 128 entries. That is 256 bytes of just jump instructions. This is not too hard to
justify, really. So, as we take the HardFault, and after we assume that the stacked PC and xPSR.T are set right
(checking for this would take more cycles), we can read the faulting instruction. Shift it right, add this to
PC, and then come the 128 jump instructions to dispatch based on all the possible 128 cases. Most of
them will go to a "some other fault happened" label since they do not decode to a valid instruction that
could have caused a write. There are 2 variants of each: STRB, STRH, and STR that we need to handle.
Since ARMv6M requires all writes to be properly aligned, we need not worry about any QSPI RAM page-
crossing limitations here. We get the value from the proper register (decoding this using a few more
jumptables), byteswap it (SPI is BE, CPU is LE), and issue the write directly to the SSI hardware.
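The real handler does this dispatch with a jump table in assembly. As a readable stand-in, here is the same classification written as C, using the standard 16-bit Thumb encodings. All names are hypothetical. SP-relative STR and PUSH are deliberately left out, in line with the assumption above that the stack lives in internal memory.

```c
#include <stdint.h>

enum store_kind {
    NOT_A_STORE,
    STORE_WORD_REG, STORE_HALF_REG, STORE_BYTE_REG,
    STORE_WORD_IMM, STORE_HALF_IMM, STORE_BYTE_IMM,
    STORE_MULTIPLE
};

/* Classify a 16-bit Thumb instruction by its top 7 bits. */
static enum store_kind classify_store(uint16_t insn)
{
    unsigned top7 = (unsigned)insn >> 9;

    switch (top7) {
    case 0x28: return STORE_WORD_REG;   /* STR  Rt, [Rn, Rm] */
    case 0x29: return STORE_HALF_REG;   /* STRH Rt, [Rn, Rm] */
    case 0x2A: return STORE_BYTE_REG;   /* STRB Rt, [Rn, Rm] */
    default:   break;
    }
    switch (top7 >> 2) {                /* the top 5 bits */
    case 0x0C: return STORE_WORD_IMM;   /* STR  Rt, [Rn, #imm5 * 4] */
    case 0x0E: return STORE_BYTE_IMM;   /* STRB Rt, [Rn, #imm5] */
    case 0x10: return STORE_HALF_IMM;   /* STRH Rt, [Rn, #imm5 * 2] */
    case 0x18: return STORE_MULTIPLE;   /* STMIA Rn!, {reglist} */
    default:   return NOT_A_STORE;      /* some other fault happened */
    }
}

/* Target address of STR (immediate), given the values of r0..r7. */
static uint32_t str_imm_address(uint16_t insn, const uint32_t regs[8])
{
    uint32_t imm5 = ((uint32_t)insn >> 6) & 0x1Fu;
    uint32_t rn   = ((uint32_t)insn >> 3) & 0x7u;

    return regs[rn] + imm5 * 4u;
}

/* Little-endian CPU word to the big-endian order sent over SPI. */
static uint32_t byteswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

For instance, 0x6048 is `STR r0, [r1, #4]`: it classifies as `STORE_WORD_IMM`, and with r1 holding 0x10200000 the computed address is 0x10200004.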
And then there is STMIA... This is a complex beast that can write up to 8 words to RAM at any word-
aligned address. There are three ways I could have handled this. The first is to issue each write as a word
write to QSPI. This will work for all QSPI RAMs. The second is to issue it as one long write. This is the
fastest option, but it will only work on Vilsion RAMs since both ISSI and APMEMORY chips wrap all
accesses to a 1KB-address-window. The third option is to detect crossing a 1KB boundary, and switch
between the above options. This is the most complex option, and the checking itself may be more cost
than it is worth. My code uses option two, since I use chips from VilsionTech. With that, emulating STMIA
is just a matter of sending the proper words to write in a row fast enough.
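For chips that do wrap at 1KB, the third option boils down to one small calculation: how many words fit before the end of the current window. A minimal sketch, with hypothetical names and assuming a word-aligned address, could look like this. It is not part of the provided code, which uses option two.

```c
#include <stdint.h>

#define WRAP_WINDOW_BYTES 1024u

/* Number of words an STMIA writes: one per set bit in its register list. */
static unsigned stmia_word_count(uint16_t insn)
{
    unsigned reglist = insn & 0xFFu;
    unsigned count = 0;

    while (reglist != 0) {
        reglist &= reglist - 1u;    /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* How many of word_count words, starting at the word-aligned address
 * addr, can go out in one burst before the 1KB window would wrap. */
static unsigned words_before_wrap(uint32_t addr, unsigned word_count)
{
    uint32_t bytes_left = WRAP_WINDOW_BYTES - (addr & (WRAP_WINDOW_BYTES - 1u));
    unsigned words_left = (unsigned)(bytes_left / 4u);

    return word_count < words_left ? word_count : words_left;
}
```

If `words_before_wrap()` returns less than the total, the remainder goes out as a second burst starting at the next window. Since an STMIA writes at most 8 words, there is never more than one split.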
Polishing it to perfection
There are always hardware bugs
Fast enough? What!? Yes... RP2040's SSI seems to ignore the programmed "NDF" value for write-only
transactions. Once it has started a write, it will raise nCS anytime the TX FIFO is empty. This means that
you need to fill it just fast enough to keep it busy. This, in turn, means that you should carefully watch
your SSI clock divisor... There was also an issue I found with writing to the SSI FIFO too fast (even when it
is empty) and a NOP was needed. Do not ask... There were more bugs in the SSI. For example,
sometimes, requesting a cache flush would trigger a XIP read. As you can imagine, this completely breaks
things if we're in the middle of issuing a write command. The solution there was to delay all cache
flushing till after the writing is done. This was only an issue for STMIA, of course, since all other writes
are simple already. It might be reasonable to ask whether interrupts could cause any issues to this
requirement of precise timing. The answer is no, since this code runs in HardFault context - interrupts
will wait for it to be finished. This prioritization is important, since it also allows the interrupt handlers to
easily write to ROMRAM.
Memory protection
This brings us to another interesting topic. I mentioned that the MPU is used to catch the writes. But I
did not mention disabling the MPU. One might ask how it is that I flush the proper cache lines without
re-triggering it (since cache flushes are done via writes). The answer is HFNMIENA. This bit in the MPU
control register needs to be left clear. That tells the CPU core to ignore the MPU while running in HardFault and NMI
contexts. Not having to wrangle the MPU for each write saves valuable cycles in the handler, allowing it
to be faster. But what if you do not want the entire ROMRAM region to be writeable? This is supported.
Two global variables exist. One (mRomRamStart) records the first writeable ROMRAM
address, the other (mRomRamLen) records the writeable area length. They may be modified anytime to
adjust the writeable region. In the rePalm project, I use them to split ROMRAM into three regions, for
example. Region A is always below mRomRamStart and is always read-only (the copy of the code we're
running that the second stage loader copied to RAM). Region B is next and is writeable or not based on
an API call to protect it or not (PalmOS is weird). Region C is always writeable. This is pretty easy to do
with the provided knobs.
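The bounds check itself can be sketched as below. The two variable names come from the text above. Their types, the function name, and the exact comparison are assumptions made for illustration (the real check lives in the assembly handler).

```c
#include <stdbool.h>
#include <stdint.h>

/* Names from the article; the uint32_t types are an assumption. */
static uint32_t mRomRamStart;   /* first writeable ROMRAM address */
static uint32_t mRomRamLen;     /* length of the writeable area in bytes */

/* True if a write of `size` bytes at `addr` lies entirely inside the
 * writeable window. The unsigned subtraction wraps to a huge value when
 * addr is below mRomRamStart, so one comparison rejects both sides. */
static bool romram_write_allowed(uint32_t addr, uint32_t size)
{
    uint32_t offset = addr - mRomRamStart;

    return offset < mRomRamLen && size <= mRomRamLen - offset;
}
```

Making a region like region B writeable or read-only then amounts to moving `mRomRamStart` (and adjusting `mRomRamLen` to match) across its boundary.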
Multi-CPU
What would multi-CPU support look like for ROMRAM? You'd simply need to use one of the
hardware mutexes to make sure two cores do not try to write at the same time. I leave this as an
exercise to the reader. The rest of it will work. Just point the HardFault vectors from both CPUs to the
same ROMRAM HardFault handler and you're done. Cool, right?
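To show the shape of that exercise, here is a host-runnable sketch that uses a C11 `atomic_flag` as a stand-in for the hardware lock. On the real chip the lock would be one of the hardware units instead, since the Cortex-M0+ has no atomic read-modify-write instructions. Every name here is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for one hardware lock shared by both cores. */
static atomic_flag romram_write_lock = ATOMIC_FLAG_INIT;

/* Try to take the lock once. Returns true if we now own it. */
static bool romram_try_lock(void)
{
    return !atomic_flag_test_and_set_explicit(&romram_write_lock,
                                              memory_order_acquire);
}

/* Spin until the other core has finished its SSI write. */
static void romram_lock(void)
{
    while (!romram_try_lock()) {
        /* busy-wait: the other core's handler is short */
    }
}

/* Release the lock once the write and the cache flush are done. */
static void romram_unlock(void)
{
    atomic_flag_clear_explicit(&romram_write_lock, memory_order_release);
}
```

The handler would call `romram_lock()` before touching the SSI and `romram_unlock()` after the cache flush, so the two cores never interleave their write commands.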
Performance
Ok, the million dollar question: how fast is it? Well, reads and execute are native speed, since they work
via the usual pathways and cache. Write speeds depend on how writes are done. Each write instruction
is emulated, and thus write instructions that write more produce higher throughput. This is good news
for things like memcpy, since that kind of code usually uses STMIA. To put some hard numbers on it, I see
memcpy to ROMRAM hitting 36Mbit/s at stock clock rates, which is not too terrible. This works well for
cases when memory is more read than written (which is common). We can approximate the actual cost
of a write by looking at the instructions of the handler. The actual math differs based on the registers
used. Let's check on a simple STR(immediate). The exception entry and exit take 12 and 10 cycles
respectively. Exception entry code to handle various entry modes and getting the proper exception frame
pointer takes 6 or 7 cycles. Dispatching based on instruction type takes 9 cycles. Getting the address
calculated takes around 17 cycles. The math to verify that we're within writeable bounds takes 11 cycles
or so. Getting the value to write takes around 10 cycles. Issuing the write command takes 28 cycles. Then
we wait for the SSI to finish. At DIV of 4, it will need 256 cycles to finish issuing the write command. But
we overlap with the first 6 of those in code, so effectively it takes 250 cycles of waiting for us to continue.
Cleanup takes 10 more cycles. So all-in-all, a single word write took us
12+10+6+9+17+11+10+28+250+10 = 363 cycles. Some of this could be cut a little with some creative
work (e.g. by overlapping more of the SSI write with the data-getting). This optimization is also left as an
exercise to the reader.
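The cycle budget above can be turned into a rough throughput figure for single-word stores. The per-step numbers are the ones listed in the text (taking 6 for the entry code). The 125 MHz core clock is an assumption about what "stock" means here, and the function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Cycle counts for one STR (immediate), in the order given in the text:
 * exception entry, exception exit, entry code, dispatch, address
 * calculation, bounds check, value fetch, command issue, SSI wait,
 * cleanup. */
static const unsigned str_imm_cycles[] = {12, 10, 6, 9, 17, 11, 10, 28, 250, 10};

/* Total cycles for one emulated single-word store. */
static unsigned total_cycles(void)
{
    unsigned sum = 0;

    for (size_t i = 0; i < sizeof(str_imm_cycles) / sizeof(str_imm_cycles[0]); i++)
        sum += str_imm_cycles[i];
    return sum;
}

/* Bits per second achieved by back-to-back single-word stores. */
static uint32_t str_throughput_bps(uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)core_hz * 32u) / total_cycles());
}
```

At an assumed 125 MHz this works out to roughly 11 Mbit/s for plain STR, which fits with the higher memcpy figure quoted earlier: STMIA pays the fixed fault overhead once for several words.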
Download
The code download for the second stage loader and the HardFault handler is [HERE]. License is BSD 2-
clause. I am too lazy (and disgusted) to turn this into some sort of an Arduino or a MicroPython plugin,
but I am sure someone else will. My provided code will build standalone with no dependency on
anything. Enjoy.
© 2012-2025