IR Interpreter optimization ideas #19143

@hrydgard

The IR Interpreter (code link) is more important now that iOS has joined the primary supported platforms.

Mainly, this will be about reducing the number of instructions interpreted (to reduce interpreter overhead) and making simpler versions of instructions that are faster to run.

Ideas

We should not allocate blocks individually; instead, use a large bump allocator. This will slightly help cache locality, and we can use offsets into this buffer in the pseudo-instructions to avoid a lookup per block.
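A minimal sketch of the bump-allocator idea, assuming a simplified instruction layout. `Inst`, `InstArena`, `Append` and `At` are hypothetical names for illustration, not the existing PPSSPP types. Every block is appended to one contiguous buffer and is referred to by its offset:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction layout, for illustration only.
struct Inst {
    uint8_t op;
    uint8_t dest;
    uint8_t src1;
    uint8_t src2;
    uint32_t constant;
};

// Bump arena: the instructions of all blocks live back to back in one buffer.
class InstArena {
public:
    explicit InstArena(size_t capacity) { storage_.reserve(capacity); }

    // Copies `count` instructions to the end of the arena and returns the
    // offset (counted in instructions) of the first one.
    uint32_t Append(const Inst *insts, size_t count) {
        // Capacity is fixed up front so the buffer never reallocates.
        assert(storage_.size() + count <= storage_.capacity());
        uint32_t offset = (uint32_t)storage_.size();
        storage_.insert(storage_.end(), insts, insts + count);
        return offset;
    }

    // A block is identified by its offset, so no per-block lookup is needed.
    const Inst *At(uint32_t offset) const { return storage_.data() + offset; }
    size_t Used() const { return storage_.size(); }

    // Invalidating everything at once is just a reset of the bump position.
    void Clear() { storage_.clear(); }

private:
    std::vector<Inst> storage_;
};
```

An exit pseudo-instruction could then carry the target block's offset in its constant field and jump straight to `At(offset)`.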

  • Make Downcount metadata on the block, when there's only one Downcount instruction and it's before any branches or branch targets (the latter conditions may be needed for some future optimizations). Or just move them to the top of the block, assuming that they're there.
  • Multi-load/multi-store instructions for consecutive registers (to replace long sequences of, say, Load32 s0, sp, 0x14; Load32 s1, sp, 0x18; Load32 s2, sp, 0x1C; etc). Additionally, consecutive sequences of zero stores are common. This will need a sorting pass first. For floats, it's common to see Store32 f20, sp, 4; Store32 f22, sp, 8;. That register stride is probably an artifact of a MIPS compiler for a CPU that had double precision support (f20:f21 would be one register then).
  • Vec4Scale + Vec4Add can often be merged into a new Vec4ScaleAdd without adding more operands
  • Super-specialized instructions, for example:
    • AddConst sp, sp, 0x30 is very common, it might be very slightly beneficial to add Increment sp, 0x30
    • sw zero, sp, 0x10 is quite common, can save a register file access
    • Vec4Shuffle: specialize the most common shuffle patterns
    • Vec4Blend: specialize the most common blend patterns
    • Matrix multiplication should be done in one IR instruction (even if broken apart for the other backends to share logic)
    • FCmp + FPCondToReg in one
  • Write the IRInterpreter in assembler (hopefully not needed, but would love to get rid of the range check that the switch emits, at least on x86). Skipped in favor of "IR Interpreter: Improve code gen for the main interpreter loop" (#20875).
  • Avoid breaking apart some instructions that require a lot of tiny instructions as output, like VDot. We do want to break these apart for JitIR compilation, but not for interpretation, so we might need to compile into IR differently depending on context.
  • Syscall could merge with RestoreRoundingMode/ApplyRoundingMode
  • lv.s should not become a complicated Shuffle/Blend thing
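As a rough illustration of the multi-load idea from the list above, here is a sketch of a pass that collapses runs of adjacent loads into a single instruction. The `Op` values, the `Inst` layout and `LoadMulti32` are assumptions made for this example, not existing IR ops, and the sketch only merges loads that are already adjacent and in order (it does not include the sorting pass):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ops and layout, for illustration only.
enum class Op : uint8_t { Load32, LoadMulti32, Other };

struct Inst {
    Op op;
    uint8_t dest;    // first destination register
    uint8_t base;    // base address register
    uint8_t count;   // number of registers (LoadMulti32 only)
    int32_t offset;  // byte offset from base
};

// Merges runs like "Load32 r, b, off; Load32 r+1, b, off+4; ..." into one
// LoadMulti32. Loads that overwrite the base register are never merged.
std::vector<Inst> MergeConsecutiveLoads(const std::vector<Inst> &in) {
    const size_t kMaxRun = 32;  // arbitrary cap so `count` cannot overflow
    std::vector<Inst> out;
    size_t i = 0;
    while (i < in.size()) {
        const Inst &first = in[i];
        size_t run = 1;
        if (first.op == Op::Load32 && first.dest != first.base) {
            while (i + run < in.size() && run < kMaxRun) {
                const Inst &next = in[i + run];
                bool extends = next.op == Op::Load32 &&
                    next.base == first.base &&
                    next.dest != first.base &&
                    next.dest == first.dest + run &&
                    next.offset == first.offset + 4 * (int32_t)run;
                if (!extends)
                    break;
                run++;
            }
        }
        if (run >= 2) {
            out.push_back({Op::LoadMulti32, first.dest, first.base, (uint8_t)run, first.offset});
        } else {
            out.push_back(first);
        }
        i += run;
    }
    return out;
}
```

The interpreter would then pay the dispatch cost once per run instead of once per register.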

There's a bit of tension between keeping the IR easy to run passes on, and making it fast to interpret. Additionally, if we make too many instructions they'll put higher load on the instruction cache...

Maybe we should translate further to an even more specialized interpreter IR, or, only ever use the super-specialized instructions in the very final optimization pass so only the interpreter needs to care about them. In that case they should be marked clearly.

Additionally, things like inlining of small blocks (that would also apply to the JIT) may help.

Pass optimization:

  • Imm tracking for floats could enable replacing long streaks of SetFloatConst f12, 0; StoreFloat f14, a0, offset; StoreFloat f14, a0, offset+4; etc

Missed peephole optimizations:

From GTA: this simply converts fixed-point 16-bit 4-vectors to float, but generates a lot of bloat:
vs2i.p C100, C200
vi2f.q C100, C100, 23

AddConst a2, sp, 70
AddConst a2, a2, 8
> Should be AddConst a2, sp, 78
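
A sketch of how this AddConst fold could look as a peephole over a single block. The `Op` and `Inst` definitions are simplified stand-ins assumed for the example, and it assumes there is no branch target between the two instructions:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for the IR, for illustration only.
enum class Op : uint8_t { AddConst, Other };

struct Inst {
    Op op;
    uint8_t dest;
    uint8_t src;
    uint32_t constant;
};

// Folds "AddConst d, s, A; AddConst d, d, B" into "AddConst d, s, A+B".
std::vector<Inst> FoldAddConstChains(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (!out.empty() && inst.op == Op::AddConst) {
            Inst &prev = out.back();
            if (prev.op == Op::AddConst && inst.src == prev.dest && inst.dest == prev.dest) {
                // Unsigned addition wraps, matching 32-bit register arithmetic.
                prev.constant += inst.constant;
                continue;
            }
        }
        out.push_back(inst);
    }
    return out;
}
```

Because the fold updates the last emitted instruction, longer chains of three or more AddConst collapse as well.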

FMovFromGPR f12, a1
FCvtSW f12, f12
> should be combined into a move-and-convert instruction. Common in Syphon Filter

sll t0, t0, 0x18
sra t0, t0, 0x18
> this is just a sign extension byte->word
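
A sketch of a peephole for this shift pair, assuming simplified IR types. The op names `ShlImm`, `SarImm`, `Ext8to32` and `Ext16to32` and the `Inst` layout are placeholders chosen for the example. It also handles the 16-bit variant (shift by 0x10), which follows the same pattern:

```cpp
#include <cstdint>
#include <vector>

// Placeholder ops and layout, for illustration only.
enum class Op : uint8_t { ShlImm, SarImm, Ext8to32, Ext16to32, Other };

struct Inst {
    Op op;
    uint8_t dest;
    uint8_t src;
    uint8_t imm;  // shift amount, unused by the Ext ops
};

// Rewrites "ShlImm d, s, N; SarImm d, d, N" into a single sign extension
// when N is 24 (byte to word) or 16 (halfword to word).
std::vector<Inst> FoldShiftPairs(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (!out.empty() && inst.op == Op::SarImm) {
            Inst &prev = out.back();
            bool pair = prev.op == Op::ShlImm && prev.dest == inst.src &&
                inst.dest == inst.src && prev.imm == inst.imm;
            if (pair && (inst.imm == 24 || inst.imm == 16)) {
                // Keep prev's dest and src, only the operation changes.
                prev.op = inst.imm == 24 ? Op::Ext8to32 : Op::Ext16to32;
                prev.imm = 0;
                continue;
            }
        }
        out.push_back(inst);
    }
    return out;
}
```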

Common when writing floats to display list:

FMovToGPR a2, f12
ShrImm a2, 08
Or a2, a2, t3
Store32 a2, a1, 0000000

This one takes four inputs though.

In the Wipeout games, for some reason (can just omit the load):
sv.q C400, 0x90(sp)
lv.q C400, 0x90(sp)
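
A sketch of dropping such a reload, assuming simplified IR types. `StoreVec4`, `LoadVec4` and the `Inst` layout are placeholders for the example. It only removes a load that immediately follows a store of the same register to the same base and offset, and it assumes the address is ordinary RAM (such as the stack) rather than a hardware register:

```cpp
#include <cstdint>
#include <vector>

// Placeholder ops and layout, for illustration only.
enum class Op : uint8_t { StoreVec4, LoadVec4, Other };

struct Inst {
    Op op;
    uint8_t reg;     // first vector register being stored or loaded
    uint8_t base;    // base address register
    int32_t offset;  // byte offset from base
};

// Removes "LoadVec4 r, b, off" when it directly follows "StoreVec4 r, b, off".
// The register already holds the value that was just written to memory.
std::vector<Inst> DropRedundantReloads(const std::vector<Inst> &in) {
    std::vector<Inst> out;
    for (const Inst &inst : in) {
        if (!out.empty() && inst.op == Op::LoadVec4) {
            const Inst &prev = out.back();
            if (prev.op == Op::StoreVec4 && prev.reg == inst.reg &&
                prev.base == inst.base && prev.offset == inst.offset) {
                continue;  // skip the load, keep the store
            }
        }
        out.push_back(inst);
    }
    return out;
}
```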

Test cases

Curiously heavy on the CPU:

  • Crash of the Titans - seems to be both spending a lot of time in a horrendously complex idle loop, and underutilizing the VFPU. Lots of inefficient FPU code, spending lots of time storing and loading regs to/from the stack.

Labels: IRInterpreter (Occurs with IR Interpreter but not with another CPU backend.)