Emulation Outline
Emulation
Interpretation
basic, threaded, direct threaded
other issues
Binary translation
code discovery, code location
other issues
Control Transfer Optimizations
Key VM Technologies
Emulation
binary in one ISA is executed on a processor supporting a different ISA
Dynamic Optimization
binary is improved for higher performance
may be done as part of emulation
may optimize same ISA (no emulation needed)
[Figure: Emulation (x86 apps and Windows on Alpha) and Optimization (HP Apps. and HP UX on the HP PA ISA)]
Emulation Vs. Simulation
Emulation
method for enabling a (sub)system to present the same interface and characteristics as another
e.g., the execution of programs compiled for instruction set A on a machine that executes instruction set B
ways of implementing emulation
interpretation: relatively inefficient, instruction-at-a-time
binary translation: block-at-a-time, optimized for repeated execution
Simulation
method for modeling a (sub)system's operation
objective is to study the process, not just to imitate the function
typically emulation is part of the simulation process
Definitions
Guest
environment being supported by underlying platform
Host
underlying platform that provides guest environment
[Figure: Guest, supported by Host]
Definitions (2)
Source ISA or binary
original instruction set or binary
the ISA to be emulated
Target ISA or binary
ISA of the host processor
underlying ISA
Source/Target refer to ISAs
Guest/Host refer to platforms
[Figure: Source, emulated by Target]
Emulation
Required for implementing many VMs
Process of implementing the interface and functionality of one (sub)system on a (sub)system having a different interface and functionality
e.g., terminal emulators, such as for VT100, xterm, putty
Instruction set emulation
binaries in source instruction set can be executed on machine implementing target instruction set
e.g., IA-32 execution layer
Interpretation Vs. Translation
Interpretation
simple and easy to implement, portable
low performance
threaded interpretation
Binary translation
complex implementation
high initial translation cost, small execution cost
selective compilation
We focus on user-level instruction set emulation of program binaries.
Interpreter State
An interpreter needs to maintain the complete architected state of the machine implementing the source ISA
registers
memory: code, data, stack
[Figure: interpreter state in host memory: Code, Data, and Stack of the source program; Program Counter, Condition Codes, and Reg 0 ... Reg n-1; Interpreter Code]
Decode Dispatch Interpreter
Decode and dispatch interpreter
step through the source program one instruction at a time
decode the current instruction
dispatch to corresponding interpreter routine
very high interpretation cost

while (!halt && !interrupt) {
    inst = code[PC];                   /* fetch the next source instruction */
    opcode = extract(inst,31,6);       /* primary opcode field */
    switch(opcode) {                   /* instruction function list */
        case LoadWordAndZero: LoadWordAndZero(inst); break;
        case ALU: ALU(inst); break;
        case Branch: Branch(inst); break;
        . . .
    }
}
Decode Dispatch Interpreter (2)
Instruction function: Load
LoadWordAndZero(inst){
    RT = extract(inst,25,5);                  /* target register */
    RA = extract(inst,20,5);                  /* base address register */
    displacement = extract(inst,15,16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32)>> 32;     /* zero the upper 32 bits */
    PC = PC + 4;
}
Decode Dispatch Interpreter (3)
Instruction function: ALU
ALU(inst){
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    RB = extract(inst,15,5);
    source1 = regs[RA];
    source2 = regs[RB];
    extended_opcode = extract(inst,10,10);
    switch(extended_opcode) {                 /* second-level dispatch */
        case Add: Add(inst); break;
        case AddCarrying: AddCarrying(inst); break;
        case AddExtended: AddExtended(inst); break;
        . . .
    }
    PC = PC + 4;
}
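The loop and the two instruction functions above can be combined into a small compilable sketch. Only the field positions follow the slides' extract(inst, start, length) calls; the opcode numbers (OP_LWZ, OP_ALU, XO_ADD), the use of opcode 0 as a halt, the memory sizes, the big-endian data image, and the enc_* helpers are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbers and sizes, chosen only for this sketch. */
enum { OP_HALT = 0, OP_ALU = 31, OP_LWZ = 32, XO_ADD = 266 };
enum { CODE_WORDS = 64, DATA_BYTES = 256 };

static uint32_t code[CODE_WORDS];   /* source code image, one word per instruction */
static uint8_t  data[DATA_BYTES];   /* source data image (big-endian words assumed) */
static uint64_t regs[32];           /* source general-purpose registers */
static uint32_t PC;                 /* source program counter (byte address) */
static int      halt;

/* Extract `len` bits whose most significant bit is bit `start` (bit 0 = LSB). */
uint32_t extract(uint32_t inst, int start, int len) {
    assert(len > 0 && len < 32 && start >= len - 1 && start < 32);
    return (inst >> (start - len + 1)) & ((1u << len) - 1u);
}

void LoadWordAndZero(uint32_t inst) {
    uint32_t RT = extract(inst, 25, 5);
    uint32_t RA = extract(inst, 20, 5);
    uint32_t displacement = extract(inst, 15, 16);
    uint64_t source = (RA == 0) ? 0 : regs[RA];
    uint32_t address = (uint32_t)(source + displacement);
    assert(address + 3 < (uint32_t)DATA_BYTES);
    /* Assemble a 32-bit word; the upper 32 bits of regs[RT] become zero. */
    regs[RT] = ((uint32_t)data[address] << 24) | ((uint32_t)data[address + 1] << 16) |
               ((uint32_t)data[address + 2] << 8) | (uint32_t)data[address + 3];
    PC += 4;
}

void ALU(uint32_t inst) {
    uint32_t RT = extract(inst, 25, 5);
    uint32_t RA = extract(inst, 20, 5);
    uint32_t RB = extract(inst, 15, 5);
    switch (extract(inst, 10, 10)) {            /* extended opcode */
    case XO_ADD: regs[RT] = regs[RA] + regs[RB]; break;
    default:     halt = 1; break;               /* not implemented in this sketch */
    }
    PC += 4;
}

/* Decode-dispatch loop; returns the number of instructions interpreted,
   counting the instruction that stops the loop. */
int interpret(void) {
    int count = 0;
    halt = 0;
    PC = 0;
    while (!halt) {
        assert(PC / 4 < (uint32_t)CODE_WORDS);
        uint32_t inst = code[PC / 4];
        switch (extract(inst, 31, 6)) {         /* primary opcode */
        case OP_LWZ: LoadWordAndZero(inst); break;
        case OP_ALU: ALU(inst); break;
        default:     halt = 1; break;           /* OP_HALT or unknown opcode */
        }
        count++;
    }
    return count;
}

/* Assemble instruction words for the assumed encodings. */
uint32_t enc_lwz(uint32_t rt, uint32_t ra, uint32_t d) {
    return ((uint32_t)OP_LWZ << 26) | (rt << 21) | (ra << 16) | d;
}
uint32_t enc_add(uint32_t rt, uint32_t ra, uint32_t rb) {
    return ((uint32_t)OP_ALU << 26) | (rt << 21) | (ra << 16) | (rb << 11) |
           ((uint32_t)XO_ADD << 1);
}
```

To try it, store a word in data[], place two enc_lwz words and one enc_add word in code[], and call interpret(); the zero word that follows acts as the halt.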
Decode Dispatch Efficiency
Decode-Dispatch Loop
mostly serial code
case statement (hard-to-predict indirect jump)
call to function routine
return
Executing an add instruction
approximately 20 target instructions
several loads/stores and shift/mask steps
Hand-coding can lead to better performance
example: DEC/Compaq FX!32
Indirect Threaded Interpretation
High number of branches in decode-dispatch interpretation reduces performance
overhead of 5 branches per instruction
Threaded interpretation improves efficiency by reducing branch overhead
append dispatch code to each interpretation routine
removes 3 branches
threads together function routines
Indirect Threaded Interpretation (2)
LoadWordAndZero:
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    displacement = extract(inst,15,16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];                          /* dispatch code appended to the routine */
    opcode = extract(inst,31,6);
    extended_opcode = extract(inst,10,10);
    routine = dispatch[opcode][extended_opcode];
    goto *routine;
Indirect Threaded Interpretation (3)
Add:
    RT = extract(inst,25,5);
    RA = extract(inst,20,5);
    RB = extract(inst,15,5);
    source1 = regs[RA];
    source2 = regs[RB];
    sum = source1 + source2;
    regs[RT] = sum;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];                          /* same dispatch code, replicated */
    opcode = extract(inst,31,6);
    extended_opcode = extract(inst,10,10);
    routine = dispatch[opcode][extended_opcode];
    goto *routine;
Indirect Threaded Interpretation (4)
Dispatch occurs indirectly through a table
interpretation routines can be modified and relocated independently
Advantages
binary intermediate code still portable
improves efficiency over basic interpretation
Disadvantages
code replication increases interpreter size
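Indirect threaded dispatch can be sketched with the labels-as-values extension of GCC and Clang (&&label and goto *ptr), which is not part of standard C. The one-byte toy ISA (OP_INC, OP_DOUBLE, OP_HALT) and the function name run_threaded are assumptions for illustration; the point is that every routine ends with its own copy of the fetch, table lookup, and indirect jump.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy source ISA: one opcode per byte. */
enum { OP_HALT = 0, OP_INC = 1, OP_DOUBLE = 2, NUM_OPS = 3 };

/* Interpret `code`, starting with accumulator `acc`; returns the final value.
   Requires GCC or Clang (labels as values). */
long run_threaded(const unsigned char *code, size_t len, long acc) {
    /* Dispatch table: opcode -> address of its interpreter routine. */
    static void *const dispatch[NUM_OPS] = { &&do_halt, &&do_inc, &&do_double };
    size_t pc = 0;

    /* Fetch, look up, and jump. This code is replicated in every routine. */
#define DISPATCH()                                   \
    do {                                             \
        assert(pc < len && code[pc] < NUM_OPS);      \
        goto *dispatch[code[pc]];                    \
    } while (0)

    DISPATCH();

do_inc:
    acc += 1;
    pc += 1;
    DISPATCH();              /* jump straight to the next routine */

do_double:
    acc *= 2;
    pc += 1;
    DISPATCH();

do_halt:
    return acc;
#undef DISPATCH
}
```

Because the routines are reached only through the table, a routine can be moved or replaced by changing one table entry, which is the independence the slide refers to.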
Indirect Threaded Interpretation (5)
[Figure: Decode-dispatch vs. Threaded interpretation. Decode-dispatch: source code, "data" accesses, dispatch loop, interpreter routines. Threaded: source code, interpreter routines]
Predecoding
Parse each instruction into a pre-defined structure to facilitate interpretation
separate opcode, operands, etc.
reduces shifts / masks significantly
more useful for CISC ISAs
changes to input binary damages portability

source instruction    op   dest  src1  src2
lwz r1, 8(r2)         07   1     2     08     (load word and zero)
add r3, r3,r1         08   3     1     03     (add)
stw r3, 0(r4)         37   3     4     00     (store word)
Predecoding (2)
struct instruction {
    unsigned long op;
    unsigned char dest, src1, src2;
} code [CODE_SIZE];

LoadWordAndZero:
    RT = code[TPC].dest;                      /* fields were separated by the predecoder */
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    SPC = SPC + 4;                            /* source PC */
    TPC = TPC + 1;                            /* index into the predecoded code */
    if (halt || interrupt) goto exit;
    opcode = code[TPC].op;
    routine = dispatch[opcode];
    goto *routine;
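The predecoding pass itself is not shown on the slides. A minimal sketch, assuming the same field positions as the earlier extract() calls, might look like the following. The opcode numbers, the widening of src2 so that it can hold a 16-bit displacement, and the function name predecode are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Predecoded form, after the slide's struct instruction. src2 is widened
   here (an assumption) so that it can also hold a 16-bit displacement. */
struct instruction {
    unsigned long  op;
    unsigned char  dest, src1;
    unsigned short src2;
};

enum { OP_ALU = 31, OP_LWZ = 32 };   /* hypothetical primary opcodes */

/* Extract `len` bits whose most significant bit is bit `start` (bit 0 = LSB). */
uint32_t extract(uint32_t inst, int start, int len) {
    assert(len > 0 && len < 32 && start >= len - 1 && start < 32);
    return (inst >> (start - len + 1)) & ((1u << len) - 1u);
}

/* Predecode n source words into out[]. All shifting and masking happens
   once here instead of every time an instruction is interpreted. */
void predecode(const uint32_t *src, size_t n, struct instruction *out) {
    for (size_t i = 0; i < n; i++) {
        uint32_t inst = src[i];
        out[i].op   = extract(inst, 31, 6);
        out[i].dest = (unsigned char)extract(inst, 25, 5);
        out[i].src1 = (unsigned char)extract(inst, 20, 5);
        if (out[i].op == OP_ALU)
            out[i].src2 = (unsigned short)extract(inst, 15, 5);    /* register RB */
        else
            out[i].src2 = (unsigned short)extract(inst, 15, 16);   /* displacement */
    }
}
```

A fuller predecoder would also fold the extended opcode into op so that a single table lookup selects the routine; that step is left out here to keep the sketch short.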
Direct Threaded Interpretation
Allow even higher efficiency by
removing the memory access to the centralized table
requires predecoding
dependent on locations of interpreter routines
loses portability

routine address   dest  src1  src2
001048d0          1     2     08     (load word and zero)
00104800          3     1     03     (add)
00104910          3     4     00     (store word)
Direct Threaded Interpretation (2)
Predecode the source binary into an intermediate structure
Replace the opcode in the intermediate form with the address of the interpreter routine
Remove the memory lookup of the dispatch table
Limits portability since exact locations of the interpreter routines are needed
Direct Threaded Interpretation (3)
LoadWordAndZero:
    RT = code[TPC].dest;
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address]<< 32) >> 32;
    SPC = SPC + 4;
    TPC = TPC + 1;
    if (halt || interrupt) goto exit;
    routine = code[TPC].op;                   /* op already holds the routine address */
    goto *routine;
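A direct threaded interpreter can be approximated in standard C by storing function pointers in the predecoded code, since standard C has no label addresses. This is a stand-in for the goto *routine form above, not the same mechanism: each routine calls the next one, and the calls unwind when the halt routine returns, so it suits only short programs unless the compiler turns the calls into jumps. The toy routines and the struct vm layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

struct vm;                                   /* interpreter state, defined below */
typedef void (*routine_t)(struct vm *);

/* Predecoded instruction: `op` holds the routine's address, not an opcode. */
struct instruction {
    routine_t     op;
    unsigned char dest, src1, src2;
};

struct vm {
    const struct instruction *code;          /* predecoded intermediate code */
    size_t TPC;                              /* index into code[] */
    long   regs[8];
};

/* Shared tail of every routine: advance TPC and go straight to the next
   routine through the address stored in the intermediate code (no table). */
static void next(struct vm *vm) {
    vm->TPC += 1;
    vm->code[vm->TPC].op(vm);
}

void do_add(struct vm *vm) {
    const struct instruction *i = &vm->code[vm->TPC];
    vm->regs[i->dest] = vm->regs[i->src1] + vm->regs[i->src2];
    next(vm);
}

/* Hypothetical instruction: load the constant held in src1 into dest. */
void do_load_imm(struct vm *vm) {
    const struct instruction *i = &vm->code[vm->TPC];
    vm->regs[i->dest] = i->src1;
    next(vm);
}

void do_halt(struct vm *vm) {
    (void)vm;                                /* returning unwinds the chain of calls */
}

/* Start execution at the first predecoded instruction. */
void run(struct vm *vm) {
    assert(vm->code != NULL);
    vm->TPC = 0;
    vm->code[0].op(vm);
}
```

The intermediate code is only valid for one build of the interpreter, because it embeds routine addresses; this is the portability loss noted on the slides.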
Direct Threaded Interpretation (4)
[Figure: the predecoder converts source code into intermediate code, which is executed by the interpreter routines]
Interpreter Control Flow
Decode for CISC ISA
Individual routines for each instruction
[Figure: General Decode (fill-in instruction structure), then Dispatch to one of Inst. 1 ... Inst. n specialized routines]
Interpreter Control Flow (2)
For CISC ISAs
multiple byte opcode
make common cases fast
[Figure: Dispatch on first byte selects Simple Inst. 1 ... Simple Inst. m specialized routines, Complex Inst. m+1 ... Complex Inst. n specialized routines, or Prefix set flags; Shared Routines]
Binary Translation
Translate source binary program to target binary before execution
is the logical conclusion of predecoding
get rid of parsing and jumps altogether
allows optimizations on the native code
achieves higher performance than interpretation
needs mapping of source state onto the host state (state mapping)
Binary Translation (2)
x86 Source Binary
addl %edx,4(%eax)
movl 4(%eax),%edx
add  %eax,4
Translate to PowerPC Target
r1 points to x86 register context block
r2 points to x86 memory image
r3 contains x86 ISA PC value
Binary Translation (3)
lwz   r4,0(r1)     ;load %eax from register block
addi  r5,r4,4      ;add 4 to %eax
lwzx  r5,r2,r5     ;load operand from memory
lwz   r4,12(r1)    ;load %edx from register block
add   r5,r4,r5     ;perform add
stw   r5,12(r1)    ;put result into %edx
addi  r3,r3,3      ;update PC (3 bytes)

lwz   r4,0(r1)     ;load %eax from register block
addi  r5,r4,4      ;add 4 to %eax
lwz   r4,12(r1)    ;load %edx from register block
stwx  r4,r2,r5     ;store %edx value into memory
addi  r3,r3,3      ;update PC (3 bytes)

lwz   r4,0(r1)     ;load %eax from register block
addi  r4,r4,4      ;add immediate
stw   r4,0(r1)     ;place result back into %eax
addi  r3,r3,3      ;update PC (3 bytes)
Binary Translation (4)
[Figure: the binary translator converts source code into binary translated target code]
State Mapping
Maintaining the state of the source machine on the host (target) machine
state includes source registers and memory contents
source registers can be held in host registers or in host memory
holding source registers in host registers reduces loads/stores significantly
easier if the target has more registers than the source
Register Mapping
Map source registers to target registers
spill registers if needed
if target registers < source registers
map some to memory
map on per-block basis
Reduces load/store significantly
improves performance
[Figure: source ISA state mapped onto target ISA registers: R1 points to the Source Register Block, R2 points to the Source Memory Image, R3 holds the program counter, and the stack pointer, reg 1, reg 2, ..., reg n are held in further target registers up to RN+4]
Register Mapping (2)
r1 points to x86 register context block
r2 points to x86 memory image
r3 contains x86 ISA PC value
r4 holds x86 register %eax
r7 holds x86 register %edx
etc.

addi  r16,r4,4     ;add 4 to %eax
lwzx  r17,r2,r16   ;load operand from memory
add   r7,r17,r7    ;perform add of %edx
addi  r16,r4,4     ;add 4 to %eax
stwx  r7,r2,r16    ;store %edx value into memory
addi  r4,r4,4      ;increment %eax
addi  r3,r3,9      ;update PC (9 bytes)
Predecoding Vs. Binary Translation
Predecoded code still requires the interpretation routines in order to execute
After binary translation, code can be directly executed
Code Discovery Problem
May be difficult to statically translate or predecode the entire source program
Consider x86 code
31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00
decoding from 8b b5: movl %esi, 0x08030000(%ebp) ??
decoding from b5 00: mov %ch,0 ??
Code Discovery Problem (2)
Contributors to code discovery problem
variable-length (CISC) instructions
indirect jumps
data interspersed with code
padding instructions to align branch targets
[Figure: source ISA instruction stream: inst. 1, inst. 2, inst. 3, jump reg., data (data in instruction stream), inst. 5, inst. 6, uncond. brnch, pad (pad for instruction alignment), jump indirect to ???, inst. 8]
Code Location Problem
Mapping of the source program counter to the target PC for indirect jumps
indirect jump addresses in the translated code still refer to source addresses
x86 source code
movl  %eax, 4(%esp)   ;load jump address from memory
jmp   %eax            ;jump indirect through %eax
PowerPC target code
addi  r16,r11,4       ;compute x86 address
lwzx  r4,r2,r16       ;get x86 jump address from x86 memory image
mtctr r4              ;move to count register
bctr                  ;jump indirect through ctr
Simplified Solutions
Fixed-width RISC ISA: instructions are always aligned on fixed boundaries
Use special instruction sets (Java)
no jumps/branches to arbitrary locations
no data or pads mixed with instructions
all code can then be discovered
Use incremental dynamic translation
Incremental Code Translation
First interpret
perform code discovery as a by-product
Translate code
incrementally, as it is discovered
place translated code in code cache
use lookup table to save source to target PC mappings
Emulation process
execute translated block
lookup next source PC in lookup table
if translated, jump to target PC
else, interpret and translate
Incremental Code Translation (2)
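[Figure: the Emulation Manager looks up the source PC in the SPC to TPC Lookup Table; on a hit, control goes to the translated code in Translation Memory; on a miss, the Interpreter and translator process the source binary]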
Dynamic Basic Block
Unit of translation during dynamic translation
Leaders identify starts of static basic blocks
first program instruction
instruction following a branch or jump
target of a branch or jump
Runtime control flow identifies dynamic blocks
instruction following a taken branch or jump at runtime
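The three leader rules for static basic blocks can be sketched as a single pass over a decoded program. The struct insn layout and the function names are assumptions; an indirect jump is modeled as a branch whose target is unknown statically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical decoded instruction: only what leader detection needs. */
struct insn {
    bool   is_branch;    /* branch or jump of any kind */
    bool   has_target;   /* false for indirect jumps (target unknown statically) */
    size_t target;       /* index of the target instruction, if has_target */
};

/* Set leader[i] for every instruction that starts a static basic block. */
void find_leaders(const struct insn *prog, size_t n, bool *leader) {
    for (size_t i = 0; i < n; i++)
        leader[i] = false;
    if (n == 0)
        return;
    leader[0] = true;                              /* first program instruction */
    for (size_t i = 0; i < n; i++) {
        if (!prog[i].is_branch)
            continue;
        if (i + 1 < n)
            leader[i + 1] = true;                  /* instruction following a branch */
        if (prog[i].has_target) {
            assert(prog[i].target < n);
            leader[prog[i].target] = true;         /* target of a branch */
        }
    }
}

/* Each leader starts exactly one static basic block. */
size_t count_blocks(const bool *leader, size_t n) {
    size_t blocks = 0;
    for (size_t i = 0; i < n; i++)
        if (leader[i])
            blocks++;
    return blocks;
}
```

Dynamic basic blocks cannot be found this way, because they depend on which branches are actually taken at runtime.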
Dynamic Basic Block (2)
Static Basic Blocks
block 1: add... load... store...
block 2: loop: load... add... store brcond skip
block 3: load... sub...
block 4: skip: add... store brcond loop
block 5: add... load... store... jump indirect ...

Dynamic Basic Blocks
block 1: add... load... store... loop: load... add... store brcond skip
block 2: load... sub... skip: add... store brcond loop
block 3: loop: load... add... store brcond skip
block 4: skip: add... store brcond loop
...
Flow of Control
Even after all blocks are translated, control flows between translated blocks and the emulation manager (EM)
EM connects the translated blocks during execution
Optimizations can reduce the overhead of going through the EM between every pair of translation blocks
Flow of Control (2)
[Figure: control passes between the translation blocks through the Emulation Manager]
Tracking the Source PC
Update SPC as part of translated code
place SPC in stub
General approach
translated code returns to EM via branch-and-link (BL)
SPC placed in stub immediately after BL
EM uses link register to find SPC and hash to next target code block
[Figure: each Code Block ends with a Branch and Link to EM followed by the Next Source PC; the Emulation Manager uses the Hash Table to find the next Code Block]
Emulation Manager Flowchart
Start with SPC
Lookup SPC -> TPC in Map Table
Hit in Table?
Yes: Branch to TPC and Execute Translated Block
No: Use SPC to Read Insts. from Source Memory Image; Interpret, Translate and Place into Translation Memory; Write new SPC -> TPC mapping into Table
Get SPC for next Block
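The flowchart can be sketched as a loop around a map table. In this sketch a "translated block" is just a C function that returns the next source PC, translation copies a function pointer into translation memory, and a freshly translated block is run directly instead of being interpreted first. All names, the table sizes, the linear-probing hash, and the three-block toy program are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SPC_HALT 0xFFFFFFFFu          /* hypothetical "program finished" marker */
enum { TABLE_SIZE = 64, TM_SIZE = 32 };

typedef uint32_t (*block_fn)(long *counter);   /* runs one block, returns next SPC */

/* Toy source program: three blocks at SPC 0, 4 and 8. */
uint32_t block_init(long *c) { *c = 3; return 4; }
uint32_t block_loop(long *c) { *c -= 1; return (*c > 0) ? 4u : 8u; }
uint32_t block_exit(long *c) { (void)c; return SPC_HALT; }
static const block_fn source_program[] = { block_init, block_loop, block_exit };

static block_fn translation_memory[TM_SIZE];   /* stand-in for the code cache */
static int tm_used;
static int translations;                       /* number of blocks translated */

static struct { uint32_t spc; int tpc; int valid; } map_table[TABLE_SIZE];

/* Look up an SPC in the map table; returns the TPC, or -1 on a miss. */
int map_lookup(uint32_t spc) {
    for (uint32_t probe = 0; probe < TABLE_SIZE; probe++) {
        uint32_t i = (spc / 4 + probe) % TABLE_SIZE;
        if (!map_table[i].valid) return -1;
        if (map_table[i].spc == spc) return map_table[i].tpc;
    }
    return -1;
}

/* Write a new SPC -> TPC mapping into the first free slot on the probe path. */
void map_insert(uint32_t spc, int tpc) {
    for (uint32_t probe = 0; probe < TABLE_SIZE; probe++) {
        uint32_t i = (spc / 4 + probe) % TABLE_SIZE;
        if (!map_table[i].valid) {
            map_table[i].spc = spc;
            map_table[i].tpc = tpc;
            map_table[i].valid = 1;
            return;
        }
    }
    assert(!"map table full");
}

/* "Translate" the block at SPC and place it into translation memory. */
int translate(uint32_t spc) {
    assert(spc / 4 < sizeof source_program / sizeof source_program[0]);
    assert(tm_used < TM_SIZE);
    translation_memory[tm_used] = source_program[spc / 4];
    translations++;
    return tm_used++;
}

/* Emulation manager loop; returns the number of blocks executed. */
int emulate(uint32_t spc, long *counter) {
    int executed = 0;
    while (spc != SPC_HALT) {
        int tpc = map_lookup(spc);
        if (tpc < 0) {                             /* miss: translate and record */
            tpc = translate(spc);
            map_insert(spc, tpc);
        }
        spc = translation_memory[tpc](counter);    /* execute, get SPC for next block */
        executed++;
    }
    return executed;
}
```

The loop block runs three times but is translated once; every later visit is a hit in the map table.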
Translation Chaining
Translation blocks are linked into chains
If the successor block has not yet been translated
code is inserted to jump to the EM
later, after jumping to the EM, if the EM finds that the successor block has been translated, then the jump is modified to instead point directly to the successor
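The patching step can be sketched with a data structure in which a pointer field plays the role of the modified jump. The struct tblock layout, the function names, and the linear search in find_block are assumptions; a real translator would overwrite the branch instruction in the code cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_BLOCKS = 16 };

/* Hypothetical translated block; `chain` stands in for the patched jump. */
struct tblock {
    uint32_t spc;            /* source PC this block was translated from */
    uint32_t next_spc;       /* source PC of the successor, kept in the exit stub */
    struct tblock *chain;    /* NULL: exit to the EM; otherwise: direct link */
};

static struct tblock blocks[MAX_BLOCKS];
static size_t num_blocks;
static int em_entries;       /* how often control returned to the EM */

/* Return the translated block for `spc`, or NULL if it is not translated yet. */
struct tblock *find_block(uint32_t spc) {
    for (size_t i = 0; i < num_blocks; i++)
        if (blocks[i].spc == spc)
            return &blocks[i];
    return NULL;
}

/* Create a new translated block whose exit initially goes to the EM. */
struct tblock *translate(uint32_t spc, uint32_t next_spc) {
    assert(num_blocks < MAX_BLOCKS);
    struct tblock *b = &blocks[num_blocks++];
    b->spc = spc;
    b->next_spc = next_spc;
    b->chain = NULL;
    return b;
}

/* Called when `pred` exits to the EM. If the successor has been translated,
   patch the exit so that later executions bypass the EM. */
struct tblock *em_exit(struct tblock *pred) {
    em_entries++;
    struct tblock *succ = find_block(pred->next_spc);
    if (succ != NULL)
        pred->chain = succ;
    return succ;
}

/* Follow chain links without entering the EM, for at most max_steps links. */
struct tblock *run_chain(struct tblock *b, int max_steps) {
    while (b->chain != NULL && max_steps-- > 0)
        b = b->chain;
    return b;
}
```

The first exit from a block finds no successor and leaves the stub untouched; once the successor exists, the next exit through the EM sets up the link.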
Translation Chaining (2)
[Figure: Without Chaining, control passes through the VMM between translation blocks. With Chaining, translation blocks are linked directly to one another.]
Translation Chaining (3)
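Creating a link:
[Figure: the Predecessor block ends with JAL TM followed by the next SPC; the manager gets the next SPC, looks up the Successor, and sets up the chain by replacing the JAL with a Jump to the successor's TPC]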
Translation Chaining (4)
PowerPC Translation
9AC0: lwz    r16,0(r4)    ;load value from memory
      add    r7,r7,r16    ;accumulate sum
      stw    r7,0(r5)     ;store to memory
      addic. r5,r5,-1     ;decrement loop count, set cr0
      beq    cr0,pc+12    ;branch if loop exit
      bl     F000         ;branch & link to EM
      4FDC                ;save source PC in link register
9AE4: b      9c08         ;branch along chain
      51C8                ;save source PC in link register
9C08: stw    r7,0(r6)     ;store last value of %edx
      xor    r7,r7,r7     ;clear %edx
      bl     F000         ;branch & link to EM
      6200                ;save source PC in link register
Software Indirect Jump Prediction
For blocks ending with an indirect jump
chaining cannot be used as destination can change
SPC-to-TPC map table lookup is expensive
indirect jump locations seldom change
use profiling to find the common jump addresses
inline frequently used SPC addresses; most frequent SPC destination addresses given first

if (Rx == addr_1) goto target_1
else if (Rx == addr_2) goto target_2
else if (Rx == addr_3) goto target_3
else hash_lookup(Rx) ; do it the slow way
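The compare chain can be written as a function that returns the target PC. The addresses, the small table, and the linear search standing in for the hash lookup are assumptions; in a real translator the comparisons would be emitted inline at the end of the block, using addresses found by profiling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SPC -> TPC pairs; the first three are the profiled hot targets. */
static const struct { uint32_t spc, tpc; } map_table[] = {
    { 0x1000u, 0x9000u }, { 0x2000u, 0x9400u },
    { 0x3000u, 0x9800u }, { 0x4000u, 0x9C00u },
};
static int slow_lookups;     /* counts how often the slow path was taken */

/* Stand-in for the SPC-to-TPC map table lookup (a linear search here).
   Returns 0 if the SPC has no translation yet. */
uint32_t hash_lookup(uint32_t spc) {
    slow_lookups++;
    for (size_t i = 0; i < sizeof map_table / sizeof map_table[0]; i++)
        if (map_table[i].spc == spc)
            return map_table[i].tpc;
    return 0;
}

/* Software prediction for one indirect jump: compare the source address in
   rx against the most frequent destinations first, then fall back. */
uint32_t predict_indirect(uint32_t rx) {
    if (rx == 0x1000u)      return 0x9000u;   /* most frequent destination */
    else if (rx == 0x2000u) return 0x9400u;
    else if (rx == 0x3000u) return 0x9800u;
    else                    return hash_lookup(rx);   /* do it the slow way */
}
```

A correctly predicted destination costs a few compares and one direct branch; only the uncommon destinations pay for the table lookup.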
Dynamic Translation Issues
Tracking the source PC
SPC used by the emulation manager and interpreter
Handle self-modifying code
programs that modify (perform stores to) code at runtime
Handle self-referencing code
programs that perform loads from the source code
Provide precise traps
provide precise source state at traps and exceptions
Same ISA Emulation
Same source and target ISAs
Applications
simulation
OS call emulation
program shepherding
performance optimization
Instruction Set Issues
Register architectures
register mappings, reservation of special registers
Condition codes
lazy evaluation as needed
Data formats and arithmetic
floating point
decimal
MMX
Address resolution
byte vs word addressing
Data Alignment
natural vs arbitrary
Byte order
big/little endian
Register Architectures
GPRs of the target ISA are used for
holding source ISA GPRs
holding source ISA special-purpose registers
pointing to register context block and memory image
holding intermediate emulator values
Issues
target ISA registers < source ISA registers
prioritizing the use of target ISA registers
Condition Codes
Condition codes are not used uniformly
IA-32 ISA sets CC implicitly
SPARC and PowerPC set CC explicitly
MIPS ISA does not use CC
Neither ISA uses CC
nothing to do
Source ISA does not use CC, target ISA does
easy; additional instructions to generate CC values
Condition Codes (cont)
Source ISA has explicit CC, target ISA no CC
trivial emulation of CC required
Source ISA has implicit CC, target ISA no CC
very difficult and time consuming to emulate
CC emulation may be more expensive than instruction emulation
Condition Codes (cont)
Lazy evaluation
CC are seldom used
only generate CC if required
store the operands and the operation that set each condition code
Optimizations can also analyze code to detect cases where the generated CC will never be used
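Lazy evaluation can be sketched by saving the operands and the operation on every CC-setting instruction and computing a flag only when it is read. The struct lazy_cc layout, the two-operation set, and the function names are assumptions; the carry definitions follow x86 (carry out of an add, borrow on a subtract).

```c
#include <assert.h>
#include <stdint.h>

enum cc_op { CC_ADD, CC_SUB };       /* hypothetical set of CC-setting operations */

/* Saved inputs of the most recent CC-setting instruction. */
struct lazy_cc {
    enum cc_op op;
    uint32_t a, b;
};

static int flag_evaluations;         /* counts how often a flag was computed */

/* Emulated add: record operands and operation, but compute no flags. */
uint32_t emu_add(struct lazy_cc *cc, uint32_t a, uint32_t b) {
    cc->op = CC_ADD; cc->a = a; cc->b = b;
    return a + b;
}

/* Emulated subtract: same bookkeeping. */
uint32_t emu_sub(struct lazy_cc *cc, uint32_t a, uint32_t b) {
    cc->op = CC_SUB; cc->a = a; cc->b = b;
    return a - b;
}

/* Zero flag, generated on demand by redoing the saved operation. */
int get_zf(const struct lazy_cc *cc) {
    flag_evaluations++;
    uint32_t r = (cc->op == CC_ADD) ? cc->a + cc->b : cc->a - cc->b;
    return r == 0;
}

/* Carry flag, generated on demand. */
int get_cf(const struct lazy_cc *cc) {
    flag_evaluations++;
    assert(cc->op == CC_ADD || cc->op == CC_SUB);
    if (cc->op == CC_ADD)
        return (uint32_t)(cc->a + cc->b) < cc->a;   /* carry out of bit 31 */
    return cc->a < cc->b;                           /* borrow */
}
```

An instruction whose flags are overwritten before any branch reads them costs three stores and no flag computation at all.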
Lazy Condition Code Evaluation
add %ecx,%ebx
jmp label1
. . .
label1: jz target

PPC to x86 register mappings
R4   eax
R5   ebx
R6   ecx
. .
R24  scratch register used by emulation code
R25  condition code operand 1     ;registers used for
R26  condition code operand 2     ;lazy condition
R27  condition code operation     ;emulation code
R28  jump table base address
Lazy Condition Code Evaluation (2)
        mr    r25,r6         ;save operands
        mr    r26,r5         ;and opcode for
        li    r27,add        ;lazy condition code emulation
        add   r6,r6,r5       ;translation of add
        b     label1
        ...
label1: bl    genZF          ;branch and link genZF code
        beq   cr0,target     ;branch on condition flag
        ...
genZF:  add   r29,r28,r27    ;add opcode to jump table base
        mtctr r29            ;copy to counter register
        bctr                 ;branch via jump table
        ...
add:    add.  r24,r25,r26    ;perform PowerPC add, set cr0
        blr                  ;return
Data Formats and Arithmetic
Maintain compatibility of data transformations
Data formats and arithmetic operations are standardized
two's complement representation
IEEE floating point standard
basic logical/arithmetic operations are mostly present
Exceptions
IA-32 FP uses 80-bit intermediate results
PowerPC and HP PA have multiply-and-add (FMAC), which has a higher precision on intermediate values
integer divide vs. using FP divide to approximate
ISAs may have different immediate lengths
Memory Address Resolution
ISAs can access data items of different sizes
loads / stores of bytes, halfwords, full words, as opposed to only bytes and words
Emulating a less powerful ISA
no issue
Emulating a more powerful ISA
loads: load entire word, mask un-needed bits
stores: load entire word, insert data, store word
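The load and store sequences for a target that can only access whole words can be sketched as follows. The memory size and the little-endian numbering of bytes within a word are assumptions.

```c
#include <assert.h>
#include <stdint.h>

enum { MEM_WORDS = 16 };
static uint32_t mem[MEM_WORDS];       /* target memory: word accesses only */

/* Byte load on a word-only target: load entire word, mask un-needed bits. */
uint8_t load_byte(uint32_t addr) {
    assert(addr / 4 < (uint32_t)MEM_WORDS);
    uint32_t word = mem[addr / 4];
    unsigned shift = (addr % 4) * 8;  /* assumed little-endian byte numbering */
    return (uint8_t)((word >> shift) & 0xFFu);
}

/* Byte store on a word-only target: load entire word, insert data, store word. */
void store_byte(uint32_t addr, uint8_t value) {
    assert(addr / 4 < (uint32_t)MEM_WORDS);
    uint32_t word = mem[addr / 4];
    unsigned shift = (addr % 4) * 8;
    word &= ~(0xFFu << shift);                /* clear the byte being replaced */
    word |= (uint32_t)value << shift;         /* insert the new byte */
    mem[addr / 4] = word;
}
```

A store therefore costs a load plus a store on the target, which is why emulating the more powerful ISA is the expensive direction.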
Memory Data Alignment
Aligned memory access
word accesses must have the two low-order address bits 00, halfword accesses must have the lowest bit 0, etc.
Target ISA does not allow unaligned access
break up all accesses into byte accesses
ISAs provide supplementary instructions to simplify unaligned accesses
unaligned access traps, and then can be handled
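The first option, breaking an access into byte accesses, can be sketched as follows. The little-endian byte order and the function names are assumptions; the code makes no aligned word access, so it works for any byte address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a 32-bit little-endian word from any byte address, one byte at a time. */
uint32_t load_word_unaligned(const uint8_t *mem, size_t size, size_t addr) {
    assert(addr + 4 <= size);
    return (uint32_t)mem[addr] |
           ((uint32_t)mem[addr + 1] << 8) |
           ((uint32_t)mem[addr + 2] << 16) |
           ((uint32_t)mem[addr + 3] << 24);
}

/* Store a 32-bit little-endian word to any byte address, one byte at a time. */
void store_word_unaligned(uint8_t *mem, size_t size, size_t addr, uint32_t value) {
    assert(addr + 4 <= size);
    mem[addr]     = (uint8_t)(value & 0xFFu);
    mem[addr + 1] = (uint8_t)((value >> 8) & 0xFFu);
    mem[addr + 2] = (uint8_t)((value >> 16) & 0xFFu);
    mem[addr + 3] = (uint8_t)((value >> 24) & 0xFFu);
}
```

Doing this for every access is slow, which is why the other two options (supplementary instructions, or trapping only when an access turns out to be unaligned) are attractive when most accesses are aligned.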
Byte Order
Ordering of bytes within a word may differ
little endian and big endian
Target code must perform byte ordering
Guest data image is generally maintained in the same byte order as assumed by the source ISA
Emulation software modifies addresses when bytes within words are addressed
can be very inefficient
Some target ISAs may support both byte orders
e.g., MIPS, IA-64
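Two ways of handling a big-endian guest on a little-endian host can be sketched as follows. The host's word accesses are modeled explicitly on a byte array so that the sketch behaves the same on any machine; the function names and the choice of a big-endian guest are assumptions. Option 1 keeps the image in guest byte order and swaps on each word access; option 2 keeps words in host order and modifies the address when a single byte is accessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of a word. */
uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Model of an aligned word store on a little-endian host. */
void host_store_word(uint8_t *image, size_t addr, uint32_t v) {
    assert(addr % 4 == 0);
    image[addr]     = (uint8_t)(v & 0xFFu);
    image[addr + 1] = (uint8_t)((v >> 8) & 0xFFu);
    image[addr + 2] = (uint8_t)((v >> 16) & 0xFFu);
    image[addr + 3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Model of an aligned word load on a little-endian host. */
uint32_t host_load_word(const uint8_t *image, size_t addr) {
    assert(addr % 4 == 0);
    return (uint32_t)image[addr] | ((uint32_t)image[addr + 1] << 8) |
           ((uint32_t)image[addr + 2] << 16) | ((uint32_t)image[addr + 3] << 24);
}

/* Option 1: image kept in guest (big-endian) byte order; every guest word
   load is a host word load followed by a byte swap. */
uint32_t guest_load_word_swapped(const uint8_t *image, size_t addr) {
    return bswap32(host_load_word(image, addr));
}

/* Option 2: words kept in host order; a guest byte address is modified by
   inverting its two low-order bits. */
uint8_t guest_load_byte_xor(const uint8_t *image, size_t addr) {
    return image[addr ^ 3u];
}
```

Either way, extra target instructions are executed on memory accesses, which is the inefficiency the slide notes.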