PPC64LE dynarec: infrastructure and scaffolding (no opcodes yet) #3563

Open
runlevel5 wants to merge 2 commits into ptitSeb:main from runlevel5:ppc64le-phase0a

@runlevel5
Contributor

@runlevel5 runlevel5 commented Feb 26, 2026

Hey! Following up on the feedback from #3562 — totally fair point about keeping things smaller and more reviewable. I've broken the PPC64LE dynarec work into smaller incremental PRs. This is the first one: just the infrastructure/scaffolding, with zero opcode implementations.

What's in here

All the plumbing needed to get the PPC64LE dynarec backend compiling, linking, and running — but every single opcode falls through to DEFAULT (interpreter fallback). Nothing is natively recompiled yet.

Infrastructure:

  • PPC64LE instruction emitter (ppc64le_emitter.h) — VSX-based register mapping
  • Assembly routines: prolog, epilog, next-block dispatch, native atomic locks
  • Architecture helpers, constants, printer/disassembler
  • CPU feature detection (POWER9 minimum, ISA 3.0, crypto, DARN)
  • Pass 0-3 headers, private types, function/mapping headers
  • Helper macros and functions for address calculation, FPU/SSE/MMX/AVX register caching, flag management
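
As a rough illustration of the feature-detection bullet above, the sketch below decodes the AT_HWCAP2 auxiliary-vector word that Linux exposes on PowerPC. The bit values mirror the kernel's PPC_FEATURE2_* definitions; the ppc_features_t type and the decode_hwcap2 function are hypothetical names made up for this example and are not taken from the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* AT_HWCAP2 bits as defined by the Linux kernel for powerpc.
 * They are redefined here so the sketch compiles on any host. */
#ifndef PPC_FEATURE2_VEC_CRYPTO
#define PPC_FEATURE2_VEC_CRYPTO 0x02000000u
#endif
#ifndef PPC_FEATURE2_ARCH_3_00
#define PPC_FEATURE2_ARCH_3_00  0x00800000u
#endif
#ifndef PPC_FEATURE2_DARN
#define PPC_FEATURE2_DARN       0x00200000u
#endif

/* Hypothetical result type (an assumption, not the PR's structure). */
typedef struct {
    bool isa_3_00; /* ISA 3.0, i.e. the POWER9 baseline */
    bool crypto;   /* vector crypto instructions */
    bool darn;     /* DARN random-number instruction */
} ppc_features_t;

/* Decode a raw AT_HWCAP2 word into individual feature flags. */
static ppc_features_t decode_hwcap2(uint64_t hwcap2)
{
    ppc_features_t f;
    f.isa_3_00 = (hwcap2 & PPC_FEATURE2_ARCH_3_00) != 0;
    f.crypto   = (hwcap2 & PPC_FEATURE2_VEC_CRYPTO) != 0;
    f.darn     = (hwcap2 & PPC_FEATURE2_DARN) != 0;
    return f;
}
```

On a PPC64LE Linux host the input would typically come from getauxval(AT_HWCAP2) in <sys/auxv.h>; keeping the decoder separate from that call lets it be checked on any machine.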

Shared file changes (minimal):

  • || defined(PPC64LE) additions in dynarec dispatch, signal handling, host detection
  • Build system integration (CMakeLists.txt, CI workflow)
  • Static build fix: -fno-stack-protector for PPC64LE (glibc uses TLS-based stack protector via r13 instead of __stack_chk_guard)

All 29 opcode table files are present as DEFAULT stubs — the switch statements just have default: DEFAULT;. This way the infrastructure is complete and self-contained, ready for opcode implementations to land on top.
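
To make the "DEFAULT stub" idea concrete, here is a minimal sketch of what such a table could look like. It is only an illustration: emit_result_t, stub_opcode_table and this one-line DEFAULT are hypothetical stand-ins, and the real DEFAULT macro in the dynarec pass headers does more than return a value.

```c
#include <stdint.h>

/* Hypothetical outcome of trying to recompile one x86 opcode. */
typedef enum {
    EMIT_NATIVE,         /* opcode was translated to host code */
    EMIT_INTERP_FALLBACK /* opcode is left to the interpreter */
} emit_result_t;

/* Simplified stand-in for the dynarec's DEFAULT macro (assumption). */
#define DEFAULT return EMIT_INTERP_FALLBACK

/* A stub opcode table: no native cases yet, so every opcode hits default. */
static emit_result_t stub_opcode_table(uint8_t opcode)
{
    switch (opcode) {
        /* Native cases (for example a simple MOV) would be added here
         * by later PRs, each one returning EMIT_NATIVE. */
        default:
            DEFAULT;
    }
}
```

With this shape, adding an opcode later only touches one case label, which keeps each follow-up change small and easy to bisect.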

Build & test

  • Builds cleanly with -DPPC64LE=1 on Fedora 43 ppc64le (POWER9, 64KB pages)
  • All test01-test10 pass (everything goes through interpreter fallback, as expected)

Stats

  • 79 files changed, ~13,270 insertions
  • ~49 PPC64LE-specific files in src/dynarec/ppc64le/
  • 27 shared file changes (all minimal arch dispatch additions)

What's next

I'll send follow-up PRs that layer on top of this one:

  1. Simple MOV instructions (just a handful of basic MOVs — small PR, easy to review)
  2. Remaining integer ops (ADD/SUB/CMP/etc + the emit helpers)
  3. FPU opcodes (x87 D8-DF)
  4. SSE/SSE2, LOCK prefix, etc.

Each one should be much more digestible to review. Let me know if this breakdown works or if you'd prefer it sliced differently!

@ptitSeb
Owner

ptitSeb commented Feb 26, 2026

Why would the --whole-archive flags be needed for the dynarec? What kind of issue did you have without them?

@runlevel5
Contributor Author

runlevel5 commented Feb 26, 2026

Why would the --whole-archive flags be needed for the dynarec? What kind of issue did you have without them?

I used to run into a linker issue, but for phase 0a I could not reproduce the errors. I will remove that change until the issue resurfaces in later phases.

EDIT: I've managed to find the errors in my build logs:

[ 36%] Built target native_pass0
[ 36%] Linking C static library libdynarec.a
[ 36%] Built target dynarec
[100%] Built target mainobj
[100%] Linking C executable box64
/usr/bin/ld: libdynarec.a: member libdynarec.a(dynarec_ppc64le_660f.c.o) in archive is not an object
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/box64.dir/build.make:970: box64] Error 1
make[1]: *** [CMakeFiles/Makefile2:552: CMakeFiles/box64.dir/all] Error 2
make: *** [Makefile:166: all] Error 2

* which was consuming ~18.5% of CPU due to cache pollution.
*
* This processes 8 bytes per iteration using simple multiply-xorshift
* mixing. For the typical block size (~48 bytes median), this means
Owner

Not sure where you got this 48-byte median, but in real-life apps and games, I'm pretty sure the typical dynablock covers much more than 48 bytes of x64 code.

Contributor Author

@runlevel5 runlevel5 Feb 26, 2026

I instrumented db_sizes with a histogram (using DYNAREC_HIST plus a separate increment-only rbtree that captures every block ever created, not just the ones surviving at exit) and tested on PPC64LE (POWER9) with real workloads. I can run more tests on other programs if you prefer:

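| Workload | Blocks | Mean | Median | p75 | p90 | p99 | Max |
|---|---|---|---|---|---|---|---|
| ioquake3 (3min) | 5,542 | 64.3 | 23 | 47 | 105 | 604 | 12,480 |
| UT99 server | 6,467 | 65.9 | 27 | 75 | 147 | 541 | 17,498 |
| bash | 903 | 60.8 | 23 | 57 | 158 | 466 | 1,243 |
| Unit tests | varies | 19-74 | 10-33 | 20-85 | 39-165 | 57-1,046 | 57-2,583 |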

You can try it yourself by making the following changes:

diff --git a/src/box64context.c b/src/box64context.c
index 859245b80..2d225451c 100644
--- a/src/box64context.c
+++ b/src/box64context.c
@@ -248,7 +248,7 @@ void freeALProcWrapper(box64context_t* context);
 void freeCUDAProcWrapper(box64context_t* context);

 #ifdef DYNAREC
-//#define DYNAREC_HIST    // uncomment to print dynablock size histogram at exit
+#define DYNAREC_HIST    // uncomment to print dynablock size histogram at exit
 #ifdef DYNAREC_HIST
 #define HIST_BUCKETS 12
 static const uintptr_t bucket_limits[HIST_BUCKETS] = {
@@ -279,6 +279,27 @@ static void db_size_walk_cb(uintptr_t start, uintptr_t end, uint64_t data, void*
     h->total_bytes += size * n;
     if (size > h->max_size) h->max_size = size;
 }
+// Percentile computation via sorted walk
+#define PERCENTILE_COUNT 5
+typedef struct {
+    uint64_t total_blocks;
+    uint64_t cumulative;
+    int pct_idx;
+    double pct_targets[PERCENTILE_COUNT];
+    uintptr_t pct_values[PERCENTILE_COUNT];
+} percentile_data_t;
+static void db_size_percentile_cb(uintptr_t start, uintptr_t end, uint64_t data, void* userdata) {
+    (void)end;
+    percentile_data_t* p = (percentile_data_t*)userdata;
+    uintptr_t size = start;
+    uint64_t n = data;
+    p->cumulative += n;
+    while (p->pct_idx < PERCENTILE_COUNT
+        && (double)p->cumulative / (double)p->total_blocks >= p->pct_targets[p->pct_idx]) {
+        p->pct_values[p->pct_idx] = size;
+        p->pct_idx++;
+    }
+}
 #endif
 #endif
 EXPORTDYN
@@ -385,11 +406,25 @@ void FreeBox64Context(box64context_t** context)
         if(hist.total_blocks) {
             printf_log(LOG_INFO, "BOX64 Dynarec block size histogram: %" PRIu64 " blocks, %" PRIu64 " total x86 bytes, max=%" PRIuPTR "\n",
                 hist.total_blocks, hist.total_bytes, hist.max_size);
+            printf_log(LOG_INFO, "  mean=%.1f bytes\n",
+                (double)hist.total_bytes / (double)hist.total_blocks);
             for(int i = 0; i < HIST_BUCKETS; i++) {
                 if(hist.count[i])
                     printf_log(LOG_INFO, "  %10s: %" PRIu64 " blocks (%.1f%%)\n",
                         bucket_labels[i], hist.count[i], hist.count[i] * 100.0 / hist.total_blocks);
             }
+            // Compute percentiles (p25, p50/median, p75, p90, p99)
+            percentile_data_t pdata = {0};
+            pdata.total_blocks = hist.total_blocks;
+            pdata.pct_targets[0] = 0.25;
+            pdata.pct_targets[1] = 0.50;
+            pdata.pct_targets[2] = 0.75;
+            pdata.pct_targets[3] = 0.90;
+            pdata.pct_targets[4] = 0.99;
+            rbtree_walk(ctx->db_sizes, db_size_percentile_cb, &pdata);
+            printf_log(LOG_INFO, "  p25=%" PRIuPTR " p50(median)=%" PRIuPTR " p75=%" PRIuPTR " p90=%" PRIuPTR " p99=%" PRIuPTR "\n",
+                pdata.pct_values[0], pdata.pct_values[1], pdata.pct_values[2],
+                pdata.pct_values[3], pdata.pct_values[4]);
         }
     }
 #endif

You can see the actual median is ~23-27 bytes, and more than 65% of blocks are smaller than 32 bytes. That said, a long tail of large blocks pulls the mean up to ~65 bytes. I am all ears on how we could find the sweet spot, given the known trade-offs of larger blocks (for example, a larger invalidation blast radius or a more expensive hash check).
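
For readers who want to check the percentile logic without building box64, the cumulative walk from the diff can be sketched on a plain sorted array. This is a simplified stand-in under stated assumptions: size_count_t and size_percentile are hypothetical names, the array replaces the rbtree, and the function resolves one target per call instead of tracking five targets in a single pass as the diff does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat stand-in for one rbtree node:
 * key = block size in bytes, data = number of blocks of that size. */
typedef struct {
    uintptr_t size;
    uint64_t  count;
} size_count_t;

/* Return the smallest size whose cumulative share of blocks reaches
 * `target` (0 < target <= 1). `entries` must be sorted by ascending size. */
static uintptr_t size_percentile(const size_count_t* entries, size_t n, double target)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += entries[i].count;
    assert(total > 0); /* at least one block is required */

    uint64_t cumulative = 0;
    for (size_t i = 0; i < n; i++) {
        cumulative += entries[i].count;
        if ((double)cumulative / (double)total >= target)
            return entries[i].size;
    }
    return entries[n - 1].size; /* reached only through rounding */
}
```

For a toy input of five 8-byte blocks, three 16-byte blocks, one 64-byte block and one 512-byte block, the median comes out at 8 bytes while the mean is 66.4 bytes, which shows how a few large blocks can lift the mean well above the median.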

Collaborator

@ksco ksco left a comment

Sorry, but again: I'd expect the first PR to only implement a few simple MOV instructions, along with only the necessary infrastructure code required for those instructions.

It would be a disaster for a human to review a 13000+ LoC PR, full of unused code.

@ptitSeb
Owner

ptitSeb commented Feb 26, 2026

The "small steps" approach is how we did it with all 3 dynarecs. It ensures proper debugging and the ability to bisect when you find something later in the development process.

@runlevel5
Contributor Author

Sorry, but again: I'd expect the first PR to only implement a few simple MOV instructions, along with only the necessary infrastructure code required for those instructions.

@ksco thanks for the feedback. I humbly think we need this PR to lay the foundation for those few simple MOV instructions. If you are not happy with the size, I could break this PR into smaller chunks. Most of the files are just stubs or fall back to the interpreter, so I thought it would not cause much inconvenience for reviewers.

@runlevel5 runlevel5 force-pushed the ppc64le-phase0a branch 2 times, most recently from 696a1b9 to 5e2b1aa on February 27, 2026 11:08
@runlevel5
Contributor Author

@MPC7500 @classilla wondering if you two could help review this PR for me too? Many thanks

@ptitSeb
Owner

ptitSeb commented Feb 28, 2026

Thanks for keeping it up-to-date.

The issue for me is that there is a lot of reviewing required to understand all the scaffolding here, and it's especially hard because a lot of the concepts are not used yet. For example, everything about x87/MMX/SSE and more...

Add the complete dynarec infrastructure for PPC64LE (little-endian,
POWER9 minimum). All opcode tables are present as DEFAULT stubs,
falling back to the interpreter. No opcodes are natively implemented
yet -- this PR is purely the scaffolding.

Infrastructure includes:
- PPC64LE instruction emitter (ppc64le_emitter.h) with VSX register mapping
- Assembly routines: prolog, epilog, next-block dispatch, atomic locks
- Architecture helpers, constants, printer/disassembler
- CPU feature detection (POWER9+, ISA 3.0, crypto, DARN)
- Pass 0-3 headers, private types, function/mapping headers
- Helper macros and functions for address calculation, FPU/SSE/MMX/AVX
  register caching, flag management

Shared file changes (minimal):
- Architecture dispatch additions (`|| defined(PPC64LE)`) in dynarec
  infrastructure, signal handling, host detection
- Build system integration (CMakeLists.txt, CI workflow)
- Static build fix: -fno-stack-protector for PPC64LE (glibc uses
  TLS-based stack protector via r13, not __stack_chk_guard)
- --whole-archive linker flag for dynarec static library

Builds cleanly with -DPPC64LE=1. All test01-test10 pass (via
interpreter fallback). Tested on Fedora 43, POWER9, 64KB pages.

Opcode implementations will follow in subsequent PRs.
@runlevel5
Contributor Author

The issue for me is that there is a lot of reviewing required to understand all the scaffolding here, and it's especially hard because a lot of the concepts are not used yet. For example, everything about x87/MMX/SSE and more...

I will break it down into even smaller PRs.

@ptitSeb
Owner

ptitSeb commented Feb 28, 2026

Can you start with the CI build of PPC64LE? That seems useful as a first step :)
