PPC64LE dynarec: infrastructure and scaffolding (no opcodes yet)#3563
runlevel5 wants to merge 2 commits into ptitSeb:main
Why would the |
I used to run into the linker issue, but for phase 0a I could not reproduce the linker errors, so I will remove that change until it re-emerges in later phases. EDIT: I've managed to find the errors in my build logs.
479f12d to
bccca59
Compare
| * which was consuming ~18.5% of CPU due to cache pollution. | ||
| * | ||
| * This processes 8 bytes per iteration using simple multiply-xorshift | ||
| * mixing. For the typical block size (~48 bytes median), this means |
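For readers unfamiliar with the scheme named in the quoted comment, here is a minimal sketch of a multiply-xorshift hash that consumes 8 bytes per iteration. It is not the function from this PR: the name `mix_hash64`, the seed, and the multiplier are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: mix_hash64, the seed and the multiplier are
 * hypothetical and are not taken from the PR under review. */
static uint64_t mix_hash64(const void* data, size_t len)
{
    const uint8_t* p = (const uint8_t*)data;
    uint64_t h = 0x9E3779B97F4A7C15ull ^ (uint64_t)len; /* arbitrary seed mixed with length */

    /* Main loop: one 64-bit word (8 bytes) per iteration. */
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);            /* alignment-safe load */
        h ^= w;
        h *= 0xFF51AFD7ED558CCDull;  /* multiply step (odd constant) */
        h ^= h >> 32;                /* xorshift step */
        p += 8;
        len -= 8;
    }

    /* Tail: fold in the remaining 0-7 bytes, zero-padded. */
    if (len) {
        uint64_t w = 0;
        memcpy(&w, p, len);
        h ^= w;
        h *= 0xFF51AFD7ED558CCDull;
        h ^= h >> 32;
    }
    return h;
}
```

With this shape, a 48-byte block takes 6 loop iterations and no tail step, while a 23-byte block takes 2 iterations plus one tail step.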
Not sure where you got this 48-byte median, but in real-life apps and games, I'm pretty sure the typical dynablock covers much more than 48 x64 bytes.
I instrumented db_sizes with a histogram (using DYNAREC_HIST plus a separate increment-only rbtree to capture all blocks ever created, not just those surviving at exit) and tested on PPC64LE (POWER9) with real workloads (I could run more tests on more programs if you prefer):
| Workload | Blocks | Mean | Median | p75 | p90 | p99 | Max |
|---|---|---|---|---|---|---|---|
| ioquake3 (3min) | 5,542 | 64.3 | 23 | 47 | 105 | 604 | 12,480 |
| UT99 server | 6,467 | 65.9 | 27 | 75 | 147 | 541 | 17,498 |
| bash | 903 | 60.8 | 23 | 57 | 158 | 466 | 1,243 |
| Unit tests | varies | 19-74 | 10-33 | 20-85 | 39-165 | 57-1,046 | 57-2,583 |
You can try it for yourself by making the following changes:
diff --git a/src/box64context.c b/src/box64context.c
index 859245b80..2d225451c 100644
--- a/src/box64context.c
+++ b/src/box64context.c
@@ -248,7 +248,7 @@ void freeALProcWrapper(box64context_t* context);
void freeCUDAProcWrapper(box64context_t* context);
#ifdef DYNAREC
-//#define DYNAREC_HIST // uncomment to print dynablock size histogram at exit
+#define DYNAREC_HIST // uncomment to print dynablock size histogram at exit
#ifdef DYNAREC_HIST
#define HIST_BUCKETS 12
static const uintptr_t bucket_limits[HIST_BUCKETS] = {
@@ -279,6 +279,27 @@ static void db_size_walk_cb(uintptr_t start, uintptr_t end, uint64_t data, void*
h->total_bytes += size * n;
if (size > h->max_size) h->max_size = size;
}
+// Percentile computation via sorted walk
+#define PERCENTILE_COUNT 5
+typedef struct {
+ uint64_t total_blocks;
+ uint64_t cumulative;
+ int pct_idx;
+ double pct_targets[PERCENTILE_COUNT];
+ uintptr_t pct_values[PERCENTILE_COUNT];
+} percentile_data_t;
+static void db_size_percentile_cb(uintptr_t start, uintptr_t end, uint64_t data, void* userdata) {
+ (void)end;
+ percentile_data_t* p = (percentile_data_t*)userdata;
+ uintptr_t size = start;
+ uint64_t n = data;
+ p->cumulative += n;
+ while (p->pct_idx < PERCENTILE_COUNT
+ && (double)p->cumulative / (double)p->total_blocks >= p->pct_targets[p->pct_idx]) {
+ p->pct_values[p->pct_idx] = size;
+ p->pct_idx++;
+ }
+}
#endif
#endif
EXPORTDYN
@@ -385,11 +406,25 @@ void FreeBox64Context(box64context_t** context)
if(hist.total_blocks) {
printf_log(LOG_INFO, "BOX64 Dynarec block size histogram: %" PRIu64 " blocks, %" PRIu64 " total x86 bytes, max=%" PRIuPTR "\n",
hist.total_blocks, hist.total_bytes, hist.max_size);
+ printf_log(LOG_INFO, " mean=%.1f bytes\n",
+ (double)hist.total_bytes / (double)hist.total_blocks);
for(int i = 0; i < HIST_BUCKETS; i++) {
if(hist.count[i])
printf_log(LOG_INFO, " %10s: %" PRIu64 " blocks (%.1f%%)\n",
bucket_labels[i], hist.count[i], hist.count[i] * 100.0 / hist.total_blocks);
}
+ // Compute percentiles (p25, p50/median, p75, p90, p99)
+ percentile_data_t pdata = {0};
+ pdata.total_blocks = hist.total_blocks;
+ pdata.pct_targets[0] = 0.25;
+ pdata.pct_targets[1] = 0.50;
+ pdata.pct_targets[2] = 0.75;
+ pdata.pct_targets[3] = 0.90;
+ pdata.pct_targets[4] = 0.99;
+ rbtree_walk(ctx->db_sizes, db_size_percentile_cb, &pdata);
+ printf_log(LOG_INFO, " p25=%" PRIuPTR " p50(median)=%" PRIuPTR " p75=%" PRIuPTR " p90=%" PRIuPTR " p99=%" PRIuPTR "\n",
+ pdata.pct_values[0], pdata.pct_values[1], pdata.pct_values[2],
+ pdata.pct_values[3], pdata.pct_values[4]);
}
}
#endif
You can see the actual median is ~23-27 bytes. More than 65% of blocks are smaller than 32 bytes. Having said that, the long tail of large blocks pulls the mean up to about 65 bytes. I am all ears on how we could find the sweet spot, since larger block sizes have known pros and cons (for example, a larger invalidation blast radius or a more expensive hash check).
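To make the percentile walk in the diff easier to follow outside of box64, here is a standalone sketch of the same idea: visit (size, count) pairs in ascending size order and report the first size at which the cumulative share of blocks reaches the target. The `size_bucket_t` type and the `percentile_from_buckets` function are hypothetical names for illustration and are not part of box64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one rbtree node: a block size and how many
 * blocks of that size were created. */
typedef struct {
    uintptr_t size;
    uint64_t  count;
} size_bucket_t;

/* Buckets must be sorted by ascending size, as a sorted tree walk would
 * deliver them. Returns the smallest size whose cumulative share of all
 * blocks is >= target (for example 0.50 for the median). */
static uintptr_t percentile_from_buckets(const size_bucket_t* b, size_t n, double target)
{
    uint64_t total = 0;
    uint64_t cumulative = 0;

    for (size_t i = 0; i < n; i++)
        total += b[i].count;
    if (total == 0)
        return 0; /* no blocks recorded */

    for (size_t i = 0; i < n; i++) {
        cumulative += b[i].count;
        if ((double)cumulative / (double)total >= target)
            return b[i].size;
    }
    return b[n - 1].size; /* only reached if target > 1.0 */
}
```

For a made-up distribution of five 8-byte blocks, three 16-byte blocks, one 64-byte block and one 512-byte block, the median comes out at 8 bytes while the mean is 66.4 bytes, which is the same kind of gap between median and mean that the table shows.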
ksco
left a comment
Sorry, but again: I'd expect the first PR to only implement a few simple MOV instructions, along with only the necessary infrastructure code required for those instructions.
It would be a disaster for a human to review a 13000+ LoC PR, full of unused code.
The "small steps" approach is how we did all 3 dynarecs. It ensures proper debugging and the ability to bisect when you find something later in the development process.
@ksco thanks for the feedback. I humbly think we need this PR to prepare the foundation for those few simple MOV instructions. If you are not happy with the size, I could break this PR into smaller chunks. Most of them are just stubs or fall back to the interpreter, so I thought it would not cause much inconvenience for reviewers.
696a1b9 to
5e2b1aa
Compare
@MPC7500 @classilla wondering if you two could help review this PR for me too? Many thanks
5e2b1aa to
edc3cbf
Compare
|
Thanks for keeping it up-to-date. The issue for me is that there is a lot of reviewing required to understand all the scaffolding here, and it's especially hard because a lot of the concepts are not used yet. For example, everything about x87/mmx/SSE and more...
Add the complete dynarec infrastructure for PPC64LE (little-endian, POWER9 minimum). All opcode tables are present as DEFAULT stubs, falling back to the interpreter. No opcodes are natively implemented yet -- this PR is purely the scaffolding.

Infrastructure includes:
- PPC64LE instruction emitter (ppc64le_emitter.h) with VSX register mapping
- Assembly routines: prolog, epilog, next-block dispatch, atomic locks
- Architecture helpers, constants, printer/disassembler
- CPU feature detection (POWER9+, ISA 3.0, crypto, DARN)
- Pass 0-3 headers, private types, function/mapping headers
- Helper macros and functions for address calculation, FPU/SSE/MMX/AVX register caching, flag management

Shared file changes (minimal):
- Architecture dispatch additions (`|| defined(PPC64LE)`) in dynarec infrastructure, signal handling, host detection
- Build system integration (CMakeLists.txt, CI workflow)
- Static build fix: -fno-stack-protector for PPC64LE (glibc uses TLS-based stack protector via r13, not __stack_chk_guard)
- --whole-archive linker flag for dynarec static library

Builds cleanly with -DPPC64LE=1. All test01-test10 pass (via interpreter fallback). Tested on Fedora 43, POWER9, 64KB pages.

Opcode implementations will follow in subsequent PRs.
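As a rough illustration of the "CPU feature detection (POWER9+, ISA 3.0, crypto, DARN)" item, the sketch below decodes a Linux `AT_HWCAP2` bitmask into the three features named in the commit message. This is not the detection code from the PR: `ppc64le_features_t` and `decode_hwcap2` are hypothetical names, and the fallback bit values are assumed to match the `PPC_FEATURE2_*` definitions in the Linux kernel's `asm/cputable.h`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fallback definitions so the sketch also compiles on non-PowerPC hosts.
 * Values assumed to mirror the Linux kernel's asm/cputable.h. */
#ifndef PPC_FEATURE2_ARCH_3_00
#define PPC_FEATURE2_ARCH_3_00  0x00800000 /* ISA 3.0 (POWER9) */
#endif
#ifndef PPC_FEATURE2_VEC_CRYPTO
#define PPC_FEATURE2_VEC_CRYPTO 0x02000000 /* vector crypto instructions */
#endif
#ifndef PPC_FEATURE2_DARN
#define PPC_FEATURE2_DARN       0x00200000 /* darn random-number instruction */
#endif

/* Hypothetical result type, not a box64 structure. */
typedef struct {
    bool isa_3_00;
    bool crypto;
    bool darn;
} ppc64le_features_t;

/* Decode an AT_HWCAP2 value into individual feature flags. */
static ppc64le_features_t decode_hwcap2(unsigned long hwcap2)
{
    ppc64le_features_t f;
    f.isa_3_00 = (hwcap2 & PPC_FEATURE2_ARCH_3_00) != 0;
    f.crypto   = (hwcap2 & PPC_FEATURE2_VEC_CRYPTO) != 0;
    f.darn     = (hwcap2 & PPC_FEATURE2_DARN) != 0;
    return f;
}
```

On a Linux ppc64le host, the input would typically come from `getauxval(AT_HWCAP2)` in `<sys/auxv.h>`; keeping the decoding separate from the query makes the logic testable on any build machine.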
edc3cbf to
929ad2a
Compare
I will break it down into even smaller PRs.
Can you start with the CI build of PPC64LE? That seems useful as a first step :)
Hey! Following up on the feedback from #3562 — totally fair point about keeping things smaller and more reviewable. I've broken the PPC64LE dynarec work into smaller incremental PRs. This is the first one: just the infrastructure/scaffolding, with zero opcode implementations.
What's in here
All the plumbing needed to get the PPC64LE dynarec backend compiling, linking, and running, but every single opcode falls through to `DEFAULT` (interpreter fallback). Nothing is natively recompiled yet.

Infrastructure:
- PPC64LE instruction emitter (`ppc64le_emitter.h`): VSX-based register mapping

Shared file changes (minimal):
- `|| defined(PPC64LE)` additions in dynarec dispatch, signal handling, host detection
- `-fno-stack-protector` for PPC64LE (glibc uses TLS-based stack protector via r13 instead of `__stack_chk_guard`)

All 29 opcode table files are present as DEFAULT stubs: the switch statements just have `default: DEFAULT;`. This way the infrastructure is complete and self-contained, ready for opcode implementations to land on top.

Build & test
- `-DPPC64LE=1` on Fedora 43 ppc64le (POWER9, 64KB pages)

Stats
- `src/dynarec/ppc64le/`

What's next
I'll send follow-up PRs that layer on top of this one.
Each one should be much more digestible to review. Let me know if this breakdown works or if you'd prefer it sliced differently!
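The "every opcode falls through to DEFAULT" arrangement described above can be pictured with a small sketch. It only mimics the shape of a stub opcode table: `op_result_t`, `translate_opcode_stub`, and this simplified `DEFAULT` macro are assumptions for illustration, and the real box64 macro does more than set a flag.

```c
#include <stdint.h>

/* Hypothetical outcome of trying to translate one x86 opcode. */
typedef enum {
    OP_NATIVE,      /* a native PPC64LE sequence was emitted */
    OP_INTERPRETER  /* handed back to the interpreter */
} op_result_t;

/* Simplified stand-in for the dynarec's DEFAULT macro (assumption):
 * here it only records that the interpreter must handle the opcode. */
#define DEFAULT do { result = OP_INTERPRETER; } while (0)

/* Stub opcode table: no native cases yet, so every opcode takes the
 * default branch. A later PR could add, for example, "case 0x89:" for
 * MOV Ed, Gd above the default label. */
static op_result_t translate_opcode_stub(uint8_t opcode)
{
    op_result_t result = OP_NATIVE;

    switch (opcode) {
        default:
            DEFAULT;
            break;
    }
    return result;
}
```

Because the stub table already compiles and dispatches, each follow-up PR only needs to add `case` labels, and any opcode that is still missing keeps working through the interpreter path.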