Implement new snapshotting system with support for elastic memory growth via `mmap()` #277

mcevoypeter · 2023-03-03T22:59:18Z

This PR replaces the existing snapshot system in pkg/noun/events.c with a conceptually similar but new implementation. This new implementation removes the need to mmap() a large chunk (i.e. 2GB) of memory when a ship launches. Instead, it creates a file-backed memory mapping for the snapshot and then lazily maps new pages that lie outside of the snapshot when necessary. When a new snapshot is captured, all anonymous mappings are removed and the snapshot files are remapped, which minimizes the Urbit runtime's memory footprint. This work draws inspiration from @joemfb's work in urbit/urbit#6063 and urbit/urbit#6152.
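To illustrate the file-backed half of that description, here is a minimal sketch, assuming a copy-on-write (`MAP_PRIVATE`) mapping. The helper names `make_snapshot_file`, `map_snapshot`, and `byte_on_disk` are hypothetical and are not part of pkg/pma; the sketch only shows that pages are served from the file while writes stay in private copies until something explicitly writes them back.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Create an unlinked temporary file holding `len` bytes of `data`. It stands
// in for a snapshot file on disk. Returns an open descriptor, or -1 on failure.
static int
make_snapshot_file(const void *data, size_t len)
{
    char path[] = "/tmp/snapshot-sketch-XXXXXX";
    int  fd     = mkstemp(path);
    if (fd == -1) {
        return -1;
    }
    unlink(path); // the open descriptor keeps the file alive
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

// Map the first `len` bytes of `fd` copy-on-write. Pages are read from the
// file on first access; writes land in private copies, not in the file.
static char *
map_snapshot(int fd, size_t len)
{
    assert(fd >= 0 && len > 0);
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    return base == MAP_FAILED ? NULL : (char *)base;
}

// Read one byte straight from the file, bypassing the mapping.
static int
byte_on_disk(int fd, off_t off)
{
    unsigned char c;
    return pread(fd, &c, 1, off) == 1 ? c : -1;
}
```

After `map_snapshot()`, modifying a byte through the returned pointer changes what the process sees but not what `byte_on_disk()` returns, which is why a separate snapshot step is needed to persist changes.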

In addition to the functionality described above, this PR also makes the snapshotting system a largely orthogonal component relative to the rest of the runtime. As much as possible, the implementation tries to be simple to read and understand. It's also unit tested. Finally, as an added benefit, a SIGSEGV raised by an access to a non-loom address now generates a segfault as expected rather than a complaint of "address out of loom" (thanks to sigsegv_init() and sigsegv_dispatch()).

Remaining tasks:

Benchmark to determine performance differences compared to old snapshot system.
Document the interfaces introduced in pkg/pma.
Remove pkg/noun/events.c and pkg/noun/events.h.

Resolves #188.

Testing

Functionality

All unit tests pass, including the new pkg/pma/pma_tests:

$ bazel test --config=test --build_tests_only ...

Also booted comets on macos-aarch64 and linux-x86_64 using the 1.20, 1.21, and 1.22 pills and successfully sent a DM to my planet. For example, testing the 1.21 pill:

$ bazel build :urbit
$ ./urbit -c some-comet -u https://bootstrap.urbit.org/urbit-v1.21.pill

Also booted a comet using vere-v1.18, exited, and then successfully ran that comet and sent a DM using the binary built from the tip of this branch:

$ git checkout vere-v1.18
$ bazel build :urbit
$ ./urbit -c some-comet -u https://bootstrap.urbit.org/urbit-v1.18.pill
$ git checkout i/188
$ bazel build :urbit
$ ./urbit some-comet

Performance

Ran pkg/vere/benchmarks.c 20 times on both i/188 (commit SHA 67d596c) and develop (commit SHA dee0cef) and computed the following averages:

| Benchmark | `i/188` (ms) | `develop` (ms) |
| --- | --- | --- |
| jam og | 24.9 | 24.55 |
| jam xeno | 15.55 | 14.4 |
| jam cons | 188.1 | 183.6 |
| jam cons with | 5.15 | 7.35 |
| cue og | 145.4 | 149.35 |
| cue atom | 30.4 | 31.15 |
| cue xeno | 31.1 | 30.35 |
| cue xeno with | 14.8 | 15.0 |
| cue test | 21.5 | 21.4 |
| cue test with | 6.6 | 6.85 |
| cue cons | 352.5 | 345.6 |
| cue re-cons | 542.45 | 557.25 |
| cue virtual og | 130.4 | 129.75 |
| cue virtual atom | 20.15 | 19.5 |

mcevoypeter · 2023-03-04T00:40:02Z

The linux-aarch64 job is failing because pma_tests are exhausting the disk space on the (small) linux-aarch64 VM. I'll debug further first thing Monday.

ashelkovnykov · 2023-03-04T12:48:05Z

@mcevoypeter I don't think this is related to your changes. #255 is now hitting this too, even though it hasn't touched anything that ought to cause such errors. I tried adding a commit that removes pre-installed libs from the GitHub runner image, but it didn't help. I also tried using this GitHub workflow action, but it's not on the approved list and so caused the workflow to fail.

The failures might have something to do with the outage GitHub experienced yesterday?

mcevoypeter · 2023-03-06T15:54:01Z

@ashelkovnykov The linux-aarch64 runner is self-hosted, and it turns out the VM it's running on has a default tmpfs size of 1GB. The PMA tests create a 1GB backing file in /tmp, which exhausts tmpfs, causing the test to fail without cleaning up the file, leaving tmpfs completely full. This then causes other jobs run on the runner to fail because parts of the build process write to tmpfs as well. The solution was to increase the size of tmpfs to 2GB (see https://unix.stackexchange.com/questions/442944/tmpfs-usage-and-resizing).
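One way a test could avoid leaving a large backing file behind when it aborts is to unlink the file right after creating it. The sketch below is only an illustration of that idea under assumed names (`open_scratch_file`, `link_count`, `file_size`, and the `/tmp/pma-scratch-` prefix are hypothetical); it is not what pma_tests does.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Create a scratch backing file of `len` bytes under /tmp and unlink it
// immediately. The space is released as soon as the last descriptor is
// closed, including when the process aborts. Returns the descriptor, or -1.
static int
open_scratch_file(off_t len)
{
    assert(len >= 0);
    char path[] = "/tmp/pma-scratch-XXXXXX"; // hypothetical name
    int  fd     = mkstemp(path);
    if (fd == -1) {
        return -1;
    }
    if (unlink(path) == -1 || ftruncate(fd, len) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

// Number of directory entries that still reference the file.
static long
link_count(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1;
}

// Logical size of the file in bytes.
static off_t
file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

Because the file has no name once `open_scratch_file()` returns, a crashed test cannot leave tmpfs full for the next job on the runner.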

matthew-levan · 2023-03-07T22:44:17Z

Hi Peter, this is neat! I've been working on fixing the bugs we have in our current snapshot system (namely the strange page issue [we merged a PR for this] and when we crash after a guard page re-centering error [still WIP]).

If you see #234 you can see that a reliable reproduction for the guard page crash is to boot a ship with 1.10 (%zuse 418) and then try and upgrade it to %zuse 415 with the latest binary.

I went ahead and ran the same process with a fakezod using a binary built with this branch i/188, and it crashed unfortunately. I got the following trace:

...
>   arvo: +load next
lull: ~lasbex-sonseg
zuse: ~ligfep-radfyn
vane: %ames: ~hatmeg-lavpun
ames: larva reload
vane: %behn: ~sogmut-tonmus
vane: %clay: ~fildux-polhes
           allocate: reclaim: half of 2139 entries
allocate: reclaim: half of 1069 entries
allocate: reclaim: half of 534 entries
allocate: reclaim: half of 267 entries
allocate: reclaim: half of 133 entries
allocate: reclaim: half of 66 entries
allocate: reclaim: half of 33 entries
allocate: reclaim: half of 16 entries
allocate: reclaim: half of 8 entries
allocate: reclaim: half of 4 entries
allocate: reclaim: half of 2 entries
allocate: reclaim: half of 1 entries
allocate: reclaim: memo cache: empty

bail: meme
allocate: reclaim: half of 2064 entries
pier: serf unexpectedly shut down

matthew@domus:~/ships/dev$

Is this PR meant to fix this specific issue? Perhaps the issue isn't related to the snapshot system at all? I haven't yet determined the root cause of this crash in particular, but I'm also still deep in my inspection of backtraces. What do you think?

mcevoypeter · 2023-03-08T17:01:11Z

@matthew-levan this PR does not address #234. Instead, it aims to reduce the memory pressure the runtime places on the host system by mmap()ing a page at a time on demand rather than mmap()ing the entire loom up front. I'm not surprised to see it fail on memory exhaustion.
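A rough sketch of what mapping a single page on demand can look like follows. The names `page_size`, `pick_base`, and `map_page` are hypothetical; the PR's own code uses a fixed base address and its own bookkeeping, which this sketch does not attempt to reproduce.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t
page_size(void)
{
    return (size_t)sysconf(_SC_PAGESIZE);
}

// Find an unused, page-aligned address range of `len` bytes by mapping it
// and immediately unmapping it again. Nothing stays mapped afterwards.
static char *
pick_base(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;
    }
    munmap(p, len);
    return (char *)p;
}

// Map only the page with index `pg_idx` relative to `base`. MAP_FIXED
// replaces whatever is mapped at that address, so the caller must own it.
static int
map_page(char *base, size_t pg_idx)
{
    assert(base != NULL && (uintptr_t)base % page_size() == 0);
    void *want = base + pg_idx * page_size();
    void *got  = mmap(want,
                      page_size(),
                      PROT_READ | PROT_WRITE,
                      MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS,
                      -1,
                      0);
    return got == want ? 0 : -1;
}
```

Only the pages that are actually touched end up mapped, so the host is never asked for the whole arena at once.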

mcevoypeter · 2023-03-10T21:34:10Z

The PR description previously mentioned that the guard page had been removed temporarily. That's no longer true; I reimplemented the guard page today.

matthew-levan · 2023-03-13T19:24:48Z

@mcevoypeter, I just tried running the tests on my M1 and got this:

//pkg/pma:pma_tests                                                      FAILED in 1.2s
  /private/var/tmp/_bazel_matt/657ca343b546ec28817b7a65bf3b8da7/execroot/__main__/bazel-out/darwin_arm64-fastbuild/testlogs/pkg/pma/pma_tests/test.log

test.log

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //pkg/pma:pma_tests
-----------------------------------------------------------------------------
pma: failed to create 1073741824-byte mapping for /tmp/definitely-exists-heap.bin at 0x200000000: Permission denied
Assertion failed: (pma_), function test_pma_, file pma_tests.c, line 204.

mcevoypeter · 2023-03-14T19:01:19Z

I've made the following changes after a live code review with @ashelkovnykov, @joemfb, and @philipcmonk:

Just-mapped pages are marked as dirty, not clean.
A redundant wal_apply() call in map_file_() has been removed.
The unmapping logic at the end of pma_sync() has been removed.
The write-ahead log is validated by recording the checksum of each page in the write-ahead log.
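As a sketch of the last item, per-page checksums in a write-ahead log could look roughly like this. Everything here is assumed for illustration: the 4096-byte page size, the `wal_entry_t` layout, and the use of FNV-1a as the hash (the diff hunks in this PR show MurmurHash3 being used in pkg/pma/wal.c instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { kSketchPageSz = 4096 }; // assumed page size

// Hypothetical metadata entry: which page was logged and its checksum.
typedef struct {
    uint64_t pg_idx;
    uint64_t checksum;
} wal_entry_t;

// 64-bit FNV-1a over the page index followed by the page contents.
static uint64_t
page_checksum(uint64_t pg_idx, const unsigned char pg[kSketchPageSz])
{
    uint64_t h = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < sizeof(pg_idx); i++) {
        h ^= (pg_idx >> (8 * i)) & 0xff;
        h *= 0x100000001b3ULL; // FNV prime
    }
    for (size_t i = 0; i < kSketchPageSz; i++) {
        h ^= pg[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Build the entry that would be recorded when a page is appended to the log.
static wal_entry_t
wal_record(uint64_t pg_idx, const unsigned char pg[kSketchPageSz])
{
    wal_entry_t e = { pg_idx, page_checksum(pg_idx, pg) };
    return e;
}

// Check a logged page against its entry before applying it.
static int
wal_entry_valid(const wal_entry_t *e, const unsigned char pg[kSketchPageSz])
{
    assert(e != NULL);
    return e->checksum == page_checksum(e->pg_idx, pg);
}
```

A torn or corrupted page, or an entry whose index no longer matches the page it describes, fails `wal_entry_valid()` and can be rejected instead of being applied to the snapshot.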

mcevoypeter · 2023-03-14T19:24:52Z

@matthew-levan I suspect the issue you're seeing is related to the fixed address pma_tests is using as the base address for the arena. I'll take a closer look.

ashelkovnykov

Not being snarky - anywhere where the comment is phrased as a question I'm genuinely unsure and would like your opinion, since you are more familiar with Vere and C than I.

pkg/pma/util.c

pkg/pma/pma.h

pkg/pma/pma.c

matthew-levan · 2023-03-17T13:35:59Z

pkg/pma/pma.c

+// CONSTANTS
+
+/// Number of bits in a byte.
+static_ const size_t kBitsPerByte = 8;


Somewhat unrelated, but where does this style of "lower camel case" for some constants come from?

All constants in pkg/pma are prefixed with a k and written in upper camel case to distinguish them from other identifiers; if you see an identifier starting with k in upper camel case, you know it's a constant and therefore will never change, reducing the amount of information you need to keep in your head (such as "where might this value get changed?").

Why not K_BITS_PER_BYTE, in that case? Upper snake case is widely used for constants and the K_ can still represent that it's defined in pkg/pma.

It's purely stylistic. We can bikeshed endlessly here. The point is to clearly communicate that a given symbol represents a constant value.

I am strongly in favor, for what it's worth, of deciding on a single style and configuring clang-format and clang-tidy (or similar) to programmatically enforce that style. In lieu of that, which has met with resistance in the past, I'm simply using a style that I've found in my experience is most clear and straightforward in communicating the intentions of C code.

I'm also curious why you're opposed to linters.

@barter-simsum Also curious what's wrong with linters / auto-formatting.

@mcevoypeter re: bike-shedding - Sure, you should consider this conversation non-blocking for the purposes of merging. It just seems odd, given a professed preference for idiomatic C, not to declare system-level constants as macros and use the standard macro naming conventions for them.

If the linter was as unobtrusive as like the linux kernel's checkpatch I wouldn't mind it much, but most are not. Note the disclaimer:

"Checkpatch is not always right. Your judgement takes precedence over checkpatch messages. If your code looks better with the violations, then its probably best left alone."

Highly opinionated linters really rose in popularity with javascript (sacred airbnb style etc). Prior to that, style guidelines were often just that, guidelines, not law. My experience with these types has been overwhelmingly negative. Code run through these things is often not more readable, just mechanistically more consistent. I also dislike the comment litter they promote if you want to disable certain rules within a region of a file.

strict characters per line, where I should and shouldn't put a comma, opinions on variable names, and var alignment, braces always or braces never, can't use 0xcafebabe (lol) literals and similar, etc etc ad nauseam -- all of this is just hell and then we hook it into the build system.

I don't care if you use a linter to review your own code. They might help find conditions to collapse, unclear conditional expressions, etc, but I definitely don't want it enshrined in the build.

And no, I don't think they reduce "bikeshedding." We just bikeshed about the linter rules instead of the code change itself.

No codebase can meet the style preferences of everyone working on it if there are even two developers. Even if there's only one developer, it still won't meet the style preferences of everyone reading it. The only thing that you can hope for is consistency.

Having some auto-formatter or linter removes this factor from consideration; it removes an axis on which to bikeshed. You're right that there's still one ultimate bikeshed which is "what do we establish the rules to be?". However, it's a much easier problem with which to grapple once the system is in place: just make the top-level code owner for each lib supreme styling dictator.

I don't care if you use a linter to review your own code... but I definitely don't want it enshrined in the build.

I think that we might be talking past each other here somewhat. I don't think anyone is pitching to connect a strict linter to the build workflow. What I'm hearing pitched is the inclusion of a clang-format config file in which the ultimate style rules for the component reside. This is to be used as a helper tool, so that the developer can write his code however he wants and run this tool once at the end before submitting PR. Maybe I'm the one who's misinterpreting what @mcevoypeter and @matthew-levan would like to have.

My experience with these types has been overwhelmingly negative. Code run through these things is often not more readable, just mechanistically more consistent.

We're just going to have to agree to disagree here. I found this PR very readable.

strict characters per line, where I should and shouldn't put a comma, opinions on variable names, and var alignment, braces always or braces never, can't use 0xcafebabe literals and similar, etc etc ad nauseam

We already have most of these things in u3, they're just manually enforced so contributors have to actively think about them instead of knowing that there's an auto-format tool that'll do it for them. Also, just because Rust chose stupid rules doesn't mean we have to.

The only thing that you can hope for is consistency

Readability. Not consistency. the former is not context free. The latter is what linters enforce.

I don't think anyone is pitching to connect a strict linter to the build workflow

This is exactly the case in new mars now (recently). I don't think intent to integrate linting into the bazel build of vere is at all an unreasonable assumption. zorp-corp/sword#39

We're just going to have to agree to disagree here. I found this PR very readable

After some of the style changes Peter made, outside of a few extra long functions (made longer by 1 arg per line funcall formatting), I have few complaints about the general style of this PR. The below style I find slightly hard to read, but I didn't comment on it, because I really don't care. I do not, however, want such a style autistically enforced be it programmatic or informal.

if (mmap(ptr, kPageSz, PROT_READ, MAP_FIXED | MAP_PRIVATE, fd, offset_) == MAP_FAILED)

We already have most of these things in u3

We have style guidelines in u3, you're free to break them if reasonable. See some of the formatting of the pointer compression pr: https://github.com/urbit/vere/pull/164/files#diff-880188529bb675cec6511e9d295a25921cd0b0f95d7c7c4c14bb2c2dbbd2d6f7R2074.

just make the top-level code owner for each lib supreme styling dictator

this encourages a proliferation of libs. We have too many Caesars already. There should just be one benevolent vere dictator. No one claims this title afaik, could be @joemfb or @belisarius222. (Or a state/deepstate type deal like with Linus and gregkh)

My biggest overall complaint with any of these efforts, is they just get in the way of writing code.

To quote Ryan Dahl: "If you think it would be cute to align all of the equals signs in your code, if you spend time configuring your window manager or editor, if put unicode check marks in your test runner, if you add unnecessary hierarchies in your code directories, if you are doing anything beyond just solving the problem - you don't understand how fucked the whole thing is"

Please don't add any more tools that yell at me, especially for trivial syntactical reasons. Please, in code reviews, avoid commenting on style, except for truly odd choices (like the _ suffix style which was addressed). Basic consistency is easily maintained. Who is harmed by slight deviations: a mul * unwrapped in spaces, a 1-line condition without swaddling braces, a switch case that allows fall through, etc?

I won't discuss this further in github comments. It's not relevant to this pr. We can settle it on the field with pistols. @joemfb is my second. Name yours ;)

pkg/pma/pma.c

matthew-levan · 2023-03-17T14:00:40Z

More on the crash I reported above (the bail: meme trace from upgrading a fakezod with a binary built from i/188): I am running the test again with gdb attached to both the king and serf processes, to see if and how the backtraces differ from before. Results may be interesting.

pkg/pma/pma.c

pkg/pma/pma.h

matthew-levan

The pma code is obviously clear and well-documented. Nice work there, I was able to fairly quickly read it and understand the system (I did gloss over verifying the bit arithmetic, though).

First, a question on style: How does this code fit in with the rest of the codebase? The PMA is written in idiomatic C, and vere is obviously far from that. Does having both idiomatic C and c3-style C in the same codebase introduce confusion? Genuinely curious about these thoughts from those with more experience with this codebase.

On design, will you please summarize your approach vs. the demand paging implementation in 1.14? It'd be nice to understand the differences and corresponding justifications/arguments.

Lastly, it's clear that this (more or less) works as intended: a few minutes after boot, a fresh fakezod will reduce its memory footprint to a mere ~75MB, which is quite an improvement! Neat.

mcevoypeter · 2023-03-20T17:26:27Z

First, a question on style: How does this code fit in with the rest of the codebase? The PMA is written in idiomatic C, and vere is obviously far from that. Does having both idiomatic C and c3-style C in the same codebase introduce confusion? Genuinely curious about these thoughts from those with more experience with this codebase.

The PMA is in a separate directory from the directories which use c3-style C, so the styles aren't mixed within files, libraries, or binaries. If the PMA introduced a similarly unusual C style alongside the existing unusual c3 style, then I agree it might cause confusion, but idiomatic C is easy to read for any C programmer, making it a safe choice.

On design, will you please summarize your approach vs. the demand paging implementation in 1.14? It'd be nice to understand the differences and corresponding justifications/arguments.

This implementation creates file-backed mappings for north.bin and south.bin, whereas the 1.14 implementation only creates a file-backed mapping for north.bin.
This implementation lazily maps new pages, whereas the 1.14 implementation maps the entire loom up front.
This implementation uses a local SIGSEGV handler to only handle SIGSEGV generated by addresses within the loom, whereas the 1.14 implementation uses a global SIGSEGV handler that handles all SIGSEGV generated by the runtime.
This implementation defines the abstraction of a persistent memory arena, whereas the 1.14 implementation does not.
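The third point can be sketched with plain `sigaction()`. This is a simplified illustration, not the PR's handler: the PR goes through sigsegv_init() and sigsegv_dispatch(), and it maps pages lazily, whereas the sketch reserves the whole arena as `PROT_NONE` up front and only changes protections. The names `on_fault`, `arena_init`, and `arena_faults` are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char                 *g_base;   // start of the arena
static size_t                g_len;    // arena length in bytes
static size_t                g_pg_sz;  // system page size
static volatile sig_atomic_t g_faults; // faults handled inside the arena

static void
on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t lo   = (uintptr_t)g_base;
    if (addr < lo || addr >= lo + g_len) {
        // Not an arena address: restore the default action and return, so
        // the faulting instruction re-executes and crashes normally.
        signal(sig, SIG_DFL);
        return;
    }
    char *pg = (char *)(addr & ~(uintptr_t)(g_pg_sz - 1));
    // mprotect() is not formally async-signal-safe, but it is the usual way
    // to implement this kind of fault-driven paging.
    if (mprotect(pg, g_pg_sz, PROT_READ | PROT_WRITE) == -1) {
        _exit(1);
    }
    g_faults++;
}

// Reserve an inaccessible arena of `len` bytes and install the handler.
static char *
arena_init(size_t len)
{
    assert(len > 0);
    g_pg_sz    = (size_t)sysconf(_SC_PAGESIZE);
    void *base = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    g_base = (char *)base;
    g_len  = len;

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_fault;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    // macOS reports protection faults as SIGBUS rather than SIGSEGV.
    if (sigaction(SIGSEGV, &sa, NULL) == -1
        || sigaction(SIGBUS, &sa, NULL) == -1) {
        return NULL;
    }
    return g_base;
}

// Number of arena faults handled so far.
static int
arena_faults(void)
{
    return (int)g_faults;
}
```

The first touch of each arena page faults once and then proceeds; a fault at any other address falls through to the default action, which is the behavior described for non-loom accesses.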

Lastly, it's clear that this (more or less) works as intended: a few minutes after boot, a fresh fakezod will reduce its memory footprint to a mere ~75MB, which is quite an improvement! Neat.

Nice! On a similar note, I'm running my planet and star with the changes, and I saw the memory usage of both drop from ~450MB to ~150MB upon switching when running top -pid <king_pid> -pid <serf_pid>.

matthew-levan · 2023-03-20T18:53:51Z

Awesome thanks for elaborating @mcevoypeter. I think this work has a lot of merits. What's left?

mcevoypeter · 2023-03-20T20:39:11Z

What's left?

#293 needs to be reviewed and then merged. The change suggested only has the potential to adversely affect the sampling profiler, which is only used in development, so it's low risk in my opinion.

…lized

barter-simsum · 2023-03-30T17:18:02Z

Some combination of memory protections, advice, and remapping should be sufficient to keep the resident set small on any OS we support. If this turns out to be wrong, and Kubernetes (for instance) requires more aggressive intervention, we'll have to explicitly, lazily/incrementally request memory from the OS (in the allocator, on road transitions, via "guard" pages to grow, &c). But that will be quite a bit more work, and I don't believe it will be necessary.

I really don't think there's anything more aggressive than MADV_DONTNEED - at least Linux's interpretation of it - which is explicitly non lazy and advises the kernel that the process doesn't expect access in the near future. An immediate read on a region madvised with DONTNEED will read in zeroes as if the same region had been freshly mmaped anonymously (or the contents of the backing file, but that isn't relevant to our usage). This has the effect of immediately reducing RSS.

It's also possible that this is too aggressive and madvising with MADV_FREE (lazy) is better. I have doubts there will be a significant performance difference in vere. Additionally, since it's lazy, advising with FREE won't necessarily immediately drop rss. I have doubts though that that would actually be a problem for kubernetes. I'd imagine on memory pressure, it would try clawing back whatever resident memory it can from each linux container.
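The Linux behavior of `MADV_DONTNEED` described above can be checked with a small sketch. The helper name `byte_after_dontneed` is hypothetical, and the zero-fill result is specific to private anonymous mappings on Linux; other systems may treat the advice differently.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Fill a private anonymous region with `fill`, advise MADV_DONTNEED over it,
// and return the first byte read afterwards (or -1 on failure).
static int
byte_after_dontneed(unsigned char fill)
{
    size_t         len = 4 * (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p   = mmap(NULL,
                              len,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS,
                              -1,
                              0);
    if (p == MAP_FAILED) {
        return -1;
    }
    memset(p, fill, len); // touch every page so that it becomes resident
    assert(p[0] == fill);

    int ret = -1;
    if (madvise(p, len, MADV_DONTNEED) == 0) {
        // On Linux the old contents are dropped immediately; this read
        // faults in a fresh zero-filled page.
        ret = p[0];
    }
    munmap(p, len);
    return ret;
}
```

The mapping itself stays valid after the call, which is what makes the advice usable on a region that will be written again later.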

This reverts commit 3f4a31e.

@mopfel-winrux

After our group review, I've made a few fixes:

- removed the `MADV_DONTNEED` immediately after mmap in pma_load. mmap doesn't make anything resident, so it's unnecessary.
- removed the redundant `PS_MAPPED_INACCESSIBLE` state and instead check the guard page address, which we store anyway.
- removed `PS_UNMAPPED` now that everything is always mapped.
- This reduces the number of page states to two: clean or dirty. The default state is dirty, so that if we unexpectedly fault on a page (perhaps due to bad initialization), we will crash in `_handle_page_fault`.
- At the end of pma_sync, we no longer loop to mark those pages as clean. It doesn't matter what state they're in, since `_append_dirty_pages` is bounded by the actual size of the heap/stack.
- Instead, we `MADV_DONTNEED` the ephemeral space. This allows us to reclaim ephemeral memory on sync. As noted in the comment, this is likely fast enough to do after every event, at least if we use `MADV_FREE`, which is lazy. For now, we use the strict `MADV_DONTNEED` to make it easier to observe its behavior.

This last change fixes the issue @mopfel-winrux reported here: #277 (comment)

philipcmonk · 2023-04-04T20:24:52Z

After our group review, I've made a few fixes:

removed the MADV_DONTNEED immediately after mmap in pma_load. mmap doesn't make anything resident, so it's unnecessary.
removed the redundant PS_MAPPED_INACCESSIBLE state and instead check the guard page address, which we store anyway.
removed PS_UNMAPPED now that everything is always mapped.
This reduces the number of page states to two: clean or dirty. The default state is dirty, so that if we unexpectedly fault on a page (perhaps due to bad initialization), we will crash in _handle_page_fault.
At the end of pma_sync, we no longer loop to mark those pages as clean. It doesn't matter what state they're in, since _append_dirty_pages is bounded by the actual size of the heap/stack.
Instead, we MADV_DONTNEED the ephemeral space. This allows us to reclaim ephemeral memory on sync. As noted in the comment, this is likely fast enough to do after every event, at least if we use MADV_FREE, which is lazy. For now, we use the strict MADV_DONTNEED to make it easier to observe its behavior.

This last change greatly reduces the average memory use of the process by fixing the issue @mopfel-winrux reported here: #277 (comment)
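A two-state page tracker with "dirty" as the default can be sketched as a plain bitmap. The names and the 64-page size below are assumptions for illustration and do not reflect the actual data structure in pkg/pma.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { kSketchPages = 64 }; // assumed arena size in pages

// One bit per page: 1 = dirty, 0 = clean.
typedef struct {
    uint8_t bits[kSketchPages / 8];
} dirty_map_t;

// Every page starts out dirty, so an uninitialized page is never mistaken
// for one that matches the snapshot on disk.
static void
dirty_map_init(dirty_map_t *m)
{
    memset(m->bits, 0xff, sizeof(m->bits));
}

static int
page_is_dirty(const dirty_map_t *m, size_t pg)
{
    assert(pg < (size_t)kSketchPages);
    return (m->bits[pg >> 3] >> (pg & 7)) & 1;
}

static void
page_set_clean(dirty_map_t *m, size_t pg)
{
    assert(pg < (size_t)kSketchPages);
    m->bits[pg >> 3] &= (uint8_t)~(1u << (pg & 7));
}

static void
page_set_dirty(dirty_map_t *m, size_t pg)
{
    assert(pg < (size_t)kSketchPages);
    m->bits[pg >> 3] |= (uint8_t)(1u << (pg & 7));
}

// Count dirty pages among the first `n_pages`, i.e. bounded by how much of
// the arena is actually in use rather than by the size of the bitmap.
static size_t
dirty_count(const dirty_map_t *m, size_t n_pages)
{
    assert(n_pages <= (size_t)kSketchPages);
    size_t n = 0;
    for (size_t pg = 0; pg < n_pages; pg++) {
        n += (size_t)page_is_dirty(m, pg);
    }
    return n;
}
```

Because the scan in `dirty_count()` is bounded by the pages in use, the state of bits beyond that bound does not matter, which mirrors the reasoning given for dropping the mark-as-clean loop.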

I believe these are all the blocking changes we identified, but we did not finish walking through the code. @joemfb and everyone else, let's continue that soon.

As a sanity benchmark, I tried refreshing groups (from a few weeks ago, before recent performance improvements), counting only 2nd and subsequent refreshes, both with and without memory pressure (running another ship on the same machine using 2.5G/3.7G, which uses 700MB swap on master). The margin of error is pretty high, but all results were between 15 and 22 seconds, with no discernable pattern (eg master under memory pressure gave both 16s and 22s). Thus, I don't believe this introduces any significant slowdown.


barter-simsum · 2023-04-04T22:40:27Z

We used to write the snapshots to .bhk every time we u3e_saved (by calling u3e_backup). We're no longer doing that in pma_sync. u3m_backup replaced the functionality of the former, but that only gets called on pier startup and on chop. Is that intentional? I didn't see it explicitly called out anywhere. @mcevoypeter @joemfb

-- IGNORE misread implementation of u3e_backup

philipcmonk · 2023-04-05T01:40:17Z

When did we start doing that? I remember bhk being used exclusively for when you upgraded to v1.8 (or thereabouts), and then it got repurposed to be used every chop, which seems proper. I don't remember it happening more often than that, and I feel like you don't want it to -- if you're constantly backing up, then unless corruption is caught immediately (in which case you don't need the backup), your backup will be corrupted.

barter-simsum · 2023-04-05T01:51:54Z

@philipcmonk

misread the u3e_backup implementation.

Looks like 3c13a6d added a call to _ce_backup which only backed up the pier if a bhk didn't already exist (fine).

origin/develop has some changes from Matt that changed _ce_backup to u3m_backup. A conditional is passed in the call that if c3n (it is in the u3e_save call) will exit early and not take the snapshot. That conditional is c3y (forces the backup) in the _cw_chop.

joemfb

Notes from a review of the WAL:

joemfb · 2023-04-05T19:21:55Z

pkg/vere/main.c

+                strerror(errno));
+        exit(ECANCELED);
+      }
+    }


This should be removed, it's not safe to unconditionally take a backup on startup (it may be corrupted).

joemfb · 2023-04-05T19:36:42Z

pkg/pma/wal.c

+    /// Global checksum.
+    uint64_t global_checksum;
+    /// WAL version number.
+    uint64_t version;


The version number should be the first member of the struct.
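A sketch of a header laid out that way follows, presumably so that a reader can check the version before interpreting any other field. The type name `metadata_hdr_sketch_t` and the helpers are hypothetical; only the two field names come from the hunk above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical metadata header with the version as its first member.
typedef struct {
    uint64_t version;
    uint64_t global_checksum;
} metadata_hdr_sketch_t;

_Static_assert(offsetof(metadata_hdr_sketch_t, version) == 0,
               "version must be the first member");

// Offset of the first metadata entry: just past the whole header.
static size_t
first_entry_offset(void)
{
    return sizeof(metadata_hdr_sketch_t);
}

// Read the version from raw header bytes without relying on the layout of
// any later field.
static uint64_t
peek_version(const unsigned char *raw)
{
    assert(raw != NULL);
    uint64_t version;
    memcpy(&version, raw, sizeof(version));
    return version;
}
```

With the version at offset zero, later revisions of the header can add or reorder fields without changing how the version itself is located.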

joemfb · 2023-04-05T19:40:56Z

pkg/pma/wal.c

+
+    // Don't include the header length in the entry count calculation.
+    if (meta_len > 0) {
+        meta_len -= sizeof(_metadata_hdr_t);


This could underflow.
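For example, if `meta_len` is unsigned and smaller than the header, the subtraction wraps around to a huge value. A guarded version might look like the sketch below, where `entry_count` and its parameters are hypothetical and the lengths are assumed to be `size_t`.

```c
#include <assert.h>
#include <stddef.h>

// Compute how many fixed-size entries follow the header in a metadata file
// of `meta_len` bytes. Returns 0 on success and -1 if the file is malformed.
static int
entry_count(size_t meta_len, size_t hdr_len, size_t entry_len, size_t *count)
{
    assert(entry_len > 0 && count != NULL);
    if (meta_len == 0) {
        *count = 0; // empty file: nothing has been logged
        return 0;
    }
    if (meta_len < hdr_len) {
        return -1; // truncated header; meta_len - hdr_len would wrap
    }
    size_t body = meta_len - hdr_len;
    if (body % entry_len != 0) {
        return -1; // trailing partial entry
    }
    *count = body / entry_len;
    return 0;
}
```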

joemfb · 2023-04-05T19:46:23Z

pkg/pma/wal.c

+                strerror(err));
+            goto fail;
+        }
+        assert(hdr.version == kWalVersion);


Need an error message on version mismatch.
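A checked alternative to the `assert()` could report the mismatch and let the caller fail cleanly. In this sketch the function name, the message text, and the version value of 1 are all assumptions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static const uint64_t kSketchWalVersion = 1; // assumed value

// Compare the version found in the header with the supported one. On
// mismatch, write a human-readable message into `err` and return -1.
static int
check_wal_version(uint64_t found, char *err, size_t err_len)
{
    assert(err != NULL && err_len > 0);
    if (found == kSketchWalVersion) {
        err[0] = '\0';
        return 0;
    }
    snprintf(err,
             err_len,
             "wal: version mismatch: expected %" PRIu64 ", found %" PRIu64,
             kSketchWalVersion,
             found);
    return -1;
}
```

Unlike `assert()`, this check still runs in builds compiled with `NDEBUG`, and the message tells the user which versions were involved.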

joemfb · 2023-04-05T20:15:54Z

pkg/pma/wal.c

+    }
+
+    // Seek past the global checksum to the first metdata entry.
+    if (lseek(wal->meta_fd, sizeof(wal->checksum), SEEK_SET) == (off_t)-1) {


This should be sizeof the metadata header.

joemfb · 2023-04-05T20:24:44Z

pkg/pma/wal.c

+        return -1;
+    }
+
+    char              page[kPageIdxSz + kPageSz];


This declaration is vestigial.

joemfb · 2023-04-05T20:31:02Z

pkg/pma/wal.c

+_page_checksum(ssize_t pg_idx, const char pg[kPageSz])
+{
+    uint64_t pg_idx_checksum = 0;
+    MurmurHash3_x86_32(&pg_idx, kPageIdxSz, kSeed, &pg_idx_checksum);


The index can be included directly without hashing it first.

joemfb · 2023-04-05T20:35:34Z

pkg/pma/wal.c

+        goto fail;
+    }
+
+    if (write_all(wal->meta_fd, &wal->checksum, sizeof(wal->checksum)) == -1) {


This should also write the full metadata header, referring to the struct explicitly.

pkg/pma/pma.c

jalehman · 2023-04-12T15:07:38Z

Note to future self: @joemfb and @philipcmonk are working together on this offline.

This PR is a forward-port / rewrite of the demand-paging implementation from v1.14 (see urbit/urbit#6063, urbit/urbit#6126, urbit/urbit#6127, and urbit/urbit#6152). The original scope has been decreased, and the implementation simplified: i/o errors are not retried, the dirty page bitmap is manipulated with much simpler code, page offsets/pointers are calculated with macros, &c. There are additional layers of snapshot validation for updates (controlled at compile time, always on as of this PR); clean pages are compared to disk both before and after update. (This validation should stay on for pre-release testing, and possibly for initial release as well.) This PR has been tested extensively on live ships; the corruption issues that plagued v1.14 cannot be reproduced. Fixes #188. Supersedes #277. (Was previously opened as #401, but a typo in the branch name was preventing updates.)

mcevoypeter force-pushed the i/188 branch from 33b04b7 to 67d596c Compare March 7, 2023 19:38

mcevoypeter marked this pull request as ready for review March 7, 2023 20:55

mcevoypeter requested a review from a team as a code owner March 7, 2023 20:55

ashelkovnykov reviewed Mar 17, 2023
