
draft: use db instead of storing items to be handled in memory while pulling#10183

Draft
imsodin wants to merge 14 commits into syncthing:main from imsodin:simplify-pull

Conversation

@imsodin
Member

@imsodin imsodin commented Jun 16, 2025

This is not ready to be merged, both because it's likely not finished and because it contains changes that probably shouldn't be there. I just realized it was a bit pointless not to show what I am talking about on the forum just because it's not ready, especially because I likely won't have time to get it anywhere closer to ready this week.

@calmh
Member

calmh commented Jun 17, 2025

This could do with just a couple of sentences describing the new intended mechanism, for context when sifting through the diffs :)

@imsodin
Member Author

imsodin commented Jun 17, 2025

To make the diff at least not entirely unreadable, look at the two sequences of commits separately; really only the second one is interesting.

Some quick pointers for that "actual change":

  1. I removed the deletion slices/maps and the file queue, instead using AllNeededGlobalFiles on every pass (three times).
  2. Along with the deletion map, the bucket used to detect renames is gone too. Instead, for every file to be pulled, I now check for already existing local files with the same hash using AllLocalFilesWithBlocksHash, and check with GetGlobalFile whether any of those will be deleted.
  3. As the files to be pulled aren't held in a queue anymore, we can't serve the needed-files listing straight from it either. That now happens from the DB, via a newly introduced variant of AllNeededGlobalFiles that returns only metadata, for efficiency.
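The rename detection in point 2 can be sketched roughly as follows. This is a hedged, self-contained sketch: the types and method signatures here (`FileInfo`, `DB`, the in-memory backing store) are hypothetical stand-ins; only the method names `AllLocalFilesWithBlocksHash` and `GetGlobalFile` come from the discussion, and the real Syncthing signatures differ.

```go
package main

import "fmt"

// FileInfo is a hypothetical, minimal stand-in for Syncthing's file
// metadata: just a name, a hash over all blocks, and a deleted flag.
type FileInfo struct {
	Name       string
	BlocksHash string
	Deleted    bool
}

// DB is an in-memory stand-in for the real database.
type DB struct {
	local  []FileInfo          // local files
	global map[string]FileInfo // global state, keyed by name
}

// AllLocalFilesWithBlocksHash returns local files whose content hash
// matches h (name taken from the PR discussion; signature is made up).
func (db *DB) AllLocalFilesWithBlocksHash(h string) []FileInfo {
	var out []FileInfo
	for _, f := range db.local {
		if f.BlocksHash == h {
			out = append(out, f)
		}
	}
	return out
}

// GetGlobalFile returns the global entry for a name, if any.
func (db *DB) GetGlobalFile(name string) (FileInfo, bool) {
	f, ok := db.global[name]
	return f, ok
}

// renameSource looks for a local file with identical content that is
// globally deleted: pulling "need" then becomes a local rename instead
// of a network transfer.
func renameSource(db *DB, need FileInfo) (string, bool) {
	for _, cand := range db.AllLocalFilesWithBlocksHash(need.BlocksHash) {
		if g, ok := db.GetGlobalFile(cand.Name); ok && g.Deleted {
			return cand.Name, true
		}
	}
	return "", false
}

func main() {
	db := &DB{
		local: []FileInfo{{Name: "old.txt", BlocksHash: "abc"}},
		global: map[string]FileInfo{
			"old.txt": {Name: "old.txt", BlocksHash: "abc", Deleted: true},
		},
	}
	src, ok := renameSource(db, FileInfo{Name: "new.txt", BlocksHash: "abc"})
	fmt.Println(src, ok) // old.txt true
}
```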

The changes above are already a lot, but there were also some inconsistencies I changed/fixed/papered over on the way, plus some unrelated tweaks; all of that needs weeding out. Again, this PR is very much just for reference in case you want to look something up about it, I wouldn't recommend investing time into it otherwise yet.

@calmh
Member

calmh commented Jun 17, 2025

Right, it sounds mostly reasonable. I need to look at the details of course, but two things strike me right away:

  • You can't do a lot of work while holding a db iterator, because all writes go to the WAL file while a select is open. This was one of the initial points of contention with the SQLite implementation: the WAL file can grow unbounded if we're receiving updates or making changes with an open iterator. Short iterator passes are key (in time, so think "this should take at most a couple of seconds" per loop).
  • There may be consistency issues with the multiple passes, like: we create needed directories in pass one, but pass two may include more files, and their required directories, than we knew about in pass one...

@imsodin
Member Author

imsodin commented Jun 17, 2025

Argh, yeah, the first aspect will likely be problematic with my approach. Actually it is problematic as-is for sure: nothing short-lived happens during the second phase of the iteration. But it's probably solvable. As we do ordered passes, maybe I'll just add a timeout, release the iterator and start a new one? I'll have to think about this a bit more and try some things.

The second one I did consider, and from what I can see we diligently check for issues like that. That is, an inconsistency can happen, but it won't cause problems beyond single items failing, and those will be redone/fixed on the next pull iteration.

@calmh
Member

calmh commented Jun 17, 2025

I think targeted queries with limit clauses can work around some of it, e.g. grab the first 25 needed items of type directory and process those, then grab the first 25 files and process them in memory, etc.
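The per-type LIMIT idea might look roughly like this. A hedged sketch only: `needed` and `firstN` are hypothetical, with `firstN` standing in for a `SELECT ... WHERE type = ? ORDER BY name LIMIT n` query against the real database.

```go
package main

import "fmt"

// needed is a hypothetical needed-item record.
type needed struct {
	name  string
	isDir bool
}

// firstN stands in for a LIMIT-bounded query: return at most n needed
// items of the requested type. Each call is a short, cheap DB pass.
func firstN(all []needed, dir bool, n int) []needed {
	var out []needed
	for _, f := range all {
		if f.isDir == dir {
			out = append(out, f)
			if len(out) == n {
				break
			}
		}
	}
	return out
}

func main() {
	all := []needed{{"d1", true}, {"f1", false}, {"d2", true}, {"f2", false}}
	dirs := firstN(all, true, 25)   // create needed directories first
	files := firstN(all, false, 25) // then pull files, batch by batch
	fmt.Println(len(dirs), len(files)) // 2 2
}
```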

@imsodin
Member Author

imsodin commented Jun 17, 2025

I was thinking timeout because an item count isn't necessarily well correlated with time spent processing. Anyway, details; we could always combine both if necessary. However, what seems simple at first isn't, I think: we don't handle all the needed items while iterating, so when we iterate over, say, the first 50, we may handle only 10 right away. On the next iteration we should then skip the first 40 needed items. We'd have to keep track of that, which is already annoying, and other changes like index updates might also throw a spanner in the works. As we do an ordered iteration, we could remember the last handled file and, instead of using an offset, search until that one. Better, but also not great (Edit: also doesn't work, we might have handled that file, so it's no longer present. We'd have to use the value of the ordered field, urgh). Ideally we'd have read snapshots :P
Or go back to having a queue, but put it in a (separate) DB: do one main DB iteration pass to put all relevant info into that separate db, then operate on that until done, then discard it. That's a bit ugly/heavy-handed though.
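The "use the value of the ordered field" idea from the edit above is keyset pagination, which survives items disappearing mid-run. A hypothetical sketch (not Syncthing code): resume each batch strictly after the last seen key instead of using an offset, so it stays correct even when already-handled items are gone.

```go
package main

import (
	"fmt"
	"sort"
)

// nextBatch returns up to limit names that sort strictly after the
// resume key. Because we compare against the ordered field's value
// rather than counting an offset, items removed since the previous
// batch don't shift the window.
func nextBatch(names []string, after string, limit int) []string {
	sort.Strings(names)
	var out []string
	for _, n := range names {
		if n <= after {
			continue
		}
		out = append(out, n)
		if len(out) == limit {
			break
		}
	}
	return out
}

func main() {
	need := []string{"a", "b", "c", "d", "e"}
	b1 := nextBatch(need, "", 2) // first batch: a, b
	last := b1[len(b1)-1]

	// Suppose "a" was handled and is no longer present; resuming by
	// key still lands exactly where we left off.
	need = []string{"b", "c", "d", "e"}
	fmt.Println(nextBatch(need, last, 2)) // [c d]
}
```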
