Advisory locking for metadata and artifacts files #2861
Conversation
This likely fails on all platforms right now, but the Windows behaviour cannot be fixed without actual locking. Signed-off-by: Jussi Kukkonen <[email protected]>
Signed-off-by: Jussi Kukkonen <[email protected]>
Use get_targetinfo() so that the delegated role loading is tested as well Signed-off-by: Jussi Kukkonen <[email protected]>
This should prevent issues with multiple processes trying to write at the same time. Signed-off-by: Jussi Kukkonen <[email protected]>
Signed-off-by: Jussi Kukkonen <[email protected]>
Otherwise another process might delete the file underneath us Signed-off-by: Jussi Kukkonen <[email protected]>
Signed-off-by: Jussi Kukkonen <[email protected]>
There does not seem to be a way around an ugly loop over open()... Signed-off-by: Jussi Kukkonen <[email protected]>
Signed-off-by: Jussi Kukkonen <[email protected]>
The file locking should make multiple processes safe Signed-off-by: Jussi Kukkonen <[email protected]>
Will take a look next week
On Windows I get a lot of sequences of the errors shown below. It looks like one process gets to lock the file in each of the test loops while the other ones all have to wait for quite a while, until another one takes over the lock for a while. How many files are you locking, such that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as the cause of 'a deadlock'. Other than that I see no errors, so that's good.

Update... All(!!!) 3 of the 3 test processes terminated like this:
            yield f
            return
    except FileNotFoundError:
        # could be from yield or from open() -- either way we bail
You are opening the file for writing. Should this ever lead to a FileNotFoundError?
This can come from either:
- the yield (another part of the Updater code did not find some file when it expected one)
- or the open() a few lines above, if the parent directory does not exist (something that can mostly happen in test cases since we do create the directory slightly before this... but it can happen)
I've left the "Unsuccessful lock attempt" as a warning on Windows -- I don't work on Windows so I don't really know how common this is, but my assumption is it does not come up in normal use: more details below.
Can you explain how this differs from what you expected? Overall that sounds like how file locking is supposed to work. If you are seeing a case where many updaters run successfully in one process while the updater in another process remains locked out for longer than seems "statistically" reasonable: yeah, that's how msvcrt has decided to implement locking.
I'm not sure I understand the question, but I believe that seeing 'Resource deadlock avoided' in convoluted test cases (ones that repeatedly start Updaters) is unavoidable: the typical way 'Resource deadlock avoided' happens is that msvcrt.locking() finds the file locked by some other process on each of its once-per-second attempts over 10 seconds.
Note that this does not mean that a single process had the lock for 9 seconds: there could have been a hundred different locks during that time. I've left the 'Resource deadlock avoided' as a
I'm going to need a few more details to reproduce this -- preferably a test case that runs on CI.
It's like one process is blocking all the other ones for a considerable amount of time, which is weird. I thought they would all get their fair share of access to the lock.
Exactly.
I guess the sleeping in user space explains it.
Ok.
It happens with that sigstore loop I am using for testing, but now it takes a very long time for this to happen.
I'm not that worried if this only happens under obnoxious amounts of unrealistic load that can't be reproduced in CI: like I said, the Windows code is currently designed to fail eventually anyway and I think that's fine. That said, the error you mention seems unexpected, I can have a look... could you provide the full error? The snippet does not yet show how exactly this happens.
This is the test I am running in 3 Python interpreters, for I would say 15 minutes or so:
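(The script itself is not included in this excerpt. Purely as an illustration -- the URL, target name and loop shape below are assumptions, not the actual test -- a stress loop of roughly this kind exercises the same code paths: repeatedly constructing an Updater against a shared metadata directory and fetching target info.)

```python
# Illustration only -- not the actual test script from this conversation.
# Assumes a trusted sigstore root.json is already present in METADATA_DIR.
import os

from tuf.ngclient import Updater

METADATA_DIR = os.path.expanduser("~/tuf-stress/metadata")  # shared by all processes

while True:
    updater = Updater(
        metadata_dir=METADATA_DIR,
        metadata_base_url="https://tuf-repo-cdn.sigstore.dev",
    )
    updater.refresh()
    updater.get_targetinfo("trusted_root.json")
```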
One of them broke as shown below. Currently, the other two tests are still running, but they have always broken in the same way before, even the last one that basically runs 'alone'. I would say this is unlikely to be a bug from this PR.
Thanks. That is definitely a case where

Not amazing, but unless we can reproduce that with some reasonable amount of load I wouldn't worry too much -- that file was definitely opened thousands if not tens of thousands of times in that test. I do wonder if we should try to avoid unnecessary writes in these cases -- the initial root is the same one (in the 99.9% happy path) so we could just compare the content and avoid writing if it's the same... EDIT: avoiding writes during init is not too hard. It does not mean any fewer open() calls, but avoiding the writes feels right.
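A minimal sketch of that "avoid unnecessary writes" idea (function and variable names are hypothetical, not the PR's code): compare the bootstrap root with what is already on disk and skip the write when they match.

```python
import os


def write_root_if_changed(metadata_dir: str, bootstrap_root: bytes) -> None:
    """Write the initial root.json only if the on-disk copy differs (sketch)."""
    root_path = os.path.join(metadata_dir, "root.json")
    try:
        with open(root_path, "rb") as f:
            if f.read() == bootstrap_root:
                return  # 99.9% happy path: identical root already cached
    except OSError:
        pass  # no cached root (or unreadable): fall through and write it
    with open(root_path, "wb") as f:
        f.write(bootstrap_root)
```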
Status:
I think I will at least try the non-blocking version on POSIX too: this makes the failure mode easier for users, as we can print the lock file path in the timeout error message, so users can just delete the file if they want (essentially overriding the current lock).
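A rough sketch of what that non-blocking POSIX variant could look like (the timeout value, names and error text are assumptions, not the PR's code): retry LOCK_NB attempts until a deadline, and report the lock file path on failure so the user can remove a stale lock.

```python
import fcntl
import time
from contextlib import contextmanager
from typing import IO, Iterator

LOCK_TIMEOUT = 60  # seconds; the actual value would be a design decision


@contextmanager
def lock_file(path: str) -> Iterator[IO]:
    with open(path, "wb") as f:
        deadline = time.monotonic() + LOCK_TIMEOUT
        while True:
            try:
                # non-blocking attempt: raises OSError if another process holds the lock
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        f"Timed out waiting for lock {path}: "
                        "delete the file to override a stale lock"
                    ) from None
                time.sleep(0.1)
        yield f
    # leaving the 'with' block closes the file, which also releases the lock
```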
Great work, @jku!
I didn't find any problems in the lock implementation, i.e. the context managers. The main question is whether we indeed cover all code regions that should be locked. But I think it's fine.
Regarding a timeout on POSIX, I probably wouldn't bother. Right now, if an updater process hangs, it won't time out either.
        yield f

except ModuleNotFoundError:
    # Windows file locking, in belt-and-suspenders-from-Temu style:
😆
@contextmanager
def lock_file(path: str) -> Iterator[IO]:
    with open(path, "wb") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
It took me a while to realise that closing the file releases the lock (even after reading the fcntl docs). Maybe this is obvious to others, but a brief comment about how this works might be helpful.
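Something along these lines, perhaps (a sketch based on the hunk above, with added comments; not a proposed patch):

```python
import fcntl
from contextlib import contextmanager
from typing import IO, Iterator


@contextmanager
def lock_file(path: str) -> Iterator[IO]:
    with open(path, "wb") as f:
        # Advisory lock: other processes calling lockf() on the same file
        # block here until the lock is released.
        fcntl.lockf(f, fcntl.LOCK_EX)
        yield f
        # No explicit unlock needed: closing the file at the end of the
        # 'with' block releases the lock.
```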
for _ in range(100):
    try:
        with open(path, "wb") as f:
            msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
Can you lock 1 byte in an empty file? I guess you can.
# * msvcrt.locking() does not even block until file is available: it just
#   tries once per second in a non-blocking manner for 10 seconds. So if
#   another process keeps opening the file it's unlikely that we actually
#   get the lock
So this means the timeout could be anything between 10 and 40 seconds? Would it make sense to use LK_NBLCK instead of LK_LOCK to fully control the timeout?
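For what it's worth, the non-blocking variant might look roughly like this (attempt count and sleep interval are made up for illustration): LK_NBLCK fails immediately instead of retrying internally for 10 seconds, so the caller controls the total wait.

```python
import msvcrt
import time

MAX_ATTEMPTS = 100   # illustrative values, not from the PR
SLEEP_SECONDS = 0.1


def lock_first_byte(path: str):
    """Try to lock 1 byte of the file on our own retry schedule (sketch).

    Returns the open file; caller is responsible for LK_UNLCK and close().
    """
    for _ in range(MAX_ATTEMPTS):
        try:
            f = open(path, "wb")
        except PermissionError:
            time.sleep(SLEEP_SECONDS)  # another process has the region locked
            continue
        try:
            # LK_NBLCK raises OSError immediately if the byte is already locked
            msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)
            return f
        except OSError:
            f.close()
            time.sleep(SLEEP_SECONDS)
    raise TimeoutError(f"Could not lock {path}")
```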
return self._preorder_depth_first_walk(target_path)
with self._lock_metadata():
    if Targets.type not in self._trusted_set:
        # implicit refresh
You're not calling refresh(), which has its own lock, so that the pre-order depth-first walk is included in the same lock, right?
Add advisory file locking to make it safer to run multiple updaters (using the same metadata directory) in separate processes. This should help with #2836
<METADATA_DIR>/.lock: this lock is used in Updater.__init__(), Updater.refresh() and Updater.get_targetinfo() -- the lock is typically held through the whole method (in other words, while the metadata is being downloaded as well).

@stefanberger comments welcome
Implementation notes
The implementation has just enough platform-specific complexity that I'm not totally happy with it. There's clearly a chance of bugs here... Unfortunately I'm not convinced using a dependency would make that possibility go away.
- On Windows, msvcrt.locking() is the recommended API, but 99% of the time it's not useful because we need a file handle to call it, and we can't get one because another process already has the file open. So we need a dumb loop that keeps calling open() and sleep(). It's ugly, but no-one seems to have a better solution.
- On POSIX, fcntl.lockf() simply blocks until the lock opens.
- Both have a tradeoff: on Windows the updater might fail if another updater is a bit slow. On POSIX, if one updater hangs with a lock for some reason, all future updaters will just wait indefinitely until the original hung process is killed.

TODO
Alternatives
https://github.com/tox-dev/filelock/ looks reasonable -- although artifact locking will be either more complicated or less parallel since you can't use filelock on the actual file you want to write.