
Conversation

jku
Member

@jku jku commented Aug 29, 2025

Add advisory file locking to make it safer to run multiple updaters (using the same metadata directory) in separate processes. This should help with #2836

  • Metadata access is protected with <METADATA_DIR>/.lock: this lock is used in Updater.__init__(), Updater.refresh() and Updater.get_targetinfo() -- the lock is typically held through the whole method (in other words, while the metadata is being downloaded as well)
  • Artifact cache access is protected with a lock on the specific artifact file
  • No dependencies (whether this is a good idea remains to be seen, see later comments about complexity and alternatives)

@stefanberger comments welcome

Implementation notes

The implementation has just enough platform-specific complexity that I'm not totally happy with it. There's clearly a chance of bugs here... Unfortunately I'm not convinced that using a dependency would make that possibility go away.

  • The main complexity relates to the weird file access mechanism on Windows: msvcrt.locking() is the recommended API, but 99% of the time it's not useful because we need a file handle to call it, and we can't get one because another process already has the file open. So we need a dumb loop that keeps calling open() and sleep(). It's ugly, but no one seems to have a better solution (see the sketch after this list)
  • The core difference between the two implementations is this: the Windows implementation will time out in ~30 seconds if it does not get a lock, while the POSIX implementation will keep waiting in fcntl.lockf() until the lock becomes available. Both have a tradeoff: on Windows the updater might fail if another updater is a bit slow; on POSIX, if one updater hangs while holding the lock for some reason, all future updaters will just wait indefinitely until the original hung process is killed
  • There is a test that starts 10 processes, each running 50 metadata refreshes as fast as possible against the same metadata directory
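A condensed sketch of the two approaches described above (illustrative only, not the exact PR code -- retry counts, sleep intervals and exception handling are approximations):

import sys
import time
from contextlib import contextmanager
from typing import IO, Iterator

if sys.platform != "win32":
    import fcntl

    @contextmanager
    def lock_file(path: str) -> Iterator[IO]:
        with open(path, "wb") as f:
            # blocks until the lock is free; the lock is released when the
            # file is closed at the end of the with-block
            fcntl.lockf(f, fcntl.LOCK_EX)
            yield f
else:
    import msvcrt

    @contextmanager
    def lock_file(path: str) -> Iterator[IO]:
        # the "dumb loop": keep calling open() and sleep() until we both get
        # a handle and manage to lock it
        for _ in range(100):
            try:
                f = open(path, "wb")
            except OSError:
                time.sleep(0.3)  # another process has the file open
                continue
            try:
                msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
            except OSError:
                # e.g. "Resource deadlock avoided" after locking() gives up
                f.close()
                time.sleep(0.3)
                continue
            try:
                yield f
            finally:
                f.close()  # closing the file releases the lock
            return
        raise TimeoutError(f"Could not lock {path}")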

TODO

  • I think the POSIX locking should probably time out as well if someone hogs the lock for long enough.
  • On the other hand, someone somewhere will always hit the timeout (even though it would work some time later), whatever value we set the timeout to :(

Alternatives

https://github.com/tox-dev/filelock/ looks reasonable -- although artifact locking will be either more complicated or less parallel since you can't use filelock on the actual file you want to write.
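For reference, roughly what a filelock-based artifact lock might look like (an illustrative sketch; the sibling ".lock" path and the 60 second timeout are arbitrary choices, not part of this PR):

from filelock import FileLock, Timeout

from tuf.api.metadata import TargetFile
from tuf.ngclient import Updater


def download_artifact(updater: Updater, targetinfo: TargetFile, artifact_path: str) -> None:
    # filelock locks a separate sibling file, so the artifact itself stays
    # unlocked: processes that bypass this helper could still touch it.
    lock = FileLock(artifact_path + ".lock", timeout=60)
    try:
        with lock:
            updater.download_target(targetinfo, filepath=artifact_path)
    except Timeout:
        raise RuntimeError(f"Could not acquire {artifact_path}.lock within 60s") from None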

jku added 10 commits August 22, 2025 18:37
This likely fails on all platforms right now, but the Windows
behaviour cannot be fixed without actual locking.

Signed-off-by: Jussi Kukkonen <[email protected]>
Use get_targetinfo() so that the delegated role loading is tested as
well

Signed-off-by: Jussi Kukkonen <[email protected]>
This should prevent issues with multiple processes trying to write at the
same time.

Signed-off-by: Jussi Kukkonen <[email protected]>
Signed-off-by: Jussi Kukkonen <[email protected]>
Otherwise another process might delete the file underneath us

Signed-off-by: Jussi Kukkonen <[email protected]>
There does not seem to be a way around an ugly loop over open()...

Signed-off-by: Jussi Kukkonen <[email protected]>
The file locking should make multiple processes safe

Signed-off-by: Jussi Kukkonen <[email protected]>
@jku jku requested a review from a team as a code owner August 29, 2025 08:05
@jku jku marked this pull request as draft August 29, 2025 08:05
@coveralls

Coverage Status

coverage: 95.316% (-1.3%) from 96.603%
when pulling 55dbb53 on advisory-locking
into 7ad10ad on develop.

@lukpueh
Member

lukpueh commented Aug 29, 2025 via email

@stefanberger

On Windows I get a lot of sequences of the errors shown below. It looks like one process gets to lock the file in each one of the test loops but the other ones all have to wait for quite a while until another one starts locking for a while then.

How many files are you locking, so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

Other than that I see no errors, so that's good.

(venv) C:\Users\StefanBerger\python-tuf>python
Python 3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from sigstore import sign
... while True:
...     sign.TrustedRoot.production()
...
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided

@stefanberger

Update... All(!!!) 3 of the 3 test processes terminated like this:

  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 376, in _persist_file
    raise e
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 369, in _persist_file
    os.replace(temp_file.name, filename)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\tmpzxnsj1mu' -> 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\root_history\\12.root.json'

    yield f
    return
except FileNotFoundError:
    # could be from yield or from open() -- either way we bail


You are opening the file for writing. Should this ever lead to a FileNotFoundError?

Member Author

@jku jku Sep 1, 2025


This can come from either:

  • yield (another part of the Updater code did not find some file when it expected one)
  • or the open() a few lines above, if the parent directory does not exist (something that mostly happens in test cases, since we do create the directory slightly before this... but it can happen)

@jku
Member Author

jku commented Sep 1, 2025

> On Windows I get a lot of sequences of the errors shown below.

I've left the "Unsuccessful lock attempt" as a warning on Windows -- I don't work on windows so I don't really know how common this is but my assumption is it does not come up in normal use: more details below.

> It looks like one process gets to lock the file in each of the test loops, while the others all have to wait for quite a while until another one takes over the lock for a while.

Can you explain how this differs from what you expected? Overall that sounds like how file locking is supposed to work.

If you are seeing a case where many updaters run successfully in one process while the updater in another process remains locked out for longer than seems "statistically" reasonable: yeah, that's how msvcrt has decided to implement locking(). It mostly sleeps and only rarely tries to get the lock -- this likely works fine in normal usage but looks odd when stress testing, as it seems like one process is just hogging the lock. In reality it's releasing the lock and then taking another lock while the other processes are still sleeping.

> How many files are you locking, so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

I'm not sure I understand the question but

  • access to all metadata files is protected with one lock on <METADATA_DIR>/.lock
  • artifact access is protected with individual locks on the artifact files themselves

I believe that seeing 'Resource deadlock avoided' in convoluted test cases (that repeatedly start Updaters) is unavoidable: The typical way that 'Resource deadlock avoided' happens is

  • open unexpectedly succeeds even though another process has the file open already -- this happens when the system is under load (as it would be in a test case)
  • msvcrt.locking() tries 10 times to get a lock during 9 seconds, but each time another process has the lock
  • at this point msvcrt.locking() gives up and we output the warning. Then we start again a bit later

Note that this does not mean that a single process had the lock for 9 seconds: there could have been a hundred different locks during that time.

I've left the 'Resource deadlock avoided' as a logger.warning() for added visibility: I would not expect it to be seen in normal usage (but we could make sure we don't print too many of those even in extreme cases)
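To make that concrete, the failure path is roughly this (a simplified sketch -- the helper name and warning wording only approximate the PR code):

import logging
import msvcrt
from typing import IO

logger = logging.getLogger(__name__)

def _try_lock(f: IO, path: str) -> bool:
    try:
        # LK_LOCK retries internally roughly once per second for up to 10
        # attempts, then raises OSError 36 ("Resource deadlock avoided")
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
        return True
    except OSError as e:
        logger.warning("Unsuccessful lock attempt for %s: %s", path, e)
        return False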

> All(!!!) 3 of the 3 test processes terminated like this:

I'm going to need a little more detail to reproduce this -- preferably a test case that runs on CI

@stefanberger

>> On Windows I get a lot of sequences of the errors shown below.

I've left the "Unsuccessful lock attempt" as a warning on Windows -- I don't work on windows so I don't really know how common this is but my assumption is it does not come up in normal use: more details below.

>> It looks like one process gets to lock the file in each of the test loops, while the others all have to wait for quite a while until another one takes over the lock for a while.

> Can you explain how this differs from what you expected? Overall that sounds like how file locking is supposed to work.

It's like one process is blocking all the other ones for a considerable amount of time, which is weird. I thought they would all get their fair share of access to the lock.

> If you are seeing a case where many updaters run successfully in one process while the updater in another process remains locked out for longer than seems "statistically" reasonable: yeah, that's how msvcrt has decided to implement locking(). It mostly sleeps and only rarely tries to get the lock -- this likely works fine in normal usage but looks odd when stress testing, as it seems like one process is just hogging the lock. In reality it's releasing the lock and then taking another lock while the other processes are still sleeping.

Exactly.

>> How many files are you locking, so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

> I'm not sure I understand the question but

> • access to all metadata files is protected with one lock on `<METADATA_DIR>/.lock`
> • artifact access is protected with individual locks on the artifact files themselves

> I believe that seeing 'Resource deadlock avoided' in convoluted test cases (that repeatedly start Updaters) is unavoidable: The typical way that 'Resource deadlock avoided' happens is

> • open unexpectedly succeeds even though another process has the file open already -- this happens when the system is under load (as it would be in a test case)
> • msvcrt.locking() tries 10 times to get a lock during 9 seconds, but each time another process has the lock

I guess the sleeping in user space explains it.

> • at this point msvcrt.locking() gives up and we output the warning. Then we start again a bit later

Ok.

> Note that this does not mean that a single process had the lock for 9 seconds: there could have been a hundred different locks during that time.

> I've left the 'Resource deadlock avoided' as a logger.warning() for added visibility: I would not expect it to be seen in normal usage (but we could make sure we don't print too many of those even in extreme cases)

>> All(!!!) 3 of the 3 test processes terminated like this:

> I'm going to need a little more detail to reproduce this -- preferably a test case that runs on CI

It happens with that sigstore loop I am using for testing but now it takes a very long time for this to happen.

@jku
Member Author

jku commented Sep 2, 2025

>> All(!!!) 3 of the 3 test processes terminated like this:

>> I'm going to need a little more detail to reproduce this -- preferably a test case that runs on CI

> It happens with that sigstore loop I am using for testing but now it takes a very long time for this to happen.

I'm not that worried if this only happens after obnoxious amounts of unrealistic load that can't be reproduced in CI: like I said, the Windows code is currently designed to fail eventually anyway, and I think that's fine.

That said, the error you mention seems unexpected, I can have a look... could you provide the full error? The snippet does not yet identify how this happens exactly.

@stefanberger

> That said, the error you mention seems unexpected, I can have a look... could you provide the full error? The snippet does not yet identify how this happens exactly.

This is the test I have been running in 3 Python interpreters for, I would say, 15 minutes or so:

>>> from sigstore import sign
... while True:
...     sign.TrustedRoot.production()
...

One of them broke as shown below. Currently, the other two tests are still running but have always broken in the same way, even the last one that basically runs 'alone'. I would say this is unlikely to be a bug from this PR.

<sigstore._internal.trust.TrustedRoot object at 0x000001E05273A6C0>
<sigstore._internal.trust.TrustedRoot object at 0x000001E05278D2C0>
Traceback (most recent call last):
  File "<python-input-4>", line 3, in <module>
    sign.TrustedRoot.production()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "C:\Users\StefanBerger\python-tuf\venv\Lib\site-packages\sigstore\_internal\trust.py", line 357, in production
    return cls.from_tuf(DEFAULT_TUF_URL, offline)
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\venv\Lib\site-packages\sigstore\_internal\trust.py", line 344, in from_tuf
    path = TrustUpdater(url, offline).get_trusted_root_path()
           ~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\venv\Lib\site-packages\sigstore\_internal\tuf.py", line 116, in __init__
    self._updater = Updater(
                    ~~~~~~~^
        metadata_dir=str(self._metadata_dir),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        bootstrap=root_json,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 146, in __init__
    self._persist_root(self._trusted_set.root.version, bootstrap)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 357, in _persist_root
    self._persist_file(str(rootdir / f"{version}.root.json"), data)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 376, in _persist_file
    raise e
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 369, in _persist_file
    os.replace(temp_file.name, filename)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\tmpgz9qivm6' -> 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\root_history\\12.root.json'

@jku
Member Author

jku commented Sep 2, 2025

Thanks. That is definitely a case where

  • we have a lock, so no other process should have any metadata files open
  • but we still get permission denied when doing os.replace()

Not amazing but unless we can reproduce that with some reasonable amount of load I wouldn't worry too much -- that file was definitely opened thousands if not tens of thousands of times in that test.

I do wonder if we should try to avoid unnecessary writes in these cases -- the initial root is the same one (in the 99.9% happy path) so we could just compare the content and avoid writing if it's the same...

EDIT: avoiding writes during init is not too hard. It does not mean any fewer open()s, but avoiding the writes feels right.
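Something along these lines, for example (a hypothetical Updater helper just to illustrate the idea, not the actual change):

def _persist_file_if_changed(self, filename: str, data: bytes) -> None:
    # Skip the tempfile + os.replace() dance when the content on disk is
    # already identical -- the common case for the bootstrap root.
    try:
        with open(filename, "rb") as f:
            if f.read() == data:
                return
    except OSError:
        pass  # missing or unreadable file: fall through and write normally
    self._persist_file(filename, data)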

@jku
Member Author

jku commented Sep 3, 2025

Status:

  • the "optimization" to avoid writing initial root and the symlink if they are correct already seems useful: I can do that
  • POSIX and Windows currently operate differently: POSIX will just wait for the lock for as long as it takes, while Windows will fail after some time if it does not get a lock.
    • I could make POSIX work like Windows (to avoid the possibility of waiting forever)... but that gives us the downsides as well: we would then just make non-blocking checks for "is the lock free now?" and sleep for most of the time (versus letting the OS wake us when the lock is available). The worst-case scenario (under unrealistic load) likely looks like what Stefan has described above for Windows
  • under enough load at least the Windows version will break: I'm not too worried about this, flawless execution at unrealistic loads is not a goal here. Writing/reading can't succeed absolutely every time.
  • I'm still a little worried about bugs though
  • Using filelock from tox-dev is still an option

@jku
Member Author

jku commented Sep 7, 2025

> POSIX and Windows currently operate differently: POSIX will just wait for the lock for as long as it takes, while Windows will fail after some time if it does not get a lock.

> • I could make POSIX work like Windows (to avoid the possibility of waiting forever)... but that gives us the downsides as well: we would then just make non-blocking checks for "is the lock free now?" and sleep for most of the time (versus letting the OS wake us when the lock is available). The worst-case scenario (under unrealistic load) likely looks like what Stefan has described above for Windows

I think I will at least try the non-blocking version on POSIX too: this makes the failure mode easier for users, as we can print the lock file path in the timeout error message so the user can just delete the file if they want (essentially overriding the current lock).
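A rough sketch of what that could look like (the timeout value and poll interval are placeholders, not decisions):

import fcntl
import time
from contextlib import contextmanager
from typing import IO, Iterator

@contextmanager
def lock_file(path: str, timeout: float = 30) -> Iterator[IO]:
    deadline = time.monotonic() + timeout
    with open(path, "wb") as f:
        while True:
            try:
                # non-blocking attempt; raises OSError if the lock is held
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        f"Timed out waiting for lock: delete {path} to override"
                    ) from None
                time.sleep(0.5)
        yield f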

Member

@lukpueh lukpueh left a comment


Great work, @jku!

I didn't find any problems in the lock implementation, i.e. the contextmanagers. The main question is whether we indeed cover all code regions that should be locked. But I think it's fine.

Regarding the timeout on POSIX, I probably wouldn't bother. Right now, if an updater process hangs, it won't time out either.

yield f

except ModuleNotFoundError:
# Windows file locking, in belt-and-suspenders-from-Temu style:
Member


😆

@contextmanager
def lock_file(path: str) -> Iterator[IO]:
    with open(path, "wb") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
Member


It took me a while to realise that closing the file releases the lock (even after reading fcntl docs). Maybe this is obvious to others. But a brief comment about how this works might be helpful.

for _ in range(100):
    try:
        with open(path, "wb") as f:
            msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
Member


Can you lock 1 byte, in an empty file? I guess you can.

# * msvcrt.locking() does not even block until file is available: it just
# tries once per second in a non-blocking manner for 10 seconds. So if
# another process keeps opening the file it's unlikely that we actually
# get the lock
Member


So this means the timeout could be anything between 10 and 40 seconds? Would it make sense to use LK_NBLCK instead of LK_LOCK to fully control the timeout?
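For illustration, an explicit deadline built on LK_NBLCK could look something like this (just a sketch of the idea):

import msvcrt
import time
from typing import IO

def lock_first_byte(f: IO, timeout: float = 30) -> None:
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LK_NBLCK fails immediately instead of retrying internally
            msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)
            return
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(0.5)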

return self._preorder_depth_first_walk(target_path)
with self._lock_metadata():
    if Targets.type not in self._trusted_set:
        # implicit refresh
Member


You're not calling refresh(), which has its own lock, so that the pre-order depth-first walk is included in the same lock, right?
