
Conversation

jku
Member

@jku jku commented Aug 29, 2025

Add advisory file locking to make it safer to run multiple updaters (using the same metadata directory) in separate processes. This should help with #2836

  • Metadata access is protected with <METADATA_DIR>/.lock: this lock is used in Updater.__init__(), Updater.refresh() and Updater.get_targetinfo() -- the lock is typically held through the whole method (in other words, while the metadata is being downloaded as well)
  • Artifact cache access is protected with a lock on the specific artifact file
  • No dependencies (whether this is a good idea remains to be seen, see later comments about complexity and alternatives)

@stefanberger comments welcome

Implementation notes

The implementation has just enough platform-specific complexity that I'm not totally happy with it. There's clearly a chance of bugs here... Unfortunately I'm not convinced that using a dependency would make that possibility go away.

  • The main complexity relates to the weird file access mechanism on Windows: msvcrt.locking() is the recommended API, but 99% of the time it's not useful because we need a file handle to call it, and we can't get one because another process already has the file open. So we need a dumb loop that keeps calling open() and sleep(). It's ugly, but no one seems to have a better solution (see the sketch after this list)
  • The core difference between the two implementations is this: the Windows implementation will time out in ~30 seconds if it does not get a lock, while the POSIX implementation will keep waiting in fcntl.lockf() until the lock becomes available. Both have a tradeoff: on Windows the updater might fail if another updater is a bit slow; on POSIX, if one updater hangs while holding the lock for some reason, all future updaters will just wait indefinitely until the original hung process is killed
  • There is a test that starts 10 processes, each running 50 metadata refreshes as fast as possible against the same metadata directory
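A condensed sketch of the two approaches described above (illustrative only, not the exact PR code -- retry counts, sleep intervals and exception handling are approximations):

import sys
import time
from contextlib import contextmanager
from typing import IO, Iterator

if sys.platform != "win32":
    import fcntl

    @contextmanager
    def lock_file(path: str) -> Iterator[IO]:
        with open(path, "wb") as f:
            # blocks until the lock is free; the lock is released when the
            # file is closed at the end of the with-block
            fcntl.lockf(f, fcntl.LOCK_EX)
            yield f
else:
    import msvcrt

    @contextmanager
    def lock_file(path: str) -> Iterator[IO]:
        # the "dumb loop": keep calling open() and sleep() until we both get
        # a handle and manage to lock it
        for _ in range(100):
            try:
                f = open(path, "wb")
            except OSError:
                time.sleep(0.3)  # another process has the file open
                continue
            try:
                msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
            except OSError:
                # e.g. "Resource deadlock avoided" after locking() gives up
                f.close()
                time.sleep(0.3)
                continue
            try:
                yield f
            finally:
                f.close()  # closing the file releases the lock
            return
        raise TimeoutError(f"Could not lock {path}")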

TODO

  • I think the POSIX locking should probably time out as well if someone hogs the lock for long enough.
  • On the other hand, someone somewhere will always hit the timeout (even though it would work some time later), whatever value we set the timeout to :(

Alternatives

https://github.com/tox-dev/filelock/ looks reasonable -- although artifact locking will be either more complicated or less parallel since you can't use filelock on the actual file you want to write.
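For reference, roughly what a filelock-based artifact lock might look like (an illustrative sketch; the sibling ".lock" path and the 60 second timeout are arbitrary choices, not part of this PR):

from filelock import FileLock, Timeout

from tuf.api.metadata import TargetFile
from tuf.ngclient import Updater


def download_artifact(updater: Updater, targetinfo: TargetFile, artifact_path: str) -> None:
    # filelock locks a separate sibling file, so the artifact itself stays
    # unlocked: processes that bypass this helper could still touch it.
    lock = FileLock(artifact_path + ".lock", timeout=60)
    try:
        with lock:
            updater.download_target(targetinfo, filepath=artifact_path)
    except Timeout:
        raise RuntimeError(f"Could not acquire {artifact_path}.lock within 60s") from None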

jku added 10 commits August 22, 2025 18:37
This likely fails on all platforms right now, but the Windows
behaviour cannot be fixed without actual locking.

Signed-off-by: Jussi Kukkonen <[email protected]>
Use get_targetinfo() so that the delegated role loading is tested as
well

Signed-off-by: Jussi Kukkonen <[email protected]>
This should prevent issues with multiple processes trying to write at the
same time.

Signed-off-by: Jussi Kukkonen <[email protected]>
Signed-off-by: Jussi Kukkonen <[email protected]>
Otherwise another process might delete the file underneath us

Signed-off-by: Jussi Kukkonen <[email protected]>
There does not seem to be a way around an ugly loop over open()...

Signed-off-by: Jussi Kukkonen <[email protected]>
The file locking should make multiple processes safe

Signed-off-by: Jussi Kukkonen <[email protected]>
@jku jku requested a review from a team as a code owner August 29, 2025 08:05
@jku jku marked this pull request as draft August 29, 2025 08:05
@coveralls

Coverage Status

coverage: 95.316% (-1.3%) from 96.603%
when pulling 55dbb53 on advisory-locking
into 7ad10ad on develop.

@lukpueh
Member

lukpueh commented Aug 29, 2025 via email

@stefanberger

On Windows I get a lot of sequences of the errors shown below. It looks like one process gets to lock the file in each one of the test loops but the other ones all have to wait for quite a while until another one starts locking for a while then.

How many files are you locking, so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

Other than that I see no errors, so that's good.

(venv) C:\Users\StefanBerger\python-tuf>python
Python 3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from sigstore import sign
... while True:
...     sign.TrustedRoot.production()
...
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided

@stefanberger

Update... All(!!!) 3 of the 3 test processes terminated like this:

  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 376, in _persist_file
    raise e
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 369, in _persist_file
    os.replace(temp_file.name, filename)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\tmpzxnsj1mu' -> 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\root_history\\12.root.json'

    yield f
    return
except FileNotFoundError:
    # could be from yield or from open() -- either way we bail


You are opening the file for writing. Should this ever lead to a FileNotFoundError?

Member Author

@jku jku Sep 1, 2025


This can come from either:

  • yield (another part of the Updater code did not find some file when it expected one)
  • or the open() a few lines above, if the parent directory does not exist (something that mostly happens in test cases, since we do create the directory slightly before this... but it can happen)

@jku
Member Author

jku commented Sep 1, 2025

> On Windows I get a lot of sequences of the errors shown below.

I've left the "Unsuccessful lock attempt" as a warning on Windows -- I don't work on windows so I don't really know how common this is but my assumption is it does not come up in normal use: more details below.

> It looks like one process gets to lock the file in each of the test loops, while the others all have to wait for quite a while until another one takes over the lock for a while.

Can you explain how this differs from what you expected? Overall that sounds like how file locking is supposed to work.

If you are seeing a case where many updaters run successfully in one process while the updater in another process remains locked out for longer than seems "statistically" reasonable: yeah, that's how msvcrt has decided to implement locking(). It mostly sleeps and only rarely tries to get the lock -- this likely works fine in normal usage but looks odd when stress testing, as it seems like one process is just hogging the lock. In reality it's releasing the lock and then taking another lock while the other processes are still sleeping.

> How many files are you locking, so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

I'm not sure I understand the question but

  • access to all metadata files is protected with one lock on <METADATA_DIR>/.lock
  • artifact access is protected with individual locks on the artifact files themselves

I believe that seeing 'Resource deadlock avoided' in convoluted test cases (that repeatedly start Updaters) is unavoidable: The typical way that 'Resource deadlock avoided' happens is

  • open unexpectedly succeeds even though another process has the file open already -- this happens when the system is under load (as it would be in a test case)
  • msvcrt.locking() tries 10 times to get a lock during 9 seconds, but each time another process has the lock
  • at this point msvcrt.locking() gives up and we output the warning. Then we start again a bit later

Note that this does not mean that a single process had the lock for 9 seconds: there could have been a hundred different locks during that time.

I've left the 'Resource deadlock avoided' as a logger.warning() for added visibility: I would not expect it to be seen in normal usage (but we could make sure we don't print too many of those even in extreme cases)
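To make that concrete, the failure path is roughly this (a simplified sketch -- the helper name and warning wording only approximate the PR code):

import logging
import msvcrt
from typing import IO

logger = logging.getLogger(__name__)

def _try_lock(f: IO, path: str) -> bool:
    try:
        # LK_LOCK retries internally roughly once per second for up to 10
        # attempts, then raises OSError 36 ("Resource deadlock avoided")
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
        return True
    except OSError as e:
        logger.warning("Unsuccessful lock attempt for %s: %s", path, e)
        return False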

> All(!!!) 3 of the 3 test processes terminated like this:

I'm going to need a little more detail to reproduce this -- preferably a test case that runs on CI

@stefanberger

>> On Windows I get a lot of sequences of the errors shown below.

I've left the "Unsuccessful lock attempt" as a warning on Windows -- I don't work on windows so I don't really know how common this is but my assumption is it does not come up in normal use: more details below.

>> It looks like one process gets to lock the file in each of the test loops, while the others all have to wait for quite a while until another one takes over the lock for a while.

> Can you explain how this differs from what you expected? Overall that sounds like how file locking is supposed to work.

It's like one process is blocking all the other ones for a considerable amount of time, which is weird. I thought they would all get their fair share of access to the lock.

> If you are seeing a case where many updaters run successfully in one process while the updater in another process remains locked out for longer than seems "statistically" reasonable: yeah, that's how msvcrt has decided to implement locking(). It mostly sleeps and only rarely tries to get the lock -- this likely works fine in normal usage but looks odd when stress testing, as it seems like one process is just hogging the lock. In reality it's releasing the lock and then taking another lock while the other processes are still sleeping.

Exactly.

>> How many files are you locking, so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

> I'm not sure I understand the question but

> • access to all metadata files is protected with one lock on `<METADATA_DIR>/.lock`
> • artifact access is protected with individual locks on the artifact files themselves

> I believe that seeing 'Resource deadlock avoided' in convoluted test cases (that repeatedly start Updaters) is unavoidable: The typical way that 'Resource deadlock avoided' happens is

> • open unexpectedly succeeds even though another process has the file open already -- this happens when the system is under load (as it would be in a test case)
> • msvcrt.locking() tries 10 times to get a lock during 9 seconds, but each time another process has the lock

I guess the sleeping in user space explains it.

> • at this point msvcrt.locking() gives up and we output the warning. Then we start again a bit later

Ok.

> Note that this does not mean that a single process had the lock for 9 seconds: there could have been a hundred different locks during that time.

> I've left the 'Resource deadlock avoided' as a logger.warning() for added visibility: I would not expect it to be seen in normal usage (but we could make sure we don't print too many of those even in extreme cases)

>> All(!!!) 3 of the 3 test processes terminated like this:

> I'm going to need a little more detail to reproduce this -- preferably a test case that runs on CI

It happens with that sigstore loop I am using for testing but now it takes a very long time for this to happen.

@jku
Member Author

jku commented Sep 2, 2025

>> All(!!!) 3 of the 3 test processes terminated like this:

>> I'm going to need a little more detail to reproduce this -- preferably a test case that runs on CI

> It happens with that sigstore loop I am using for testing but now it takes a very long time for this to happen.

I'm not that worried if this only happens after obnoxious amounts of unrealistic load that can't be reproduced in CI: like I said, the Windows code is currently designed to fail eventually anyway, and I think that's fine.

That said, the error you mention seems unexpected, I can have a look... could you provide the full error? The snippet does not yet identify how this happens exactly.

@stefanberger

> That said, the error you mention seems unexpected, I can have a look... could you provide the full error? The snippet does not yet identify how this happens exactly.

This is the test I have been running in 3 Python interpreters for, I would say, 15 minutes or so:

>>> from sigstore import sign
... while True:
...     sign.TrustedRoot.production()
...

One of them broke as shown below. Currently, the other two tests are still running but have always broken in the same way, even the last one that basically runs 'alone'. I would say this is unlikely to be a bug from this PR.

<sigstore._internal.trust.TrustedRoot object at 0x000001E05273A6C0>
<sigstore._internal.trust.TrustedRoot object at 0x000001E05278D2C0>
Traceback (most recent call last):
  File "<python-input-4>", line 3, in <module>
    sign.TrustedRoot.production()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "C:\Users\StefanBerger\python-tuf\venv\Lib\site-packages\sigstore\_internal\trust.py", line 357, in production
    return cls.from_tuf(DEFAULT_TUF_URL, offline)
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\venv\Lib\site-packages\sigstore\_internal\trust.py", line 344, in from_tuf
    path = TrustUpdater(url, offline).get_trusted_root_path()
           ~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\venv\Lib\site-packages\sigstore\_internal\tuf.py", line 116, in __init__
    self._updater = Updater(
                    ~~~~~~~^
        metadata_dir=str(self._metadata_dir),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        bootstrap=root_json,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 146, in __init__
    self._persist_root(self._trusted_set.root.version, bootstrap)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 357, in _persist_root
    self._persist_file(str(rootdir / f"{version}.root.json"), data)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 376, in _persist_file
    raise e
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 369, in _persist_file
    os.replace(temp_file.name, filename)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\tmpgz9qivm6' -> 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\root_history\\12.root.json'

@jku
Member Author

jku commented Sep 2, 2025

Thanks. That is definitely a case where

  • we have a lock, so no other process should have any metadata files open
  • but we still get permission denied when doing os.replace()

Not amazing but unless we can reproduce that with some reasonable amount of load I wouldn't worry too much -- that file was definitely opened thousands if not tens of thousands of times in that test.

I do wonder if we should try to avoid unnecessary writes in these cases -- the initial root is the same one (in the 99.9% happy path) so we could just compare the content and avoid writing if it's the same...

EDIT: avoiding writes during init is not too hard. It does not mean any fewer open()s, but avoiding the writes feels right.
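Something along these lines, for example (a hypothetical Updater helper just to illustrate the idea, not the actual change):

def _persist_file_if_changed(self, filename: str, data: bytes) -> None:
    # Skip the tempfile + os.replace() dance when the content on disk is
    # already identical -- the common case for the bootstrap root.
    try:
        with open(filename, "rb") as f:
            if f.read() == data:
                return
    except OSError:
        pass  # missing or unreadable file: fall through and write normally
    self._persist_file(filename, data)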

@jku
Member Author

jku commented Sep 3, 2025

Status:

  • the "optimization" to avoid writing initial root and the symlink if they are correct already seems useful: I can do that
  • POSIX and Windows currently operate differently: POSIX will just wait for the lock for as long as it takes, while Windows will fail after some time if it does not get a lock.
    • I could make POSIX work like Windows (to avoid the possibility of waiting forever)... but that gives us the downsides as well: we would then just make non-blocking checks for "is the lock free now?" and sleep for most of the time (versus letting the OS wake us when the lock is available). The worst-case scenario (under unrealistic load) likely looks like what Stefan has described above for Windows
  • under enough load at least the Windows version will break: I'm not too worried about this, flawless execution at unrealistic loads is not a goal here. Writing/reading can't succeed absolutely every time.
  • I'm still a little worried about bugs though
  • Using filelock from tox-dev is still an option

@jku
Member Author

jku commented Sep 7, 2025

> POSIX and Windows currently operate differently: POSIX will just wait for the lock for as long as it takes, while Windows will fail after some time if it does not get a lock.

> • I could make POSIX work like Windows (to avoid the possibility of waiting forever)... but that gives us the downsides as well: we would then just make non-blocking checks for "is the lock free now?" and sleep for most of the time (versus letting the OS wake us when the lock is available). The worst-case scenario (under unrealistic load) likely looks like what Stefan has described above for Windows

I think I will at least try the non-blocking version on POSIX too: this makes the failure mode easier for users, as we can print the lock file path in the timeout error message so the user can just delete the file if they want (essentially overriding the current lock).
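A rough sketch of what that could look like (the timeout value and poll interval are placeholders, not decisions):

import fcntl
import time
from contextlib import contextmanager
from typing import IO, Iterator

@contextmanager
def lock_file(path: str, timeout: float = 30) -> Iterator[IO]:
    deadline = time.monotonic() + timeout
    with open(path, "wb") as f:
        while True:
            try:
                # non-blocking attempt; raises OSError if the lock is held
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        f"Timed out waiting for lock: delete {path} to override"
                    ) from None
                time.sleep(0.5)
        yield f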

Member

@lukpueh lukpueh left a comment


Great work, @jku!

I didn't find any problems in the lock implementation, i.e. the contextmanagers. The main question is whether we indeed cover all code regions that should be locked. But I think it's fine.

Regarding the timeout on POSIX, I probably wouldn't bother. Right now, if an updater process hangs, it won't time out either.

yield f

except ModuleNotFoundError:
# Windows file locking, in belt-and-suspenders-from-Temu style:
Member


😆

@contextmanager
def lock_file(path: str) -> Iterator[IO]:
    with open(path, "wb") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
Member


It took me a while to realise that closing the file releases the lock (even after reading fcntl docs). Maybe this is obvious to others. But a brief comment about how this works might be helpful.

for _ in range(100):
    try:
        with open(path, "wb") as f:
            msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
Member


Can you lock 1 byte, in an empty file? I guess you can.

# * msvcrt.locking() does not even block until file is available: it just
# tries once per second in a non-blocking manner for 10 seconds. So if
# another process keeps opening the file it's unlikely that we actually
# get the lock
Member


So this means the timeout could be anything between 10 and 40 seconds? Would it make sense to use LK_NBLCK instead of LK_LOCK to fully control the timeout?
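For illustration, an explicit deadline built on LK_NBLCK could look something like this (just a sketch of the idea):

import msvcrt
import time
from typing import IO

def lock_first_byte(f: IO, timeout: float = 30) -> None:
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LK_NBLCK fails immediately instead of retrying internally
            msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)
            return
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(0.5)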

return self._preorder_depth_first_walk(target_path)
with self._lock_metadata():
    if Targets.type not in self._trusted_set:
        # implicit refresh
Member


You're not calling refresh(), which has its own lock, so that the pre-order depth-first walk is included in the same lock, right?
