Description
Prework
- I understand and agree to help guide.
- I understand and agree to contributing guide.
- New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.
Problem
The current file rehashing policy is coded in file_should_rehash():
Lines 58 to 63 in c64a7e8
file_should_rehash <- function(file, time, size, bytes) {
  small <- bytes < file_small_bytes
  touched <- !identical(time, file$time)
  resized <- !identical(size, file$size)
  small || touched || resized
}
In particular, small files are always rehashed, which is a bottleneck for pipelines with large numbers of small files. Because time stamps have low resolution on some platforms (e.g. Windows), they are only trusted when the file is large.
Proposal
For small files in _targets/objects/, which the user should not modify by hand, I propose we try to avoid rehashing them. We might just compare the modification time to Sys.time() and trust the time stamp if it is more than a second old. We could reduce this threshold on non-Windows machines. It would be good to revisit the actual timestamp resolution on various platforms.
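To make the idea concrete, here is a rough sketch of what such a policy might look like. It is not the targets implementation: the function name file_should_rehash_trusting() and the arguments now and trust_after_seconds are hypothetical, and the sketch assumes time and file$time are POSIXct modification times, which may not match how the package actually stores them.

```r
# Hypothetical sketch of the proposed policy, not code from targets.
# Assumes `time` and `file$time` are POSIXct modification times and
# `size` and `file$size` are comparable with identical().
file_should_rehash_trusting <- function(
  file,
  time,
  size,
  now = Sys.time(),
  trust_after_seconds = 1
) {
  # Same checks as the existing policy, minus the "small" shortcut.
  touched <- !identical(time, file$time)
  resized <- !identical(size, file$size)
  # A time stamp younger than the threshold could hide a second write
  # within the same clock tick, so do not trust it yet.
  age <- as.numeric(difftime(now, time, units = "secs"))
  too_recent <- age < trust_after_seconds
  touched || resized || too_recent
}
```

With this shape, a small file is only rehashed if its time stamp or size changed, or if it was modified too recently for the time stamp to be trusted. The threshold is an argument so that it could be lowered on platforms with finer timestamp resolution.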
I plan to keep the existing policy for format = "file" because those files are controlled by the user and it is harder to make the required simplifying assumptions for more nuanced cache invalidation.