Revisit file rehashing policy #1062

@wlandau

Prework

  • I understand and agree to the help guide.
  • I understand and agree to the contributing guide.
  • New features take time and effort to create, and they take even more effort to maintain. So if the purpose of the feature is to resolve a struggle you are encountering personally, please consider first posting a "trouble" or "other" issue so we can discuss your use case and search for existing solutions first.

Problem

The current file rehashing policy is coded in file_should_rehash():

targets/R/class_file.R

Lines 58 to 63 in c64a7e8

file_should_rehash <- function(file, time, size, bytes) {
  small <- bytes < file_small_bytes
  touched <- !identical(time, file$time)
  resized <- !identical(size, file$size)
  small || touched || resized
}

In particular, small files are always rehashed, which is a bottleneck for pipelines with large numbers of small files. Because time stamps have low resolution on some platforms (e.g. Windows), they are only trusted when the file is large.
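To make the behavior concrete, here is a minimal sketch that runs the policy above against a stand-in file record. The value assigned to file_small_bytes and the shape of the record list are assumptions for illustration only; the real threshold and file object are not shown in this issue.

```r
# Hypothetical threshold: the real value of file_small_bytes is not shown here.
file_small_bytes <- 1e5

# Same logic as the policy quoted above.
file_should_rehash <- function(file, time, size, bytes) {
  small <- bytes < file_small_bytes
  touched <- !identical(time, file$time)
  resized <- !identical(size, file$size)
  small || touched || resized
}

# Stand-in for a stored file record (assumed fields: time and size).
record <- list(time = "t0", size = "s0")

# A small file is rehashed even though its time stamp and size are unchanged.
small_unchanged <- file_should_rehash(record, time = "t0", size = "s0", bytes = 100)

# A large file with the same time stamp and size is skipped.
large_unchanged <- file_should_rehash(record, time = "t0", size = "s0", bytes = 1e9)

# A large file with a new time stamp is rehashed.
large_touched <- file_should_rehash(record, time = "t1", size = "s0", bytes = 1e9)
```

The first case is the bottleneck: with many small files, every one of them is rehashed on every check regardless of its metadata.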

Proposal

For small files in _targets/objects/, which the user should not modify by hand, I propose we try to avoid rehashing them. We might just compare the modification time to Sys.time() and trust the time stamp if it is more than a second old. We could reduce this threshold on non-Windows machines. It would be good to revisit the actual timestamp resolution on various platforms.
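A rough sketch of what this could look like, under stated assumptions: file_time_is_trusted() and file_should_rehash_object() are hypothetical names, not existing functions in the package, and the one-second default is just the threshold suggested above, not a measured timestamp resolution.

```r
# Hypothetical helper: trust a time stamp only if the file was last
# modified more than threshold_seconds before `now`.
file_time_is_trusted <- function(path, now = Sys.time(), threshold_seconds = 1) {
  modified <- file.mtime(path)
  if (is.na(modified)) {
    return(FALSE) # Missing file: there is no time stamp to trust.
  }
  age <- as.numeric(difftime(now, modified, units = "secs"))
  age > threshold_seconds
}

# Hypothetical variant of the policy for files in _targets/objects/:
# the size-based "small" check is replaced by a "recently modified" check.
file_should_rehash_object <- function(file, path, time, size, now = Sys.time()) {
  touched <- !identical(time, file$time)
  resized <- !identical(size, file$size)
  recent <- !file_time_is_trusted(path, now = now)
  recent || touched || resized
}

# Usage with a temporary file and a stand-in record (assumed fields).
tmp <- tempfile()
writeLines("x", tmp)
record <- list(time = "t0", size = "s0")

# Checked at the moment of modification: too recent, so rehash.
rehash_now <- file_should_rehash_object(
  record, tmp, time = "t0", size = "s0", now = file.mtime(tmp)
)

# Checked ten seconds later with unchanged metadata: skip the rehash.
rehash_later <- file_should_rehash_object(
  record, tmp, time = "t0", size = "s0", now = file.mtime(tmp) + 10
)
```

Passing now as an argument keeps the check deterministic in tests. A per-platform threshold could be supplied through threshold_seconds once the actual timestamp resolutions are known.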

I plan to keep the existing policy for format = "file" because those files are controlled by the user, and it is harder to make the simplifying assumptions required for more nuanced cache invalidation.
