Implement nested four-level TeX cache #24699
Merged
I'm not following this fix - it seems to just make 8 new directories every time a basefile is asked for. I don't think that will help performance, but maybe I'm not understanding something. How are you testing this?
This is one of the proposed solutions from the issue; it doesn't fix the problem of the cache's size being unbounded, but it solves the other problem of having too many files in the same single directory, which the issue mentions can be a bottleneck.
`make_dvi` and the other functions use the path returned by `get_basefile` to cache/retrieve the file, so instead of all the files being stored in `~/.cache/matplotlib/tex.cache/` (or wherever the cache is specified in the config), it will store them as `~/.cache/matplotlib/tex.cache/xx/xx/xx/xx/`, where the `x`s are the first 8 letters of the file hash.
As for unit tests, I wasn't sure how to go about implementing them because the cache is never cleared, so the `TexManager` module could just be returning cached files from a previous run, instead of creating them (I assume we want to test both creation and retrieval). Would it be alright to use something like `shutil.rmtree` in the tests on the cache directory to force the creation of the files?
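To make the layout described above concrete, here is a minimal sketch of deriving a nested cache path from a hash. The helper name `nested_cache_path`, its parameters, and the use of SHA-256 are assumptions for illustration only; this is not the actual `get_basefile` implementation.

```python
import hashlib
from pathlib import Path


def nested_cache_path(cache_dir, key, levels=4, width=2):
    """Return cache_dir/xx/xx/.../<digest> for `key` (hypothetical helper).

    The hash function is an assumption; the real cache may use another one.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Split the first levels * width hex characters into directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(cache_dir).joinpath(*parts, digest)


# With the defaults, the four directory names are the first 8 hash characters.
print(nested_cache_path("tex.cache", r"\frac{1}{2}"))
```

Passing `levels=2` gives the shallower two-level layout with the same helper.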
OK - 4 deep is potentially 65k directories, most likely with only one file in each. Do we really want that overhead? Even just two-deep would probably be fine? That's around 8 million files.
I see. I only did four levels based on the suggestion in the issue. To make sure I understand correctly, the 8 million number comes from 16^2 * 32000 (since 32k is where the bottlenecks begin according to the issue discussion)? I will also amend my commit to change to two levels.
I wrote a test case that checks the path of the returned file from `make_dvi` contains the correct levels. I'm not sure of a good way to "clear" the cache in the test case without messing with multithreading and locks, so I have left it alone for now.
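The directory and file counts in this exchange depend on both the nesting depth and the number of hex characters per level. The sketch below only reproduces the arithmetic; the function names are hypothetical, and the 32,000 files-per-directory threshold is the figure quoted from the issue discussion, not a measured limit.

```python
HEX_DIGITS = 16
FILES_PER_DIR_LIMIT = 32_000  # threshold quoted from the issue discussion


def leaf_directories(levels, chars_per_level):
    """Number of distinct leaf directories a hex-prefix layout can produce."""
    return HEX_DIGITS ** (levels * chars_per_level)


def capacity(levels, chars_per_level, per_dir=FILES_PER_DIR_LIMIT):
    """Files that fit before any leaf directory exceeds `per_dir` entries."""
    return leaf_directories(levels, chars_per_level) * per_dir


print(capacity(2, 1))          # 16**2 * 32000 = 8192000
print(leaf_directories(4, 1))  # 16**4 = 65536
print(leaf_directories(4, 2))  # 16**8 = 4294967296
```

Note that four levels of two characters each (the `xx/xx/xx/xx` layout) allow far more than 65k leaf directories, although directories are only created for hashes that actually occur.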
I'm not an expert on these caches - hopefully @tacaswell or @anntzer will clarify as I think they are both familiar.
I agree that clearing the cache during a process seems fraught. It might be desirable to touch files as they are used, and then clear old files at the beginning of the process before any race conditions can arise. Not sure if there is a huge performance penalty to touching the cache files, but it seems easier than an SQL database to me.
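The touch-then-prune idea could be sketched roughly as follows. Both function names and the age threshold are hypothetical, and this sketch does no locking, so it assumes pruning runs once at startup before other users of the cache are active.

```python
import os
import time
from pathlib import Path


def touch_on_use(path):
    """Set the file's access and modification times to now."""
    os.utime(path, None)


def prune_old_files(cache_dir, max_age_seconds, now=None):
    """Delete files not touched within `max_age_seconds`; return the count."""
    now = time.time() if now is None else now
    removed = 0
    # Materialize the listing first so deletions do not affect the traversal.
    for p in list(Path(cache_dir).rglob("*")):
        try:
            if p.is_file() and now - p.stat().st_mtime > max_age_seconds:
                p.unlink()
                removed += 1
        except FileNotFoundError:
            # Another process removed the file first; nothing left to do.
            pass
    return removed
```

A call such as `prune_old_files(cache_dir, 30 * 24 * 3600)` at startup would drop files unused for about a month; empty directories are left in place in this sketch.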
We should only make folders when there is actually a file in them (and it looks like this is the implementation).
The (main) thing that filesystems have major performance issues with is many files or folders in the same folder, rather than the absolute number of files / folders. That is, having 1M files in a single folder is way harder on the file system than 3 layers of 100.
As a reference, `git` does this in `.git/objects` to one layer of 2 characters (but many of the objects are in packfiles, and I think it makes sure that the total number never gets too big).
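Creating a folder only when a file is about to be written into it can be sketched as below, using a git-like single level of two hex characters. The `store` helper and the choice of SHA-256 are assumptions for illustration; git's loose-object store uses its own object hashing and compression.

```python
import hashlib
from pathlib import Path


def store(cache_dir, data, width=2):
    """Write `data` under cache_dir/<first `width` hex chars>/<rest of hash>."""
    digest = hashlib.sha256(data).hexdigest()
    target = Path(cache_dir, digest[:width], digest[width:])
    # The subdirectory is created lazily, only when a file goes into it.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

With this approach an empty cache contains no subdirectories at all, and the number of directories never exceeds the number of cached files.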
I guess I was concerned about 16kb of directory metadata for each file group that often only take up <1kb each.
I think trying to clean up the cache in the middle of a test run could be pretty nasty (I don't know if our process-level caches can handle that, in fact) and I'd say that we should not require tests here (the fact that the rest of the test suite -- specifically usetex tests -- works at all is good enough).