Implement nested four-level TeX cache #24699
Conversation
Hi @jpjepko - welcome to Matplotlib! Did you mean to keep this in draft, or is this ready for review?

Hi @melissawm, thank you for the warm welcome! Yes, I am ready to submit for review; I will mark it as such.
```python
nested_folders = nested_folders / Path(filehash[i:i+2])
...
filepath = (Path(cls.texcache) / nested_folders)
filepath.mkdir(parents=True, exist_ok=True)
```
I'm not following this fix - it seems to just make 8 new directories every time a basefile is asked for. I don't think that will help performance, but maybe I'm not understanding something. How are you testing this?
This is one of the proposed solutions from the issue; it doesn't fix the problem of the cache's size being unbounded, but it solves the other problem of having too many files in a single directory, which the issue mentions can be a bottleneck. `make_dvi` and the other functions use the path returned by `get_basefile` to cache/retrieve the file, so instead of all the files being stored in `~/.cache/matplotlib/tex.cache/` (or wherever the cache is specified in the config), they will be stored as `~/.cache/matplotlib/tex.cache/xx/xx/xx/xx/`, where the `x`s are the first 8 letters of the file hash.

As for unit tests, I wasn't sure how to go about implementing them, because the cache is never cleared, so the `TexManager` module could just be returning cached files from a previous run instead of creating them (I assume we want to test both creation and retrieval). Would it be alright to use something like `shutil.rmtree` on the cache directory in the tests to force the creation of the files?
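The sharding scheme described in the comment above can be sketched as a stand-alone helper (hypothetical names; the real change lives inside `TexManager.get_basefile`, so this is only an illustration of the path layout):

```python
import hashlib
from pathlib import Path

def nested_cache_path(texcache, tex, levels=4):
    """Hypothetical sketch: shard a cache entry by the first 2*levels hex
    characters of its md5 hash, i.e. texcache/xx/xx/xx/xx/<full hash>."""
    filehash = hashlib.md5(tex.encode("utf-8")).hexdigest()
    shards = [filehash[i:i + 2] for i in range(0, 2 * levels, 2)]
    return Path(texcache).joinpath(*shards, filehash)
```

Each 2-character component gives 256-way fan-out per level, so no single directory accumulates all the cache files.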
OK - 4 deep is potentially 65k directories, most likely with only one file in each. Do we really want that overhead? Even just two-deep would probably be fine? That's around 8 million files.
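The figures in this exchange can be reproduced with quick back-of-the-envelope arithmetic (a sketch of the reasoning only; it assumes each level is counted as 16-way fan-out, and ~32k files per directory as the slowdown threshold mentioned in the issue):

```python
# Back-of-the-envelope numbers from the discussion above.
# Assumption: each level is counted as one hex character of fan-out.
fanout = 16

# "4 deep is potentially 65k directories": leaf directory count
four_deep_dirs = fanout ** 4            # 65536

# "two-deep ... around 8 million files": leaf directories times ~32k
# files each (the per-directory count where slowdowns reportedly begin)
files_per_dir = 32_000
two_deep_capacity = fanout ** 2 * files_per_dir   # 8,192,000
```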
I see. I only did four levels based on the suggestion in the issue. To make sure I understand correctly, does the 8 million number come from 16^2 * 32000 (since 32k is where the bottlenecks begin, according to the issue discussion)? I will also amend my commit to change to two levels.

I wrote a test case that checks that the path of the file returned by `make_dvi` contains the correct levels. I'm not sure of a good way to "clear" the cache in the test case without messing with multithreading and locks, so I have left it alone for now.
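A test along the lines described above can check the nesting depth without touching the real cache at all. This is a self-contained sketch using a stand-in for the PR's `get_basefile` (the helper and its names are illustrative, not the PR's actual test):

```python
import hashlib
import tempfile
from pathlib import Path

def fake_get_basefile(texcache, tex, levels=2):
    # Stand-in for the PR's get_basefile: shard by md5-hash prefix.
    h = hashlib.md5(tex.encode("utf-8")).hexdigest()
    shards = [h[i:i + 2] for i in range(0, 2 * levels, 2)]
    path = Path(texcache, *shards)
    path.mkdir(parents=True, exist_ok=True)
    return str(path / h)

def test_basefile_is_nested():
    with tempfile.TemporaryDirectory() as cache:
        rel = Path(fake_get_basefile(cache, "$x^2$")).relative_to(cache)
        assert len(rel.parts) == 3          # two shard dirs + hash filename
        assert all(len(p) == 2 for p in rel.parts[:-1])

test_basefile_is_nested()
```

Using a `TemporaryDirectory` as the cache root sidesteps the "cache is never cleared" problem entirely: each test run starts empty, so creation is always exercised.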
I'm not an expert on these caches - hopefully @tacaswell or @anntzer will clarify as I think they are both familiar.
I agree that clearing the cache during a process seems fraught. It might be desirable to touch files as they are used, and then clear old files at the beginning of the process, before any race conditions can set up. I'm not sure if there is a huge performance penalty to touching the cache files, but it seems easier than an SQL database to me.
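The "touch on use, prune at startup" idea could look roughly like this (hypothetical helpers, not part of the PR; the age threshold is an arbitrary placeholder):

```python
import os
import time
from pathlib import Path

def touch_on_use(path):
    # Update the mtime so recently used cache entries look "fresh".
    os.utime(path, None)

def prune_cache(cache_dir, max_age_days=30.0):
    """Remove cache files untouched for max_age_days. Intended to run once
    at process startup, before any workers can race on the cache."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for p in Path(cache_dir).rglob("*"):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink(missing_ok=True)
            removed += 1
    return removed
```

An `os.utime` call per cache hit is a single metadata write, which is why it is plausibly cheaper than maintaining a separate database of access times.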
We should only make folders when there is actually a file in them (and it looks like that is what this implementation does).

The (main) thing that filesystems have major performance issues with is many files or folders in the same directory, rather than the absolute number of files/folders. That is, having 1M files in a single folder is much harder on the filesystem than 3 layers of 100.

As a reference, `git` does this in `.git/objects` with one layer of 2 characters (but many of the objects are in packfiles, and I think it makes sure that the total number never gets too big).
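For comparison with the scheme in this PR, git's single-layer sharding can be sketched like this (illustrative only; real git also zlib-compresses objects and migrates them into packfiles):

```python
import hashlib

def git_loose_object_path(data):
    """Where git would store a loose blob: .git/objects/<2 hex>/<38 hex>.
    The object id is the sha1 of a 'blob <size>\\0' header plus the data."""
    sha = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
    return f".git/objects/{sha[:2]}/{sha[2:]}"
```

One 2-character layer gives 256-way fan-out, which git supplements with packfiles to keep the loose-object count per directory small.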
I guess I was concerned about 16 kB of directory metadata for file groups that often only take up <1 kB each.
I think trying to clean up the cache in the middle of a test run could be pretty nasty (I don't know if our process-level caches can handle that, in fact) and I'd say that we should not require tests here (the fact that the rest of the test suite -- specifically usetex tests -- works at all is good enough).
Force-pushed from 2bb115c to c7699a0
lib/matplotlib/texmanager.py (outdated)
```diff
-        return os.path.join(
-            cls.texcache, hashlib.md5(src.encode('utf-8')).hexdigest())
+        return os.path.join(
+            os.path.join(cls.texcache, nested_folders), filehash)
```
We should drop the redundant characters?

Any amount of this is going to help push the problem out; I am not too worried about the details between 2 levels and 4 levels unless we have benchmarks that show one is much worse.
Force-pushed from c7699a0 to a948056
I have a mild preference for dropping the leading characters that are used in the path, but not enough to block merging over it.
Prevents putting too many files in a single folder
Force-pushed from a948056 to 631b286
Thanks @jpjepko! Congratulations on your first PR to Matplotlib 🎉 We hope to hear from you again.
PR Summary

This PR is a fix for issue #23779. I decided to try implementing the first suggested solution: to use a nested 2-letter, 4-level folder hierarchy. This addresses the problem of putting too many files in a single directory, but the cache still has unbounded space.
PR Checklist

Documentation and Tests

- [ ] Has pytest style unit tests (and `pytest` passes)

Release Notes

- [ ] New features are marked with a `.. versionadded::` directive in the docstring and documented in `doc/users/next_whats_new/`
- [ ] API changes are marked with a `.. versionchanged::` directive in the docstring and documented in `doc/api/next_api_changes/`
- [ ] Release notes conform with instructions in `next_whats_new/README.rst` or `next_api_changes/README.rst`