Implement nested four-level TeX cache #24699

Merged · 1 commit · Dec 22, 2022
11 changes: 9 additions & 2 deletions lib/matplotlib/texmanager.py
@@ -176,8 +176,15 @@ def get_basefile(cls, tex, fontsize, dpi=None):
         Return a filename based on a hash of the string, fontsize, and dpi.
         """
         src = cls._get_tex_source(tex, fontsize) + str(dpi)
-        return os.path.join(
-            cls.texcache, hashlib.md5(src.encode('utf-8')).hexdigest())
+        filehash = hashlib.md5(src.encode('utf-8')).hexdigest()
+        filepath = Path(cls.texcache)
+
+        num_letters, num_levels = 2, 2
+        for i in range(0, num_letters*num_levels, num_letters):
+            filepath = filepath / Path(filehash[i:i+2])
+
+        filepath.mkdir(parents=True, exist_ok=True)
Member:
I'm not following this fix - it seems to just make 8 new directories every time a basefile is asked for. I don't think that will help performance, but maybe I'm not understanding something. How are you testing this?

Contributor Author:
This is one of the proposed solutions from the issue; it doesn't fix the problem of the cache's size being unbounded, but it solves the other problem of having too many files in the same single directory, which the issue mentions can be a bottleneck. make_dvi and the other functions use the path returned by get_basefile to cache/retrieve the file, so instead of all the files being stored in ~/.cache/matplotlib/tex.cache/ (or wherever the cache is specified in the config), it will store them as ~/.cache/matplotlib/tex.cache/xx/xx/xx/xx/, where the xs are the first 8 letters of the file hash.
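To illustrate, here is a standalone sketch of the sharding (not the patch itself; the hash value and cache location below are made up):

from pathlib import Path

filehash = "3f2b9c41d0a7e6558c1b2d4e9f0a7b3c"  # example md5 hex digest
cache_root = Path("~/.cache/matplotlib/tex.cache").expanduser()
num_letters, num_levels = 2, 4  # four levels of two hex characters each
parts = [filehash[i:i + num_letters]
         for i in range(0, num_letters * num_levels, num_letters)]
print(cache_root.joinpath(*parts, filehash))
# -> ~/.cache/matplotlib/tex.cache/3f/2b/9c/41/3f2b9c41d0a7e6558c1b2d4e9f0a7b3c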

As for unit tests, I wasn't sure how to go about implementing them because the cache is never cleared, so the TexManager module could just be returning cached files from a previous run, instead of creating them (I assume we want to test both creation and retrieval). Would it be alright to use something like shutil.rmtree in the tests on the cache directory to force the creation of the files?

Member:

OK - 4 deep is potentially 65k directories, most likely with only one file in each. Do we really want that overhead? Even just two-deep would probably be fine? That's around 8 million files.

Contributor Author:

I see. I only did four levels based on the suggestion in the issue. To make sure I understand correctly, the 8 million number comes from 16^2 * 32000 (since 32k is where the bottlenecks begin, according to the issue discussion)? I will also amend my commit to change to two levels.
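Checking that arithmetic with a quick snippet, just to confirm my reading:

print(16**2 * 32_000)  # 8192000, i.e. roughly 8 million files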

I wrote a test case that checks that the path returned by make_dvi contains the correct number of levels. I'm not sure of a good way to "clear" the cache in the test case without messing with multithreading and locks, so I have left it alone for now.
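Roughly, the test I have in mind looks like the sketch below. It is only a sketch: it calls get_basefile directly (so no TeX installation is needed), and it assumes texcache is a plain class attribute that pytest's monkeypatch can point at a temporary directory, which sidesteps clearing the real cache:

from pathlib import Path
from matplotlib.texmanager import TexManager

def test_basefile_is_sharded(tmp_path, monkeypatch):
    monkeypatch.setattr(TexManager, "texcache", str(tmp_path))
    basefile = TexManager.get_basefile("$x^2$", fontsize=12)
    # With two levels of two hex characters, the returned path sits
    # two directories below the cache root: xx/xx/<full hash>.
    rel = Path(basefile).relative_to(tmp_path)
    assert len(rel.parts) == 3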

Member:

I'm not an expert on these caches - hopefully @tacaswell or @anntzer will clarify, as I think they are both familiar with them.

I agree that clearing the cache during a process seems fraught. It might be desirable to touch files as they are used, and then clear old files at the beginning of the process, before any race conditions can arise. Not sure if there is a huge performance penalty to touching the cache files, but it seems easier than an SQL database to me.
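Something like this rough sketch (the 30-day cutoff is just an example, and none of this is existing matplotlib behavior):

import os
import time
from pathlib import Path

MAX_AGE = 30 * 24 * 3600  # hypothetical cutoff: 30 days unused

def touch_on_use(path):
    os.utime(path)  # refresh mtime so the entry counts as recently used

def prune_at_startup(cache_root):
    # Run once, before any workers start, so pruning cannot race with
    # readers; removes files whose mtime is older than the cutoff.
    cutoff = time.time() - MAX_AGE
    for f in Path(cache_root).rglob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink(missing_ok=True)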

Member:

We should only make folders when there is actually a file in them (and it looks like that is what this implementation does).

The (main) thing that filesystems have major performance issues with is many files or folders in the same directory, rather than the absolute number of files / folders. That is, having 1M files in a single folder is way harder on the file system than 3 layers of 100.

As a reference, git does this in .git/objects with one layer of 2 characters (but many of the objects are packfiles, and I think it makes sure that the total number never gets too big).

Member:

I guess I was concerned about 16kb of directory metadata for each file group, when the groups often take up <1kb each.

Contributor (@anntzer) · Dec 13, 2022:

I think trying to clean up the cache in the middle of a test run could be pretty nasty (I don't know if our process-level caches can handle that, in fact) and I'd say that we should not require tests here (the fact that the rest of the test suite -- specifically usetex tests -- works at all is good enough).

+        return os.path.join(filepath, filehash)

     @classmethod
     def get_font_preamble(cls):