Implement nested four-level TeX cache #24699


Merged: 1 commit from texcache-four-levels into matplotlib:main on Dec 22, 2022

Conversation

@jpjepko (Contributor) commented Dec 12, 2022

This PR is a fix for issue #23779. I decided to try implementing the first suggested solution: using a nested, 2-letter, 4-level folder hierarchy. This addresses the problem of putting too many files in a single directory, but the total size of the cache remains unbounded.
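For readers skimming the thread, here is a minimal sketch of the layout described above; the helper name `nested_cache_path`, its defaults, and the example hash are illustrative, not Matplotlib API:

```python
from pathlib import Path

def nested_cache_path(cache_root, filehash, letters=2, levels=4):
    """Split the leading characters of the hash into nested directories,
    e.g. "3f9ac1d7..." -> <cache_root>/3f/9a/c1/d7/<filehash>."""
    parts = [filehash[i:i + letters] for i in range(0, letters * levels, letters)]
    path = Path(cache_root).joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)  # directories appear only when something is cached
    return path / filehash

# Example with a made-up hash:
# nested_cache_path("/tmp/tex.cache", "3f9ac1d7" + "0" * 24)
# -> /tmp/tex.cache/3f/9a/c1/d7/3f9ac1d7000...000
```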

PR Summary

PR Checklist

Documentation and Tests

  • Has pytest style unit tests (and pytest passes)
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • New plotting related features are documented with examples.

Release Notes

  • New features are marked with a .. versionadded:: directive in the docstring and documented in doc/users/next_whats_new/
  • API changes are marked with a .. versionchanged:: directive in the docstring and documented in doc/api/next_api_changes/
  • Release notes conform with instructions in next_whats_new/README.rst or next_api_changes/README.rst

@melissawm (Member)

Hi @jpjepko - welcome to Matplotlib! Did you mean to keep this in draft, or is this ready for a review?

@jpjepko (Contributor, Author) commented Dec 12, 2022

Hi @melissawm, thank you for the warm welcome! Yes, I am ready to submit for review; I will mark it as such.

@jpjepko jpjepko marked this pull request as ready for review December 12, 2022 18:35
Diff excerpt under review:

```python
nested_folders = nested_folders / Path(filehash[i:i+2])

filepath = (Path(cls.texcache) / nested_folders)
filepath.mkdir(parents=True, exist_ok=True)
```

Member

I'm not following this fix - it seems to just make 8 new directories every time a basefile is asked for. I don't think that will help performance, but maybe I'm not understanding something. How are you testing this?

Contributor Author

This is one of the proposed solutions from the issue; it doesn't fix the problem of the cache's size being unbounded, but it does solve the other problem of having too many files in a single directory, which the issue mentions can be a bottleneck. make_dvi and the other functions use the path returned by get_basefile to cache/retrieve the file, so instead of all files being stored directly in ~/.cache/matplotlib/tex.cache/ (or wherever the cache is specified in the config), they are stored under ~/.cache/matplotlib/tex.cache/xx/xx/xx/xx/, where the xx pairs are the first 8 characters of the file hash.

As for unit tests, I wasn't sure how to go about implementing them, because the cache is never cleared: the TexManager module could just be returning cached files from a previous run instead of creating them (I assume we want to test both creation and retrieval). Would it be alright to use something like shutil.rmtree on the cache directory in the tests to force the creation of the files?

Member

OK - 4 deep is potentially 65k directories, most of them likely holding only one file. Do we really want that overhead? Even just two-deep would probably be fine? That's around 8 million files.

Contributor Author

I see. I only did four levels based on the suggestion in the issue. To make sure I understand correctly, does the 8 million number come from 16^2 * 32000 (since 32k is where the bottlenecks begin, according to the issue discussion)? I will also amend my commit to change to two levels.

I wrote a test case that checks that the path of the file returned from make_dvi contains the correct levels. I'm not sure of a good way to "clear" the cache in the test case without messing with multithreading and locks, so I have left that alone for now.
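Not the test as merged, just a sketch of the kind of check described here; it assumes TexManager.texcache is a class attribute and get_basefile is callable on the class, as the diff hunks above suggest, and it sidesteps the cache-clearing question by pointing the cache at a temporary directory:

```python
import re
from pathlib import Path

from matplotlib.texmanager import TexManager


def test_basefile_path_is_nested(tmp_path, monkeypatch):
    # Point the cache at a throwaway directory so nothing from a previous
    # run is reused (attribute name assumed from the diff hunks above).
    monkeypatch.setattr(TexManager, "texcache", str(tmp_path))
    basefile = TexManager.get_basefile("$x^2$", fontsize=12)
    rel = Path(basefile).relative_to(tmp_path)
    # Every directory component should be a two-character slice of the hash.
    assert all(re.fullmatch(r"[0-9a-f]{2}", part) for part in rel.parts[:-1])
```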

Member

I'm not an expert on these caches - hopefully @tacaswell or @anntzer will clarify, as I think they are both familiar with them.

I agree that clearing the cache during a process seems fraught. It might be desirable to touch files as they are used, and then clear old files at the beginning of the process, before any race conditions can arise. I'm not sure whether there is a huge performance penalty to touching the cache files, but it seems easier than an SQL database to me.
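A minimal sketch of the "touch on use, prune at startup" idea floated above; the function names and the 30-day cutoff are illustrative, not anything Matplotlib provides:

```python
import os
import time
from pathlib import Path


def touch_on_use(path):
    """Bump the mtime so recently used cache entries survive the next prune."""
    os.utime(path, None)


def prune_old_entries(cache_root, max_age_days=30):
    """At process start, delete cache files not touched in max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    for path in Path(cache_root).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink(missing_ok=True)
```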

Member

We should only make folders when there is actually a file in them (and it looks like that is what this implementation does).

The (main) thing that filesystems have major performance issues with is many files or folders in the same directory, rather than the absolute number of files/folders. That is, having 1M files in a single folder is much harder on the filesystem than 3 layers of 100.

As a reference, git does this in .git/objects with one layer of 2 characters (but many of the objects are in packfiles, and I think it makes sure that the total number never gets too big).
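For illustration, the git layout referred to here shards loose objects by the first two hex characters of the object id; this snippet is just an example of that scheme, not anything from this PR:

```python
# Loose-object path for a SHA-1 object id: .git/objects/<2 chars>/<38 chars>
sha = "9f2c3a7e5b1d4c8a0e6f7b2d9c1a3e5f7b9d0c2a"   # made-up object id
loose_object_path = f".git/objects/{sha[:2]}/{sha[2:]}"
# -> .git/objects/9f/2c3a7e5b1d4c8a0e6f7b2d9c1a3e5f7b9d0c2a
```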

Member

I guess I was concerned about 16 kB of directory metadata for each file group, when the groups themselves often take up less than 1 kB each.
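A back-of-envelope reading of that concern; the 4 KiB per-directory figure is an assumption about typical filesystem block allocation, not something stated in the thread:

```python
levels = 4                       # directory levels per cached entry
dir_block = 4 * 1024             # assumed bytes allocated per directory
overhead = levels * dir_block    # 16384 bytes, i.e. the ~16 kB mentioned above
```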

@anntzer (Contributor) commented Dec 13, 2022

I think trying to clean up the cache in the middle of a test run could be pretty nasty (I don't know if our process-level caches can handle that, in fact) and I'd say that we should not require tests here (the fact that the rest of the test suite -- specifically usetex tests -- works at all is good enough).

@jpjepko jpjepko force-pushed the texcache-four-levels branch from 2bb115c to c7699a0 Compare December 12, 2022 22:05
Diff excerpt under review:

```diff
 return os.path.join(
-    cls.texcache, hashlib.md5(src.encode('utf-8')).hexdigest())
+    os.path.join(cls.texcache, nested_folders), filehash)
```

Member

We should drop redundant characters?
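A sketch of what this suggestion would mean, as I read it together with the later review comment about "dropping the leading characters"; this is not what was merged, and the names and values below are illustrative:

```python
import hashlib

src = "example tex source"                              # illustrative input
filehash = hashlib.md5(src.encode("utf-8")).hexdigest()

levels = [filehash[i:i + 2] for i in range(0, 4, 2)]    # directory components, e.g. ["3f", "9a"]
basename = filehash[len("".join(levels)):]              # drop the prefix already encoded in the path
```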

@tacaswell (Member)

Any amount of this is going to help push the problem out; I am not too worried about the details between 2 levels and 4 levels unless we have benchmarks that show one is much worse.

@jpjepko jpjepko force-pushed the texcache-four-levels branch from c7699a0 to a948056 Compare December 13, 2022 02:24
@tacaswell tacaswell added this to the v3.7.0 milestone Dec 13, 2022
@tacaswell (Member) left a comment

I have a mild preference for dropping the leading characters that are used in the path, but not enough to block merging over it.

@QuLogic QuLogic changed the title Implement nested four-level cache Implement nested four-level TeX cache Dec 14, 2022
Prevents putting too many files in a single folder
@jpjepko jpjepko force-pushed the texcache-four-levels branch from a948056 to 631b286 Compare December 22, 2022 22:02
@QuLogic QuLogic merged commit 65dc9be into matplotlib:main Dec 22, 2022
@QuLogic (Member) commented Dec 22, 2022

Thanks @jpjepko! Congratulations on your first PR to Matplotlib 🎉 We hope to hear from you again.
