Implement nested four-level TeX cache #24699
Conversation
Hi @jpjepko - welcome to Matplotlib! Did you mean to keep this in draft, or is this ready for review?

Hi @melissawm, thank you for the warm welcome! Yes, I am ready to submit for review; I will mark it as such.
```python
nested_folders = nested_folders / Path(filehash[i:i+2])
...
filepath = (Path(cls.texcache) / nested_folders)
filepath.mkdir(parents=True, exist_ok=True)
```
I'm not following this fix - it seems to just make 8 new directories every time a basefile is asked for. I don't think that will help performance, but maybe I'm not understanding something. How are you testing this?
This is one of the proposed solutions from the issue; it doesn't fix the problem of the cache's size being unbounded, but it solves the other problem of having too many files in a single directory, which the issue mentions can be a bottleneck. `make_dvi` and the other functions use the path returned by `get_basefile` to cache/retrieve the file, so instead of all the files being stored in `~/.cache/matplotlib/tex.cache/` (or wherever the cache is specified in the config), they will be stored as `~/.cache/matplotlib/tex.cache/xx/xx/xx/xx/`, where the `x`s are the first 8 letters of the file hash.

As for unit tests, I wasn't sure how to go about implementing them, because the cache is never cleared, so the `TexManager` module could just be returning cached files from a previous run instead of creating them (I assume we want to test both creation and retrieval). Would it be alright to use something like `shutil.rmtree` on the cache directory in the tests to force the creation of the files?
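The sharding scheme described in the comment above can be sketched as a stand-alone helper (hypothetical names; the real change lives inside `TexManager.get_basefile`, so this is only an illustration of the path layout):

```python
import hashlib
from pathlib import Path

def nested_cache_path(texcache, tex, levels=4):
    """Hypothetical sketch: shard a cache entry by the first 2*levels hex
    characters of its md5 hash, i.e. texcache/xx/xx/xx/xx/<full hash>."""
    filehash = hashlib.md5(tex.encode("utf-8")).hexdigest()
    shards = [filehash[i:i + 2] for i in range(0, 2 * levels, 2)]
    return Path(texcache).joinpath(*shards, filehash)
```

Each 2-character component gives 256-way fan-out per level, so no single directory accumulates all the cache files.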
OK - 4 deep is potentially 65k directories, most likely with only one file in each. Do we really want that overhead? Even just two-deep would probably be fine? That's around 8 million files.
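The figures in this exchange can be reproduced with quick back-of-the-envelope arithmetic (a sketch of the reasoning only; it assumes each level is counted as 16-way fan-out, and ~32k files per directory as the slowdown threshold mentioned in the issue):

```python
# Back-of-the-envelope numbers from the discussion above.
# Assumption: each level is counted as one hex character of fan-out.
fanout = 16

# "4 deep is potentially 65k directories": leaf directory count
four_deep_dirs = fanout ** 4            # 65536

# "two-deep ... around 8 million files": leaf directories times ~32k
# files each (the per-directory count where slowdowns reportedly begin)
files_per_dir = 32_000
two_deep_capacity = fanout ** 2 * files_per_dir   # 8,192,000
```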
I see. I only did four levels based on the suggestion in the issue. To make sure I understand correctly, does the 8 million number come from 16^2 * 32000 (since 32k is where the bottlenecks begin, according to the issue discussion)? I will also amend my commit to change to two levels.

I wrote a test case that checks that the path of the file returned by `make_dvi` contains the correct levels. I'm not sure of a good way to "clear" the cache in the test case without messing with multithreading and locks, so I have left it alone for now.
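A test along the lines described above can check the nesting depth without touching the real cache at all. This is a self-contained sketch using a stand-in for the PR's `get_basefile` (the helper and its names are illustrative, not the PR's actual test):

```python
import hashlib
import tempfile
from pathlib import Path

def fake_get_basefile(texcache, tex, levels=2):
    # Stand-in for the PR's get_basefile: shard by md5-hash prefix.
    h = hashlib.md5(tex.encode("utf-8")).hexdigest()
    shards = [h[i:i + 2] for i in range(0, 2 * levels, 2)]
    path = Path(texcache, *shards)
    path.mkdir(parents=True, exist_ok=True)
    return str(path / h)

def test_basefile_is_nested():
    with tempfile.TemporaryDirectory() as cache:
        rel = Path(fake_get_basefile(cache, "$x^2$")).relative_to(cache)
        assert len(rel.parts) == 3          # two shard dirs + hash filename
        assert all(len(p) == 2 for p in rel.parts[:-1])

test_basefile_is_nested()
```

Using a `TemporaryDirectory` as the cache root sidesteps the "cache is never cleared" problem entirely: each test run starts empty, so creation is always exercised.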
I'm not an expert on these caches - hopefully @tacaswell or @anntzer will clarify as I think they are both familiar.
I agree that clearing the cache during a process seems fraught. It might be desirable to touch files as they are used, and then clear old files at the beginning of the process, before any race conditions can set up. I'm not sure if there is a huge performance penalty to touching the cache files, but it seems easier than an SQL database to me.
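The "touch on use, prune at startup" idea could look roughly like this (hypothetical helpers, not part of the PR; the age threshold is an arbitrary placeholder):

```python
import os
import time
from pathlib import Path

def touch_on_use(path):
    # Update the mtime so recently used cache entries look "fresh".
    os.utime(path, None)

def prune_cache(cache_dir, max_age_days=30.0):
    """Remove cache files untouched for max_age_days. Intended to run once
    at process startup, before any workers can race on the cache."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for p in Path(cache_dir).rglob("*"):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink(missing_ok=True)
            removed += 1
    return removed
```

An `os.utime` call per cache hit is a single metadata write, which is why it is plausibly cheaper than maintaining a separate database of access times.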
We should only make folders when there is actually a file in them (and it looks like that is what this implementation does).

The (main) thing that filesystems have major performance issues with is many files or folders in the same directory, rather than the absolute number of files/folders. That is, having 1M files in a single folder is much harder on the filesystem than 3 layers of 100.

As a reference, `git` does this in `.git/objects` with one layer of 2 characters (but many of the objects are in packfiles, and I think it makes sure that the total number never gets too big).
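For comparison with the scheme in this PR, git's single-layer sharding can be sketched like this (illustrative only; real git also zlib-compresses objects and migrates them into packfiles):

```python
import hashlib

def git_loose_object_path(data):
    """Where git would store a loose blob: .git/objects/<2 hex>/<38 hex>.
    The object id is the sha1 of a 'blob <size>\\0' header plus the data."""
    sha = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
    return f".git/objects/{sha[:2]}/{sha[2:]}"
```

One 2-character layer gives 256-way fan-out, which git supplements with packfiles to keep the loose-object count per directory small.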
I guess I was concerned about 16 kB of directory metadata for file groups that often only take up <1 kB each.
I think trying to clean up the cache in the middle of a test run could be pretty nasty (I don't know if our process-level caches can handle that, in fact) and I'd say that we should not require tests here (the fact that the rest of the test suite -- specifically usetex tests -- works at all is good enough).
Force-pushed from 2bb115c to c7699a0
lib/matplotlib/texmanager.py (outdated)
```diff
-        return os.path.join(
-            cls.texcache, hashlib.md5(src.encode('utf-8')).hexdigest())
+        return os.path.join(
+            os.path.join(cls.texcache, nested_folders), filehash)
```
We should drop the redundant characters?

Any amount of this is going to help push the problem out; I am not too worried about the details between 2 levels and 4 levels unless we have benchmarks that show one is much worse.
Force-pushed from c7699a0 to a948056
I have a mild preference for dropping the leading characters that are used in the path, but not enough to block merging over it.
Prevents putting too many files in a single folder
Force-pushed from a948056 to 631b286
Thanks @jpjepko! Congratulations on your first PR to Matplotlib 🎉 We hope to hear from you again.
PR Summary

This PR is a fix for issue #23779. I decided to try implementing the first suggested solution: to use a nested 2-letter, 4-level folder hierarchy. This addresses the problem of putting too many files in a single directory, but the cache still has unbounded space.
PR Checklist

Documentation and Tests

- [ ] Has pytest style unit tests (and `pytest` passes)

Release Notes

- [ ] New features are marked with a `.. versionadded::` directive in the docstring and documented in `doc/users/next_whats_new/`
- [ ] API changes are marked with a `.. versionchanged::` directive in the docstring and documented in `doc/api/next_api_changes/`
- [ ] Release notes conform with instructions in `next_whats_new/README.rst` or `next_api_changes/README.rst`