
[ENH]: control the size of the tex cache #23779


Closed
tacaswell opened this issue Aug 30, 2022 · 2 comments

Labels: Difficulty: Medium, Good first issue, New feature
Milestone: v3.7.0

Comments

@tacaswell
Member

Problem

We keep a cache of the .tex and .dvi files when rendering with an external LaTeX process, but we put no controls on the size of that cache. We have anecdotal reports (#4880 (comment)) that if this cache gets too big it becomes its own bottleneck (I assume the problem is that we have put too many files in a single folder for the file system to handle efficiently).

There are two hard problems in computer science:

  1. naming things
  2. cache invalidation
  3. off-by-one bugs

Proposed solution

  1. The files already use content-based addressing. One solution is the nested folder approach (like git uses internally), where the first 8 characters become 4 levels of 2-letter folder names (or whatever tree width / depth makes sense).
    • Pro: it avoids any filesystem-related slowdown due to too many files in a directory, with no new state or API (no, it does not need to be configurable)!
    • Con: still unbounded space
  2. Set a maximum number of files or amount of disk space (or both) that can be used, and then cull files by some algorithm (random? oldest on disk? do we want to track enough to do LRU or LFU?).
    • Pro: solves the unbounded cache problem!
    • Con: we will have to add some API to control this, maybe track some extra state, and we might be opening up a whole new vector for inter-process race conditions (process A: "I need to clean up!" process B: "oh, I need that file!" process A: *deletes that file* process B: 💥 )
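To make option 1 concrete, here is a minimal sketch of the path layout. The helper name `nested_cache_path`, the use of MD5, and the default depth and width are all assumptions for illustration; this is not a description of what Matplotlib does or should do.

```python
import hashlib
from pathlib import Path


def nested_cache_path(cache_root, content, suffix=".tex", depth=4, width=2):
    """Return a nested cache path derived from a hash of *content*.

    Hypothetical helper: the name, the hash choice, and the defaults are
    assumptions made for this sketch only.
    """
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    # Turn the leading depth * width hex characters into directory names,
    # e.g. "5d41402a..." -> "5d/41/40/2a/".
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return Path(cache_root, *parts, digest + suffix)
```

With two hex characters per level, each directory holds at most 256 subdirectories, so no single directory grows without bound until the leaves. The caller would still need to create the parent directories (for example with `path.parent.mkdir(parents=True, exist_ok=True)`) before writing.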

I am labeling this as a good first issue because, while there may be some new API, it should be well contained to how we manage a cache (and because it is a cache, we should already be robust to it going away under us). I am marking it medium difficulty because it requires thinking through the consequences of the caching algorithm, and it would be best done by someone who has at least worked with (and preferably implemented / maintained) a similar on-disk caching system.
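As a rough illustration of what option 2 could look like, the sketch below culls the oldest files (by modification time) until the cache fits a byte budget. The function name `cull_cache` and the `max_bytes` parameter are hypothetical, and "oldest on disk" is just one of the candidate policies listed above. The sketch tolerates files vanishing mid-scan, which covers two cleaners racing each other; it does not protect a reader that is about to open a file being deleted.

```python
import stat
from pathlib import Path


def cull_cache(cache_root, max_bytes):
    """Delete the oldest files until the cache holds at most *max_bytes*.

    Hypothetical sketch; returns the number of bytes left in the cache.
    """
    entries = []
    for path in Path(cache_root).rglob("*"):
        try:
            info = path.stat()
        except FileNotFoundError:
            continue  # another process removed it between listing and stat
        if stat.S_ISREG(info.st_mode):
            entries.append((info.st_mtime, info.st_size, path))

    total = sum(size for _, size, _ in entries)
    entries.sort(key=lambda entry: entry[0])  # oldest modification time first
    for _, size, path in entries:
        if total <= max_bytes:
            break
        try:
            path.unlink()
        except FileNotFoundError:
            pass  # already deleted elsewhere; the space is freed either way
        total -= size
    return total
```

For the reader side of the race, the usual answer is the one the issue already implies: treat a missing file as a cache miss and regenerate it.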

@tacaswell added the New feature, Difficulty: Medium, and Good first issue labels on Aug 30, 2022
@tacaswell added this to the v3.7.0 milestone on Aug 30, 2022
@anntzer
Contributor

anntzer commented Aug 31, 2022

FWIW http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html suggests not to go beyond 32k entries per directory.

Another option to consider may be to move the cache into a sqlite db (possibly directly storing a parsed version of the dvi file), which is something @jkseppan has argued for IIRC.
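A minimal sketch of the sqlite idea, assuming a single table keyed by a hash of the TeX source. The table name `dvi_cache`, the function names, and the SHA-256 key are all made up for illustration, and this stores the raw dvi bytes rather than a parsed version.

```python
import hashlib
import sqlite3


def open_cache(db_path):
    """Open (or create) a hypothetical sqlite-backed dvi cache."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dvi_cache ("
        " key TEXT PRIMARY KEY,"
        " dvi BLOB NOT NULL)"
    )
    return conn


def cache_key(tex_source):
    # Content-based key, mirroring how the on-disk cache names its files.
    return hashlib.sha256(tex_source.encode("utf-8")).hexdigest()


def put(conn, tex_source, dvi_bytes):
    # "with conn" commits on success and rolls back on an exception.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO dvi_cache (key, dvi) VALUES (?, ?)",
            (cache_key(tex_source), dvi_bytes),
        )


def get(conn, tex_source):
    """Return the cached dvi bytes, or None on a cache miss."""
    row = conn.execute(
        "SELECT dvi FROM dvi_cache WHERE key = ?", (cache_key(tex_source),)
    ).fetchone()
    return None if row is None else row[0]
```

A single database file sidesteps the entries-per-directory concern entirely, at the cost of moving concurrency control into sqlite's locking.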

@QuLogic
Member

QuLogic commented Jan 24, 2023

Option 1 was implemented in #24699.

@QuLogic closed this as completed on Jan 24, 2023