Cache kpsewhich results persistently #10236
Conversation
EDIT: not sure this is significantly faster - I think I adequately "cleared" the cache by changing the scale over which x varies. Ran w/ True a second time to get the cached value... Without this PR on master:
With this patch:
Test code is:

```python
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import time

x = np.linspace(0, 200, 500)
y = 3 * x + 0.5 * x**3 + 2

def plot(x, y):
    plt.plot(x, y)
    plt.title('simple testplot')
    plt.xlabel('$x$')
    plt.ylabel(r'$3 x + \frac{1}{2} x^3 + 2$')
    plt.savefig('test.pdf')

for use in [False, True, True]:
    matplotlib.rc('text', usetex=use)
    t0 = time.time()
    plot(x, y)
    print('%r' % use, time.time() - t0)
```
Is there a specific reason to use sqlite instead of e.g. json? Just curious...
A sqlite database is easier to extend, and faster to modify, especially as it gets larger. I'd like to store things like pre-parsed font files in the cache.
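To make the idea concrete, here is a minimal sketch of a persistent kpsewhich cache backed by sqlite. The class and table names are illustrative, not the PR's actual API: the point is simply that a lookup hits the database first and only spawns a `kpsewhich` subprocess on a miss.

```python
import sqlite3
import subprocess

class KpsewhichCache:
    """Illustrative sketch: cache kpsewhich lookups in a sqlite table."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS kpsewhich "
            "(filename TEXT PRIMARY KEY, result TEXT)")

    def find(self, filename):
        # Cache hit: answer straight from the database, no subprocess.
        row = self.connection.execute(
            "SELECT result FROM kpsewhich WHERE filename = ?",
            (filename,)).fetchone()
        if row is not None:
            return row[0]
        # Cache miss: spawn kpsewhich once and remember the answer.
        result = subprocess.check_output(
            ["kpsewhich", filename]).decode().rstrip("\n")
        with self.connection:  # commits the insert
            self.connection.execute(
                "INSERT INTO kpsewhich VALUES (?, ?)", (filename, result))
        return result
```

On repeated runs (with an on-disk path instead of `:memory:`) every lookup after the first is a single indexed SELECT.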
lib/matplotlib/dviread.py (outdated review excerpt):

```python
Find a file in the texmf tree.
...
__slots__ = ('connection')
schema_version = 1
```
This will cause the db to be repeatedly rebuilt if the user has venvs with different Python versions (fontList.json currently suffers from the same problem).
Perhaps better would be to use something like `~/.cache/texsupport.$version.db`. That way you also don't need to handle db migration.
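One plausible reading of that suggestion, as a sketch (the variable names and the use of the schema version here are illustrative, not the PR's exact code): bake a version number into the cache filename, so a version bump simply creates a fresh database file and no migration logic is ever needed.

```python
from pathlib import Path

# Hypothetical: version-stamped cache path in the style of
# ~/.cache/texsupport.$version.db suggested in the review.
schema_version = 1  # bump whenever the table layout changes
db_path = Path.home() / ".cache" / "texsupport.{}.db".format(schema_version)
# A schema change yields a new filename, so old and new versions of the
# library never fight over one database.
```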
That seems like a good idea.
Force-pushed from e987a58 to 1ef6c98.
Rebased on top of current master. Incorporates @anntzer's suggestion to include the version in the cache filename.
Force-pushed from a107b02 to bb2c85f.
And allow batching them. This commit does not yet use the batching but makes it possible.
- synchronous=normal (fewer disk writes, still safe in WAL mode)
- foreign key enforcement
- log SQL statements at debug level
- use sqlite3.Row (enables accessing columns by name)
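The connection setup that commit message describes can be sketched as follows; the pragma names are standard sqlite, while the surrounding code is illustrative rather than the PR's actual implementation.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row             # access columns by name
connection.execute("PRAGMA journal_mode=WAL")    # write-ahead logging
connection.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; still safe under WAL
connection.execute("PRAGMA foreign_keys=ON")     # enforce foreign key constraints
```

`synchronous=NORMAL` trades a little durability on power loss for far fewer disk syncs, which is considered safe when combined with WAL journaling.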
Force-pushed from 5866d94 to 3ce3061.
A more complete answer to the question of sqlite vs json: PR #10268 (on top of this one) implements caching parsed dvi files in the sqlite database (see the SQL definition). The btree indexes allow fast retrieval of and iteration over file contents, and the sqlite layer takes care of all concurrency. Deleting individual files from the cache (not implemented in that PR yet) is easy: delete the entry for the file id and the foreign key constraints take care of the rest. Reading and writing large JSON files is pretty slow because of all the parsing, while sqlite stores data in an efficient binary format.
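The cascade-on-delete behavior mentioned above can be demonstrated with a toy schema; the table and column names here are made up for illustration and are not the SQL definition from #10268.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys=ON")  # cascades only fire when enforced
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE contents ("
             "file_id INTEGER REFERENCES files(id) ON DELETE CASCADE, "
             "data BLOB)")
conn.execute("INSERT INTO files VALUES (1, 'some.dvi')")
conn.execute("INSERT INTO contents VALUES (1, x'00ff')")
# One DELETE on the parent row removes its cached contents as well.
conn.execute("DELETE FROM files WHERE id = 1")
```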
A quick check shows that (on my machine) loading fontList.json takes ~3.5ms (incl. ~0.3ms for just reading the raw contents of the file), which should be contrasted with importing matplotlib.pyplot with qt (~450ms), matplotlib.pyplot with agg (~400ms), just matplotlib (~200ms) or just numpy (~100ms). So from the POV of speed it's just ~1% (I'm not too concerned about writing as it's mostly a one-off operation and likely dominated by actually computing the data anyway).
Just to be clear, I'm not suggesting switching fontList.json to sqlite; I want to make usetex faster with the pdf backend. kpsewhich gets called a lot during dvi parsing, and each external process spawn takes nontrivial time (reputedly, on some platforms), so caching those results is a starting point. For those results a json file would probably work well enough, but for the data stored in #10268 (pre-parsed dvi files) a binary format is going to be better. I'm building this as a sequence of PRs (this one, #10238, #10268) to keep the diffs manageable for code review.
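The batching idea can be sketched generically; this helper is illustrative and not the PR's API. The premise is that spawning one external process for N queries costs roughly one spawn instead of N, with one result line read back per query.

```python
import subprocess

def run_batched(command, queries):
    """Hypothetical helper: run one process for many queries,
    returning one output line per query."""
    output = subprocess.check_output([command, *queries])
    return output.decode().splitlines()
```

Applied to TeX lookups this would look something like `run_batched("kpsewhich", ["cmr10.tfm", "cmbx12.tfm"])`. One caveat: kpsewhich prints no line for names it cannot resolve, so a real implementation has to account for the resulting misalignment between queries and results.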
Moving to 3.1, this should get lots of use before we release it. |
I feel like we should investigate a bit more what makes kpsewhich so slow on OSX before adding more cache layers on our side. |
Doesn't look like this is likely to go in, so better to close. |
PR Summary
See #4880 for discussion. This is a cleaned-up version of part of my suggested solution; the other part will involve more intricate parsing of dvi and vf files to call kpsewhich in batches, but already the persistent caching here should improve performance on repeated runs.
PR Checklist