Cache kpsewhich results persistently #10236
Conversation
EDIT: not sure this is significantly faster - I think I adequately "cleared" the cache by changing the scale over which x varies. Ran w/ True a second time to get the cached value... Without this PR on master:
With this patch:
Test code is:

```python
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import time

x = np.linspace(0, 200, 500)
y = 3 * x + 0.5 * x**3 + 2

def plot(x, y):
    plt.plot(x, y)
    plt.title('simple testplot')
    plt.xlabel('$x$')
    plt.ylabel(r'$3 x + \frac{1}{2} x^3 + 2$')
    plt.savefig('test.pdf')

for use in [False, True, True]:
    matplotlib.rc('text', usetex=use)
    t0 = time.time()
    plot(x, y)
    print('%r' % use, time.time() - t0)
```
Is there a specific reason to use sqlite instead of e.g. json? Just curious...
A sqlite database is easier to extend, and faster to modify, especially as it gets larger. I'd like to store things like pre-parsed font files in the cache.
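To make the idea concrete, here is a minimal sketch of a persistent kpsewhich cache backed by sqlite. The class and table names are illustrative, not the PR's actual API: the point is simply that a lookup hits the database first and only spawns a `kpsewhich` subprocess on a miss.

```python
import sqlite3
import subprocess

class KpsewhichCache:
    """Illustrative sketch: cache kpsewhich lookups in a sqlite table."""

    def __init__(self, path=":memory:"):
        self.connection = sqlite3.connect(path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS kpsewhich "
            "(filename TEXT PRIMARY KEY, result TEXT)")

    def find(self, filename):
        # Cache hit: answer straight from the database, no subprocess.
        row = self.connection.execute(
            "SELECT result FROM kpsewhich WHERE filename = ?",
            (filename,)).fetchone()
        if row is not None:
            return row[0]
        # Cache miss: spawn kpsewhich once and remember the answer.
        result = subprocess.check_output(
            ["kpsewhich", filename]).decode().rstrip("\n")
        with self.connection:  # commits the insert
            self.connection.execute(
                "INSERT INTO kpsewhich VALUES (?, ?)", (filename, result))
        return result
```

On repeated runs (with an on-disk path instead of `:memory:`) every lookup after the first is a single indexed SELECT.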
lib/matplotlib/dviread.py (outdated review excerpt):

```python
Find a file in the texmf tree.
...
__slots__ = ('connection')
schema_version = 1
```
This will cause the db to be repeatedly rebuilt if the user has venvs with different Python versions (fontList.json currently suffers from the same problem).
Perhaps better would be to use something like `~/.cache/texsupport.$version.db`. That way you also don't need to handle db migration.
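One plausible reading of that suggestion, as a sketch (the variable names and the use of the schema version here are illustrative, not the PR's exact code): bake a version number into the cache filename, so a version bump simply creates a fresh database file and no migration logic is ever needed.

```python
from pathlib import Path

# Hypothetical: version-stamped cache path in the style of
# ~/.cache/texsupport.$version.db suggested in the review.
schema_version = 1  # bump whenever the table layout changes
db_path = Path.home() / ".cache" / "texsupport.{}.db".format(schema_version)
# A schema change yields a new filename, so old and new versions of the
# library never fight over one database.
```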
That seems like a good idea.
Force-pushed from e987a58 to 1ef6c98.
Rebased on top of current master. Incorporates @anntzer's suggestion to include the version in the cache filename.
Force-pushed from a107b02 to bb2c85f.
And allow batching them. This commit does not yet use the batching but makes it possible.
- synchronous=normal (fewer disk writes, still safe in WAL mode)
- foreign key enforcement
- log SQL statements at debug level
- use sqlite3.Row (enables accessing columns by name)
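The connection setup that commit message describes can be sketched as follows; the pragma names are standard sqlite, while the surrounding code is illustrative rather than the PR's actual implementation.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row             # access columns by name
connection.execute("PRAGMA journal_mode=WAL")    # write-ahead logging
connection.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; still safe under WAL
connection.execute("PRAGMA foreign_keys=ON")     # enforce foreign key constraints
```

`synchronous=NORMAL` trades a little durability on power loss for far fewer disk syncs, which is considered safe when combined with WAL journaling.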
Force-pushed from 5866d94 to 3ce3061.
A more complete answer to the question of sqlite vs json: PR #10268 (on top of this one) implements caching parsed dvi files in the sqlite database (see the SQL definition). The btree indexes allow fast retrieval of and iteration over file contents, and the sqlite layer takes care of all concurrency. Deleting individual files from the cache (not implemented in that PR yet) is easy: delete the entry for the file id and the foreign key constraints take care of the rest. Reading and writing large JSON files is pretty slow because of all the parsing, while sqlite stores data in an efficient binary format.
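The cascade-on-delete behavior mentioned above can be demonstrated with a toy schema; the table and column names here are made up for illustration and are not the SQL definition from #10268.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys=ON")  # cascades only fire when enforced
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE contents ("
             "file_id INTEGER REFERENCES files(id) ON DELETE CASCADE, "
             "data BLOB)")
conn.execute("INSERT INTO files VALUES (1, 'some.dvi')")
conn.execute("INSERT INTO contents VALUES (1, x'00ff')")
# One DELETE on the parent row removes its cached contents as well.
conn.execute("DELETE FROM files WHERE id = 1")
```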
A quick check shows that (on my machine) loading fontList.json takes ~3.5ms (incl. ~0.3ms for just reading the raw contents of the file), which should be contrasted with importing matplotlib.pyplot with qt (~450ms), matplotlib.pyplot with agg (~400ms), just matplotlib (~200ms) or just numpy (~100ms). So from the POV of speed it's just ~1% (I'm not too concerned about writing as it's mostly a one-off operation and likely dominated by actually computing the data anyway).
Just to be clear, I'm not suggesting switching fontList.json to sqlite; I want to make usetex faster with the pdf backend. kpsewhich gets called a lot during dvi parsing, and each external process spawn takes nontrivial time (reputedly, on some platforms), so caching those results is a starting point. For those results a json file would probably work well enough, but for the data stored in #10268 (pre-parsed dvi files) a binary format is going to be better. I'm building this as a sequence of PRs (this one, #10238, #10268) to keep the diffs manageable for code review.
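The batching idea can be sketched generically; this helper is illustrative and not the PR's API. The premise is that spawning one external process for N queries costs roughly one spawn instead of N, with one result line read back per query.

```python
import subprocess

def run_batched(command, queries):
    """Hypothetical helper: run one process for many queries,
    returning one output line per query."""
    output = subprocess.check_output([command, *queries])
    return output.decode().splitlines()
```

Applied to TeX lookups this would look something like `run_batched("kpsewhich", ["cmr10.tfm", "cmbx12.tfm"])`. One caveat: kpsewhich prints no line for names it cannot resolve, so a real implementation has to account for the resulting misalignment between queries and results.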
Moving to 3.1, this should get lots of use before we release it. |
I feel like we should investigate a bit more what makes kpsewhich so slow on OSX before adding more cache layers on our side. |
Doesn't look like this is likely to go in, so better to close. |
PR Summary
See #4880 for discussion. This is a cleaned-up version of part of my suggested solution; the other part will involve more intricate parsing of dvi and vf files to call kpsewhich in batches, but already the persistent caching here should improve performance on repeated runs.
PR Checklist