Cache kpsewhich results persistently #10236

Closed
wants to merge 3 commits into from

Conversation

jkseppan
Member

@jkseppan jkseppan commented Jan 12, 2018

And allow batching them. This commit does not yet use the batching
but makes it possible.

PR Summary

See #4880 for discussion. This is a cleaned-up version of part of my suggested solution; the other part will involve more intricate parsing of dvi and vf files to call kpsewhich in batches, but already the persistent caching here should improve performance on repeated runs.
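The persistent caching described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `LookupCache`, the table name `file_path`, and the injectable `lookup` parameter are all hypothetical, and the only external assumption is that `kpsewhich` is on the PATH and prints the resolved path on stdout.

```python
import sqlite3
import subprocess


def kpsewhich(filename):
    """Spawn kpsewhich once; return the resolved path, or None if not found."""
    result = subprocess.run(['kpsewhich', filename],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL,
                            text=True)
    path = result.stdout.strip()
    return path or None


class LookupCache:
    """Hypothetical sqlite-backed cache of filename -> path lookups."""

    def __init__(self, db_path, lookup=kpsewhich):
        self.connection = sqlite3.connect(db_path)
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS file_path '
            '(filename TEXT PRIMARY KEY, pathname TEXT)')
        self.lookup = lookup

    def find(self, filename):
        row = self.connection.execute(
            'SELECT pathname FROM file_path WHERE filename = ?',
            (filename,)).fetchone()
        if row is not None:
            # Cache hit; a stored NULL records that the file was not found.
            return row[0]
        pathname = self.lookup(filename)
        with self.connection:  # commits on success
            self.connection.execute(
                'INSERT OR REPLACE INTO file_path VALUES (?, ?)',
                (filename, pathname))
        return pathname
```

With a file-backed `db_path`, a second Python process asking for the same file would read the answer from the database instead of spawning `kpsewhich` again.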

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@jklymak
Member

jklymak commented Jan 12, 2018

EDIT: not sure this is significantly faster - I think I adequately "cleared" the cache by changing the scale over which x varies. Ran w/ True a second time to get the cached value...

Without this PR on master:

False 0.3574392795562744
True 2.6049678325653076
True 0.07756495475769043

With this patch:

False 0.355482816696167
True 2.334202289581299
True 0.08783102035522461

Test code is:

import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import time
x = np.linspace(0, 200, 500)
y = 3 * x + 0.5 * x**3 + 2

def plot(x, y):
    plt.plot(x, y)
    plt.title('simple testplot')
    plt.xlabel('$x$')
    plt.ylabel(r'$3 x^5 + \frac{1}{2} x^3 + 2$')
    plt.savefig('test.pdf')


for use in [False, True, True]:
    matplotlib.rc('text', usetex=use)
    t0 = time.time()
    plot(x, y)
    print('%r'%use, time.time()-t0)

@anntzer
Contributor

anntzer commented Jan 12, 2018

Is there a specific reason to use sqlite instead of e.g. json? Just curious...

@jkseppan
Member Author

A sqlite database is easier to extend, and faster to modify especially when it gets larger. I'd like to store things like pre-parsed font files in the cache.

Find a file in the texmf tree.

__slots__ = ('connection')
schema_version = 1
Contributor

This will cause the db to be repeatedly rebuilt if the user has venvs with different python versions (fontList.json currently suffers from the same problem).
Perhaps better would be to use something like ~/.cache/texsupport.$version.db. This way you also don't need to handle db migration.
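A small sketch of that naming scheme, assuming the version is simply formatted into the filename. The helper name `versioned_cache_path` is hypothetical, and which version is meant (schema version or library version) is left open here.

```python
from pathlib import Path


def versioned_cache_path(cache_dir, version):
    """Return a cache filename that embeds the version (hypothetical helper)."""
    return Path(cache_dir) / 'texsupport.{}.db'.format(version)


# Two versions never share a file, so neither overwrites or has to
# migrate the other's database.
old = versioned_cache_path(Path.home() / '.cache', 1)
new = versioned_cache_path(Path.home() / '.cache', 2)
```

Each version just opens (or creates) its own file, at the cost of leaving stale files from older versions behind.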

Member Author

That seems like a good idea.

@jkseppan
Member Author

Rebased on top of current master. Incorporates @anntzer's suggestion to include the version in the cache filename.

@jkseppan jkseppan force-pushed the kpsewhich-caching branch 2 times, most recently from a107b02 to bb2c85f Compare February 16, 2018 17:14
And allow batching them. This commit does not yet use the batching
but makes it possible.
- synchronous=normal (fewer disk writes, still safe in WAL mode)
- foreign key enforcement
- log sql statements at debug level
- use sqlite3.Row (enables accessing columns by name)
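The connection settings listed in this commit message map onto standard `sqlite3` calls roughly as below. This is a sketch rather than the commit's code: the function name `open_cache` is hypothetical, and enabling WAL explicitly is an assumption inferred from the "still safe in WAL mode" remark.

```python
import logging
import sqlite3

_log = logging.getLogger(__name__)


def open_cache(db_path):
    """Open a sqlite connection configured as the commit message describes."""
    connection = sqlite3.connect(db_path)
    # Rows support access by column name as well as by index.
    connection.row_factory = sqlite3.Row
    # Each executed SQL statement is passed to the callback as a string.
    connection.set_trace_callback(_log.debug)
    # Assumption: write-ahead logging is switched on here.
    connection.execute('PRAGMA journal_mode=WAL')
    # Fewer syncs to disk than the default FULL setting.
    connection.execute('PRAGMA synchronous=NORMAL')
    # sqlite does not enforce foreign keys unless asked, per connection.
    connection.execute('PRAGMA foreign_keys=ON')
    return connection
```

Note that `PRAGMA foreign_keys` is per-connection, so it has to be issued every time the database is opened.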
@jkseppan
Member Author

A more complete answer to the question of sqlite vs json: PR#10268 (on top of this one) implements caching parsed dvi files in the sqlite database (SQL definition). The btree indexes allow fast retrieval of and iteration over file contents, and the sqlite layer takes care of all concurrency. Deleting individual files from the cache (not implemented in that PR yet) is easy, just delete the entry for the file id and the foreign key constraints take care of the rest. Reading and writing large JSON files is pretty slow as there is a lot of parsing, while sqlite stores data in an efficient binary format.
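To illustrate the point about deletion, here is a minimal sketch of a cascading foreign key. The two-table schema (`dvi_file`, `dvi_page`) and the function names are invented for this example and are not the schema from #10268.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical file/page schema."""
    conn = sqlite3.connect(':memory:')
    conn.execute('PRAGMA foreign_keys=ON')  # required for cascades to fire
    conn.executescript("""
        CREATE TABLE dvi_file (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE dvi_page (
            file_id INTEGER NOT NULL
                REFERENCES dvi_file(id) ON DELETE CASCADE,
            page_no INTEGER NOT NULL,
            contents BLOB,
            PRIMARY KEY (file_id, page_no)
        );
    """)
    return conn


def forget_file(conn, name):
    """Delete one file entry; dependent page rows are removed by sqlite."""
    with conn:
        conn.execute('DELETE FROM dvi_file WHERE name = ?', (name,))
```

Only the parent row is deleted explicitly; the `ON DELETE CASCADE` clause removes the dependent rows in the same transaction.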

@anntzer
Contributor

anntzer commented Feb 18, 2018

A quick check shows that (on my machine) loading fontList.json takes ~3.5ms (incl. ~0.3ms for just reading the raw contents of the file), which should be contrasted with importing matplotlib.pyplot with qt (~450ms), matplotlib.pyplot with agg (~400ms), just matplotlib (~200ms) or just numpy (~100ms). So from the POV of speed it's just ~1% (I'm not too concerned about writing as it's mostly a one-off operation and likely dominated by actually computing the data anyway).
It doesn't mean we shouldn't switch to sqlite, rather I just wanted to provide an additional data point for the discussion.

@jkseppan
Member Author

Just to be clear, I'm not suggesting switching fontList.json to sqlite; I want to make usetex faster with the pdf backend. kpsewhich gets called a lot during dvi parsing, and spawning an external process each time takes nontrivial time (reputedly, on some platforms), so caching those results is a starting point. Now, for those results a json file would probably work well enough, but for the data stored in #10268 (pre-parsed dvi files) a binary format is going to be better. I'm building this as a sequence of PRs (this one, #10238, #10268) to keep the diffs manageable for code review.
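The per-call process cost mentioned here is easy to measure on a given platform. The sketch below is an illustration only: `mean_spawn_time` is a hypothetical helper, and it times a trivial Python subprocess as a stand-in so that it runs without a TeX installation.

```python
import subprocess
import sys
import time


def mean_spawn_time(command, repeats=5):
    """Return the mean wall-clock seconds to run `command` to completion."""
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run(command,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / repeats


if __name__ == '__main__':
    # Stand-in command; substitute e.g. ['kpsewhich', 'cmr10.tfm']
    # on a machine where kpsewhich is installed.
    print(mean_spawn_time([sys.executable, '-c', 'pass']))
```

Multiplying the result by the number of lookups made while parsing one dvi file gives a rough upper bound on what caching could save.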

@tacaswell tacaswell modified the milestones: v3.0, v3.1 Jul 9, 2018
@tacaswell
Member

Moving to 3.1, this should get lots of use before we release it.

@anntzer
Contributor

anntzer commented Oct 4, 2018

I feel like we should investigate a bit more what makes kpsewhich so slow on OSX before adding more cache layers on our side.
Calling kpsewhich --debug=-1 ... gives a bit more output on what it does, can someone report what this looks like on (a slow) OSX?

@jkseppan
Member Author

jkseppan commented Oct 6, 2018

Doesn't look like this is likely to go in, so better to close.
