Default / lazy parser not cached #253
note: I'm leaving this comment for posterity and the information in it is technically true, but it's a reply to a complete misunderstanding of the report, so you can skip it if you're facing the same issue. The comments afterwards are the investigation "proper", and the fix has been released as part of 1.0.1.
For what that's worth, only the basic python parser is cached by default: on my benchmarks, while a (sufficiently large) cache does improve the performance of the native parsers, I didn't consider the additional memory really worth it given what little effect it had on what I considered a real-world benchmark. Since you're looping on a single user agent string, it makes sense that the one cached parser would be faster than the two natives, as it's basically an ideal situation for a cache (once the UA has been parsed once, future calls are just a cache hit). However, the difference in scale you report is significantly larger than I would have expected, and so is the difference between re2 and regex.

I'll have to see how ipython interacts with things when I'm able. In the meantime, could you provide details for the machine you're running things on (e.g. CPU model / architecture)? I've mostly been benching on my dev machine (it might have been smart to have some benches I could run on GHA, I should do that).

And could you try the bench script provided with ua-parser? Most of my benchmarking has been done using https://raw.githubusercontent.com/ua-parser/uap-python/refs/heads/master/samples/useragents.txt so I'd be grateful if you could download that file and run the bench script on it.
Then maybe try to run a pared down configuration on id.txt (that's the same user agent repeated 100000 times):
Note that it might take a fairly long time (possibly minutes).
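In case the bundled bench script is awkward to get running, a rough stand-in can be cobbled together from the public API shown elsewhere in this thread. This is only a sketch under assumptions (that `ua_parser.Parser` exposes a `parse()` method mirroring the module-level `ua_parser.parse`, and that the regex backend is installed); it is not the project's bench script:

```python
# Rough stand-in for the bundled bench script (a sketch only, not the real
# bench): times one resolver over the downloaded sample file.
import time

import ua_parser
from ua_parser import loaders


def bench_file(resolver, path):
    parser = ua_parser.Parser(resolver)
    with open(path, encoding="utf-8") as f:
        uas = [line.strip() for line in f if line.strip()]
    start = time.perf_counter()
    for ua in uas:
        parser.parse(ua)  # assumed to mirror the module-level ua_parser.parse
    elapsed = time.perf_counter() - start
    print(f"{type(resolver).__module__}: {elapsed:.2f}s for {len(uas)} user agents")


if __name__ == "__main__":
    # e.g. the rust-backed resolver, built from the lazily loaded builtin rules,
    # run over a local download of the useragents.txt sample linked above
    bench_file(ua_parser.regex.Resolver(loaders.load_lazy_builtins()), "useragents.txt")
```

Swapping in the re2 or basic resolvers mentioned above should make the relative cost of the backends visible over the full sample file.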
For context, my real-world use-case is here: https://github.com/Lookyloo/lookyloo/blob/main/lookyloo/helpers.py#L389

Either way, it's not really relevant in this context, and the number of UAs I need to parse doesn't require anything highly performant, but I have long-running processes so I'm happy to use the rust parser, if I can make sure it is loaded only once. And I think this is the way to go (?):

import ua_parser
base = ua_parser.regex.Resolver(ua_parser.loaders.load_lazy_builtins())
ua_parser.parser = ua_parser.Parser(base)
ua_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
%timeit ua_parser.parse(ua_string).with_defaults()

CPU on my machine: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz / x86_64

Benchmarks (on my machine)
Oooh, I see now, I completely misunderstood your report (probably because mentions of caching kinda prime me, given how much time I've spent on that), I'm dreadfully sorry: you're talking about caching the parser itself after lazily instantiating it! At a glance, it looks like I broke the parser memoization in #230: I forgot to keep the assignment of the lazily created parser onto the `parser` global.

Thanks for the report! And sorry again for the misunderstanding.
That looks about right, yes. There should actually be a
Do not worry at all! I should have explained my issue better. For now, I'll just use the default parser (as it is not blocking), but will go back to the rust one as soon as the bug is fixed: as it's for long-running processes, it makes sense to have one long(ish) initialization and quick parsing.
Reported by @Rafiot: the lazy parser is not memoised. This has limited effect on the basic / pure Python parser as its initialisation is trivial, but it *significantly* impacts the re2 and regex parsers as they need to process the regexes into a filter tree.

The memoization was mistakenly removed in ua-parser#230: while refactoring initialisation I removed the setting of the `parser` global.

- add a test to ensure the parser is correctly memoized, not re-instantiated every time
- reinstate setting the global
- add a mutex on `__getattr__`: it should only be used on first access, and this avoids two threads creating an expensive parser at the same time (which is a waste of CPU)

Fixes ua-parser#253
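For readers who haven't seen the pattern, here is a minimal sketch of the kind of memoization the commit describes (a PEP 562 module-level `__getattr__`, a `parser` global, and a lock); it is an illustration of the technique, not the actual uap-python source:

```python
# memoized_parser.py -- a minimal sketch of the memoization described above,
# not the actual uap-python source.
import threading

_lock = threading.Lock()


def _build_parser():
    # Stand-in for the expensive construction (building the regex filter tree
    # for the re2 / regex resolvers takes noticeable time).
    return object()


def __getattr__(name):
    # PEP 562 module __getattr__: only called while the attribute is missing
    # from the module globals, so storing the parser in globals() memoizes it
    # and later lookups never come back through here.
    if name == "parser":
        with _lock:
            # Re-check under the lock: another thread may have built it first.
            if "parser" not in globals():
                globals()["parser"] = _build_parser()
            return globals()["parser"]
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```

With something like that in place, accessing the module's `parser` attribute builds the parser exactly once, which is the behaviour the lazy module-level `parse` entry point relies on.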
FWIW I just published 1.0.1 which should fix the issue: using your test case (kinda, I'm just using timeit at the CLI), on 1.0.0 I get:

Which does track with your observations, at least in terms of scaling. With 1.0.1 off of PyPI:
And thanks yet again for the report, and sorry for the trouble.
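For anyone wanting to rerun that kind of before/after comparison without ipython, the stdlib timeit module is enough. A minimal sketch (the user agent string is the one from the earlier comment, and only the module-level `ua_parser.parse` entry point is exercised; this is not the exact command used above):

```python
import timeit

import ua_parser

ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36"
)

# On 1.0.0 the lazily instantiated parser was rebuilt on every call; on 1.0.1
# it is built once and memoized, so the per-call time drops dramatically.
print(timeit.timeit(lambda: ua_parser.parse(ua), number=1000))
```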
Excellent, thank you very much, it works as expected!
I'm guessing I'm doing something wrong, because it makes very little sense when reading the doc, but in short, on my machine, the slowest parser is `ua-parser-rs`. That's what I'm doing in ipython (but I have the same results just with the python interpreter, it's just easier to time it):

- with `ua-parser-rs`: 734 ms ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- with `google-re2`: 188 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- without `ua-parser-rs` and `google-re2`: 1.07 ms ± 33.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I'm on ubuntu 24.10 with Python 3.12.7.

Do you have any idea what's going on there? Something related to the cache being reloaded on every call? I couldn't find a way to avoid that.