Index text snippets in regexes to speed up user agent parsing #26
…ll of them. Also switch to re2 instead of boost::regex, for better performance. All in all, this amounts to the user agent parsing taking about a tenth of the time.
Thanks, the performance win is really nice! I left a first round of comments, mostly around using more modern C++. I also have a few high level questions:
- What would be the win by just switching to re2 (if you can measure it easily)?
- Have you tried to cache the results from this library as an alternative to make it faster? Back when I needed this for my use case, caching worked very well - the number of distinct user agent strings wasn't that high and hits were in the 99%+ range.
Also, if you could run tests with ASAN / TSAN to make sure it's still clean, that'd be helpful.
Thank you for your feedback! Just switching to re2 roughly halves the processing time, again measured with actual production user agent strings. So I guess the performance gain from indexing the snippets is about a factor of five. It should also be somewhat more scalable, since adding more expressions won't affect the processing time as much. I agree that caching the results would quickly cover most of the cases, but we wanted something that was stateless and immediately available. With caching, you also have to handle thread safety efficiently, whereas here the structure is read-only. I've run ASan, UBSan and TSan on it (both before and after the changes from your feedback), and there were no issues.
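To illustrate the idea behind indexing snippets, here is a minimal sketch, not the code from this PR. It assumes each expression comes with a literal snippet that any matching input must contain, and it uses a plain linear scan with std::string::find() where the PR uses a proper index. The names SnippetIndex, IndexedPattern and firstMatch are hypothetical, and std::regex stands in for re2 to keep the example free of external dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the uap-cpp internals.
struct IndexedPattern {
  std::string snippet;  // literal text every match must contain ("" = unknown)
  std::regex regex;     // the full, comparatively expensive expression
};

class SnippetIndex {
 public:
  void add(const std::string& snippet, const std::string& pattern) {
    patterns_.push_back({snippet, std::regex(pattern)});
  }

  // Returns the position of the first pattern (in insertion order) that
  // matches `input`, or -1. If `regexesRun` is non-null, it receives the
  // number of regexes that actually had to be evaluated.
  int firstMatch(const std::string& input,
                 std::size_t* regexesRun = nullptr) const {
    std::size_t run = 0;
    int result = -1;
    for (std::size_t i = 0; i < patterns_.size(); ++i) {
      const IndexedPattern& p = patterns_[i];
      // Cheap rejection: skip the regex when its snippet is absent.
      // An empty snippet is always "found", so such patterns always run.
      if (input.find(p.snippet) == std::string::npos) continue;
      ++run;
      if (std::regex_search(input, p.regex)) {
        result = static_cast<int>(i);
        break;
      }
    }
    if (regexesRun != nullptr) *regexesRun = run;
    return result;
  }

 private:
  std::vector<IndexedPattern> patterns_;
};
```

Once built, the structure is only read, so several threads can call firstMatch() on the same instance without locking, which is the property contrasted with caching above.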
Looks good, thanks! Not blocking this PR, but could you put together a benchmark? I'd like to protect this performance win for the future.
This PR contains changes to address performance issues in the user agent parsing, and stems from a development effort at Dailymotion. The external interface has not changed, and calling it should give the same results, approximately ten times faster (tested with a set of production traffic user agent strings).
The code introduced by this PR is mainly contained within the internal/ directory, to avoid adding too much clutter to the root directory.
There are two main changes:
… (?!) operator), which is not used in the regexes.yaml file (only internally in the uap-cpp code, where it is easily replaced by a std::string::find() call).
A few changes were also made to the Makefile, to integrate it more easily into our build system.
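As an illustration of replacing a negative lookahead with std::string::find(), here is a small sketch. RE2 does not support lookahead, so an expression such as Android(?!.*Mobile) has to be rewritten. The pattern and the function names are hypothetical examples, not the actual expressions used inside uap-cpp. The std::regex version is included only to show the behaviour being reproduced, and the two agree for single-line input.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Reference behaviour with a lookahead (works in std::regex, not in RE2):
// true when some "Android" is not followed later by "Mobile".
bool androidNotMobileLookahead(const std::string& ua) {
  static const std::regex re("Android(?!.*Mobile)");
  return std::regex_search(ua, re);
}

// Same check without lookahead. If any occurrence of "Android" has no
// "Mobile" after it, then the last occurrence has none either, so it is
// enough to inspect the last occurrence.
bool androidNotMobileFind(const std::string& ua) {
  static const std::string kAndroid = "Android";
  const std::size_t pos = ua.rfind(kAndroid);
  if (pos == std::string::npos) return false;
  return ua.find("Mobile", pos + kAndroid.size()) == std::string::npos;
}
```

The find-based version needs no regex engine at all for this check, which is why dropping the operator costs little.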