Index text snippets in regexes to speed up user agent parsing #26

Merged
merged 12 commits into from
Oct 4, 2019

Conversation

DailyMats
Contributor

This PR addresses performance issues in user agent parsing and stems from a development effort at Dailymotion. The external interface has not changed, and calling it should give the same results approximately ten times faster (tested with a set of user agent strings from production traffic).

The code introduced by this PR is mainly contained within the internal/ directory, to avoid adding too much clutter to the root directory.

There are two main changes:

  • Instead of going through every regex, build an index of text snippets that need to be present for a regex to match, and only call the regexes where this requirement is fulfilled for the given input string. For more details, see internal/README.md.
  • Use re2 instead of boost::regex, for better performance. The only thing missing from re2 is negative lookahead (the (?!) operator), which is not used in the regexes.yaml file (only internally in the uap-cpp code, where it is easily replaced by a std::string::find() call).
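
To illustrate the first point, here is a minimal sketch of the pre-filtering idea. It is not the code in internal/: the names `IndexedRegex` and `SnippetIndex` are hypothetical, the snippets are supplied by hand rather than derived from the patterns, the lookup is a plain linear scan rather than a real index, and `std::regex` stands in for re2 so that the example builds with the standard library alone.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry: a regex plus a literal snippet that must be present
// in the input for the regex to have any chance of matching.
struct IndexedRegex {
  std::string required_snippet;  // supplied by hand in this sketch
  std::regex pattern;            // std::regex as a stand-in for re2
  std::string family;            // label returned on a match
};

class SnippetIndex {
 public:
  void add(std::string snippet, const std::string& pattern, std::string family) {
    entries_.push_back(
        {std::move(snippet), std::regex(pattern), std::move(family)});
  }

  // Indices of the entries whose required snippet occurs in the input.
  // Only these regexes need to be evaluated at all.
  std::vector<std::size_t> candidates(const std::string& ua) const {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < entries_.size(); ++i) {
      if (ua.find(entries_[i].required_snippet) != std::string::npos) {
        out.push_back(i);
      }
    }
    return out;
  }

  // Family of the first candidate (in insertion order) whose regex matches,
  // or an empty string if none does.
  std::string match(const std::string& ua) const {
    for (std::size_t i : candidates(ua)) {
      if (std::regex_search(ua, entries_[i].pattern)) {
        return entries_[i].family;
      }
    }
    return {};
  }

 private:
  std::vector<IndexedRegex> entries_;
};
```

With entries registered for, say, "Firefox/" and "Chrome/", an input containing only "Chrome/" triggers a single regex evaluation instead of two. The structure is never modified after it is built, so lookups are read-only.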

A few changes were also made to the Makefile, to integrate it more easily into our build system.

@DailyMats DailyMats force-pushed the index-snippets branch 2 times, most recently from 21c1455 to 4fbb467 Compare October 1, 2019 15:51
…ll of them

Also switch to re2 instead of boost::regex, for better performance.

All in all, this amounts to the user agent parsing taking about a tenth of the time.
Collaborator

@asuhan asuhan left a comment

Thanks, the performance win is really nice! I left a first round of comments, mostly around using more modern C++. I also have a few high-level questions:

  1. What would be the win by just switching to re2 (if you can measure it easily)?
  2. Have you tried to cache the results from this library as an alternative to make it faster? Back when I needed this for my use case, caching worked very well - the number of distinct user agent strings wasn't that high and hits were in the 99%+ range.

Also, if you could run tests with ASAN / TSAN to make sure it's still clean, that'd be helpful.

@DailyMats DailyMats force-pushed the index-snippets branch 2 times, most recently from a320c84 to 95e33d8 Compare October 3, 2019 08:39
@DailyMats
Contributor Author

Thank you for your feedback!

Just switching to re2 roughly halves the processing time, again measured with actual production user agent strings. So I guess indexing the snippets accounts for a further speedup of about five times. It should also scale better, since adding more expressions won't affect the processing time as much.

I agree that caching the results would quickly cover most of the cases, but we wanted something that was stateless and immediately available. With caching, you also have issues with how to handle thread safety efficiently, whereas here the structure is read-only.

I've run ASan, UBSan and TSan on it (both before and after the changes from your feedback), and there were no issues there.

@asuhan
Collaborator

asuhan commented Oct 4, 2019

Looks good, thanks!

Not blocking this PR, but could you put together a benchmark? I'd like to protect this performance win for the future.
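
For reference, a benchmark along these lines could start from something like the sketch below. It is only an assumption of what such a harness might look like: `BenchResult` and `run_benchmark` are hypothetical names, and the parser is passed in as a callable so the sketch does not depend on any particular uap-cpp signature.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result type: number of parse calls and mean time per call.
struct BenchResult {
  std::size_t calls;
  double ns_per_call;
};

// Runs `parse` over every user agent string `repetitions` times and reports
// the mean wall-clock time per call, measured with a monotonic clock.
BenchResult run_benchmark(
    const std::function<void(const std::string&)>& parse,
    const std::vector<std::string>& user_agents,
    std::size_t repetitions) {
  using clock = std::chrono::steady_clock;
  std::size_t calls = 0;
  const auto start = clock::now();
  for (std::size_t r = 0; r < repetitions; ++r) {
    for (const auto& ua : user_agents) {
      parse(ua);
      ++calls;
    }
  }
  const std::chrono::duration<double, std::nano> elapsed = clock::now() - start;
  const double per_call =
      calls != 0 ? elapsed.count() / static_cast<double>(calls) : 0.0;
  return {calls, per_call};
}
```

Feeding it a fixed file of user agent strings and comparing `ns_per_call` against a stored baseline would be one way to catch regressions; the callable should keep its result observable (for example by accumulating a checksum) so the compiler cannot optimize the parsing away.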
