[MRG+1] Ngram Performance #7567
Conversation
* Improve ngram performance by binding methods outside the loop.
* Create unigrams without slicing.
This is great! Particularly since the case of `ngram=[1, ..]` is frequently used, this would definitely be useful!
```python
tokens = []
if min_n == 1:
    tokens = list(original_tokens)
    min_n += 1
```
Maybe just add a comment here to say that this does the same thing as the first iteration of the loop below (which is then skipped), as that's not entirely clear from reading it.
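For illustration, here's a rough, self-contained sketch of how the fast path stands in for the n=1 iteration of the general loop shown further down (the function name is hypothetical, and `range` stands in for the Python 2 `xrange` used in the diff):

```python
def word_ngrams(original_tokens, min_n, max_n):
    tokens = []
    if min_n == 1:
        # Unigrams are just the tokens themselves: copying the list does
        # the same work as the n=1 pass of the loop below, without
        # building one-element slices, so that pass is skipped.
        tokens = list(original_tokens)
        min_n += 1

    n_original_tokens = len(original_tokens)
    # bind methods outside of the loop to reduce overhead
    tokens_append = tokens.append
    space_join = " ".join
    for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
        for i in range(n_original_tokens - n + 1):
            tokens_append(space_join(original_tokens[i: i + n]))
    return tokens
```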
```python
min_n, max_n = self.ngram_range
if min_n == 1:
    ngrams = list(text_document)
    min_n += 1
```
Same comment as above
* Added code comment to explain using list() for unigrams.
I'm happy with the unigram fast path. I don't think binding methods reduces enough overhead to care. Please benchmark each change (of 1 and 2) separately.

@jnothman I think they are benchmarked separately in the link above, and method binding improves performance by more than 10-20%, surprisingly...

The benchmarks are in the order I committed the changes, so: the first test has no changes, the second has only the method binding, and the third has method binding and the unigram fast path. I'll rerun the unigram fast path benchmark separately later tonight, but the method binding on its own was 13-20% faster.

I am surprised that the method binding does anything at all. Can you post the actual times and how you ran the benchmarks?

@NelleV The link to the benchmarks is in the "Any other comments?" section of the original post above.
```diff
 for n in xrange(min_n,
                 min(max_n + 1, n_original_tokens + 1)):
     for i in xrange(n_original_tokens - n + 1):
-        tokens.append(" ".join(original_tokens[i: i + n]))
+        tokens_append(space_join(original_tokens[i: i + n]))
```
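As a rough illustration of the attribute-lookup cost under discussion, a standalone micro-benchmark might look like the following (synthetic tokens; all names are illustrative, not from the PR):

```python
import timeit

setup = "original_tokens = ['tok%d' % i for i in range(1000)]"

# attribute lookups repeated on every iteration
unbound = """
tokens = []
for i in range(len(original_tokens) - 1):
    tokens.append(" ".join(original_tokens[i: i + 2]))
"""

# methods bound to locals once, outside the loop
bound = """
tokens = []
tokens_append = tokens.append
space_join = " ".join
for i in range(len(original_tokens) - 1):
    tokens_append(space_join(original_tokens[i: i + 2]))
"""

print("unbound: %.3fs" % timeit.timeit(unbound, setup=setup, number=1000))
print("bound:   %.3fs" % timeit.timeit(bound, setup=setup, number=1000))
```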
At a quick glance I would say that most of the time will be spent on the slicing `original_tokens[i: i + n]`; it might be possible to swap the two loops to get something like (roughly):

```python
for i in ...:
    _current = ''
    for _tok in original_tokens[i:i + n]:
        _current = _current + ' ' + _tok
    tokens_append(_current)
```

thus only iterating over the slice, which should likely be faster as well. Would need to be profiled though.
Actually there seems to be an even simpler way, using the following:

```python
import itertools

def iter_window(seq, n):
    l = list(seq[:n])
    append = l.append
    for item in itertools.islice(seq, n, len(seq)):
        yield tuple(l)
        l.pop(0)
        append(item)
    yield tuple(l)
```

and then replacing the inner loop with:

```python
for n in range(a, b):
    for _tks in iter_window(original_tokens, n):
        tokens_append(space_join(_tks))
```

I get a ~2+ speedup (increasing with the length of the n-grams), and the sliding-window implementation can likely be made much more efficient, maybe using a deque.
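For what it's worth, a minimal sketch of that deque idea (my own illustration, not benchmarked here): a `collections.deque` with `maxlen=n` evicts the oldest element on append in O(1), avoiding the O(n) `list.pop(0)`:

```python
from collections import deque
from itertools import islice

def iter_window_deque(seq, n):
    # seq is assumed to be a sequence (e.g. a token list), not a
    # one-shot iterator, since it is sliced twice from known offsets.
    window = deque(islice(seq, n), maxlen=n)
    if len(window) < n:
        return
    yield tuple(window)
    for item in islice(seq, n, None):
        # maxlen=n makes this append evict the oldest token in O(1)
        window.append(item)
        yield tuple(window)
```

It drops into the same inner loop as `iter_window` above.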
Okay, I'm somewhat persuaded. @Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

In terms of text classification speed, I've also been thinking of creating a wrapper for SpaCy which may smoothly enable the creation of lemmatised, lemma + POS, named entity and dependency path features. I'm not yet sure of the ideal interface, and it would be designed as a separate repository for sklearn-contrib. If anyone's interested in coding this up, I'm happy to share my thoughts.
Sure, I was just trying to show that being careful about the algorithm implementation could also bring a significant boost, and that micro-optimisations are not the only solution.

@jnothman have you tried pattern? http://www.clips.ua.ac.be/pattern I wasn't convinced by spacy.

Also, isn't the right data structure for this a trie (#2639)?
A trie will fix different problems. SpaCy happens to have been made by a former colleague, so I might have my biases.

I think it depends how much weight you put on parsing, really...

Thanks, that's great. I'm not able to invest much thought into this atm.
The benchmarks linked above by @jtdoepke can be summarized as follows: "1. method binding" + "2. unigram shortcut" (the current code in this PR) outperform the other proposed methods (generators with a sliding window, etc.) for unigram words and character n-grams, but not for word n-grams.

Bottom line: this shows that there is still room for performance improvement in the n-gram vectoriser, either by rewriting some sections in Cython or by using generators. However, maybe that would deserve a separate PR?

@jtdoepke Would you mind renaming this PR to "[MRG] Ngram Performance" to attract attention to it, so it can be added to the backlog of PRs to be reviewed? Thanks.

Update: also related #7107
Were there not more enticing changes to consider, I'd say this looked good for merge. Thanks for all the benchmarks.
```python
# bind method outside of loop to reduce overhead
ngrams_append = ngrams.append
```
```python
for n in xrange(min_n, min(max_n + 1, text_len + 1)):
```
Could we please get a benchmark on writing this as a list comprehension, like the following?

```python
ngrams.extend(text_document[i: i + n]
              for n in xrange(min_n, min(max_n + 1, text_len + 1))
              for i in xrange(text_len - n + 1))
```
Also benchmark returning a generator, i.e. using itertools.chain over generator expressions, to see if the counting can benefit from not materialising the list.
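A sketch of what that might look like for the character n-gram path (my reading of the suggestion, not code from the PR; `text_document`, `min_n` and `max_n` as in the snippet above):

```python
from itertools import chain

def char_ngrams_lazy(text_document, min_n, max_n):
    text_len = len(text_document)
    # One generator expression per n-gram length, chained lazily, so a
    # caller that only counts n-grams never materialises the full list.
    return chain.from_iterable(
        (text_document[i: i + n] for i in range(text_len - n + 1))
        for n in range(min_n, min(max_n + 1, text_len + 1))
    )
```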
Sure. I tried several variations of list comprehension and generators (and a couple other shortcuts). Looks like creating all the ngrams as a single list comprehension is faster than a generator.
I have to admit I am not an expert on this, but in order to convince me you will have to use:

Ideally I would rather have a benchmark script than a notebook.
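For example, a minimal standalone benchmark script (illustrative only: the file name, corpus size and parameter grid are my own choices, not from this thread) could look like:

```python
# bench_ngrams.py -- illustrative benchmark sketch
import timeit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset="train").data[:2000]

for analyzer, ngram_range in [("word", (1, 1)), ("word", (1, 2)),
                              ("char", (2, 4))]:
    vec = CountVectorizer(analyzer=analyzer, ngram_range=ngram_range)
    timer = timeit.Timer(lambda: vec.fit_transform(docs))
    # best of 3 single runs, since one fit_transform is already slow
    best = min(timer.repeat(repeat=3, number=1))
    print("%s %s: %.2fs" % (analyzer, ngram_range, best))
```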
I benchmarked this PR anew with non-synthetic data (20 newsgroups), and it appears that some of @lesteve's concerns were justified: in these benchmarks, this PR has no performance impact at all in some configurations.

Of the different components or optimizations considered (unigram shortcut, method binding, list comprehension), the first two are the ones (this PR exactly) that bring the best performance. Across different vectorizers and options this PR brings a 5% speedup, and the performance is at worst the same as before.

So the net effect of this PR is positive, but considering that the optimized functions generally account for a relatively small fraction of total run time, it is probably not worth spending time on additional optimizations here. Therefore I would suggest that this PR be merged in its current state. What do you think?
I can't look at it in full now, but I'm +1 for merging optimisations that do not substantially reduce readability (which already isn't great, so it's hard to reduce it a lot).
…On 9 Jun 2017 10:17 pm, "Loïc Estève" wrote:

> Thanks a lot @rth for the benchmark, is it worth adding it to the benchmarks folder?
>
> All in all I think this is reasonable to merge. Some small improvement. The code changes are not that controversial.
>
> @jnothman what do you think?
Thanks @jtdoepke. We could add an entry in what's new if you'd like. Please suggest a text and I'll commit it in.
Thanks.

…On 22 Jun 2017 7:07 am, "Jaye" wrote:

> Small performance improvement to n-gram creation in :mod:`sklearn.feature_extraction.text` by binding methods outside of loops and taking a fast path for unigrams.
Reference Issue
What does this implement/fix? Explain your changes.
A couple of small changes to make ngram generation a little bit faster.
* Bind the `tokens.append()` and `" ".join()` methods outside of the for-loops.
* Create unigrams without slicing, using `tokens = list(original_tokens)`.

Any other comments?
Benchmarks