
[MRG+1] Ngram Performance #7567


Merged: 3 commits merged into scikit-learn:master on Jun 18, 2017

Conversation

@jtdoepke (Contributor) commented Oct 4, 2016

Reference Issue

What does this implement/fix? Explain your changes.

A couple of small changes to make ngram generation a little bit faster.

  1. Bind the tokens.append() and " ".join() methods outside of the for-loops.
  2. For unigrams, skip slicing entirely and just use tokens = list(original_tokens).
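Taken together, the two changes amount to roughly the following (a condensed sketch, not the exact merged code; the function name _word_ngrams_sketch is hypothetical, the variable names follow the diffs quoted later in this thread, and range stands in for the Python 2 xrange used at the time):

def _word_ngrams_sketch(original_tokens, min_n, max_n):
    # Change 2: for unigrams, a plain copy replaces the per-token
    # slicing; the n == 1 pass of the loop below is then skipped.
    if min_n == 1:
        tokens = list(original_tokens)
        min_n += 1
    else:
        tokens = []

    n_original_tokens = len(original_tokens)
    # Change 1: bind the methods once, outside the loops, to avoid
    # repeated attribute lookups on every iteration.
    tokens_append = tokens.append
    space_join = " ".join

    for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
        for i in range(n_original_tokens - n + 1):
            tokens_append(space_join(original_tokens[i: i + n]))
    return tokens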

Any other comments?

Benchmarks

* Improve ngram performance by binding methods outside the loop.
* Create unigrams without slicing.
@rth (Member) left a review comment:

This is great! Particularly since ngram_range=(1, ...) is such a frequently used case, this would definitely be useful!

tokens = []
if min_n == 1:
    tokens = list(original_tokens)
    min_n += 1

Maybe just add a comment here to say that this does the same thing as the first iteration of the loop below (which is then skipped), as that's not entirely clear from reading it.

min_n, max_n = self.ngram_range
if min_n == 1:
    ngrams = list(text_document)
    min_n += 1

Same comment as above

* Added code comment to explain using list() for unigrams.
@jnothman (Member) commented Oct 4, 2016

I'm happy with the unigram fast path. I don't think binding methods reduces enough overhead to care. Please benchmark each change (1 and 2) separately.

@rth (Member) commented Oct 4, 2016

@jnothman I think they are benchmarked separately in the link above, and method binding alone improves performance by 10-20%, surprisingly...

@jtdoepke (Contributor, author) commented Oct 4, 2016

The benchmarks are in the order I committed the changes, so: the first test has no changes, the second has only the method binding, and the third has method binding plus the unigram fast path.

I'll rerun the unigram fast path benchmark separately later tonight, but the method binding on its own was 13-20% faster.

@NelleV (Member) commented Oct 4, 2016

I am surprised that the method binding does anything at all. Can you post the actual times and describe how you ran the benchmarks?
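For reference, a minimal timeit harness for isolating the method-binding change might look like this (a hypothetical sketch, not the benchmark from the linked gist):

import timeit

setup = "tokens = [str(i) for i in range(1000)]"

plain = """
out = []
for i in range(len(tokens) - 1):
    out.append(" ".join(tokens[i:i + 2]))
"""

bound = """
out = []
out_append = out.append
space_join = " ".join
for i in range(len(tokens) - 1):
    out_append(space_join(tokens[i:i + 2]))
"""

# Compare per-loop cost with and without the bound methods.
print(timeit.timeit(plain, setup=setup, number=1000))
print(timeit.timeit(bound, setup=setup, number=1000))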

@rth (Member) commented Oct 4, 2016

@NelleV The link to benchmarks is in the "Any other comments?" section in the original post above.

for n in xrange(min_n,
                min(max_n + 1, n_original_tokens + 1)):
    for i in xrange(n_original_tokens - n + 1):
-       tokens.append(" ".join(original_tokens[i: i + n]))
+       tokens_append(space_join(original_tokens[i: i + n]))
@Carreau (Contributor) commented on the diff:

At a quick glance, I would say that most of the time will be spent on the slicing original_tokens[i: i + n]; it might be possible to swap the two loops to get something like (roughly):

for i in xrange(n_original_tokens):
    _current = ''
    for _tok in original_tokens[i:i + max_n]:
        _current = _current + ' ' + _tok if _current else _tok
        tokens_append(_current)

This iterates over each slice only once, which should likely be faster as well.
Would need to be profiled though.
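For illustration, a quick sanity check (hypothetical, not from the thread) that the swapped loops produce the same multiset of n-grams as the slicing version, just in a different order:

tokens = "the quick brown fox".split()
max_n = 3

# Slicing version: grouped by n-gram length.
sliced = [" ".join(tokens[i:i + n])
          for n in range(1, max_n + 1)
          for i in range(len(tokens) - n + 1)]

# Swapped loops: grouped by starting position, built incrementally.
incremental = []
for i in range(len(tokens)):
    current = ""
    for tok in tokens[i:i + max_n]:
        current = current + " " + tok if current else tok
        incremental.append(current)

assert sorted(sliced) == sorted(incremental)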

@Carreau (Contributor) commented:

Actually there seems to be an even simpler way, using the following:

import itertools

def iter_window(seq, n):
    l = list(seq[:n])
    append = l.append
    for item in itertools.islice(seq, n, len(seq)):
        yield tuple(l)
        l.pop(0)
        append(item)
    yield tuple(l)

and then replacing the inner loop with:

for n in range(min_n, max_n + 1):
    for _tks in iter_window(original_tokens, n):
        tokens_append(space_join(_tks))

I get a ~2x or better speedup (increasing with the length of the n-grams), and the sliding-window implementation could likely be made more efficient still, maybe using a collections.deque.
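As a quick check, iter_window above yields the expected windows; below that is a sketch of the deque variant hinted at (iter_window_deque is a hypothetical name, not code from the thread):

>>> list(iter_window(["a", "b", "c", "d"], 2))
[('a', 'b'), ('b', 'c'), ('c', 'd')]

import itertools
from collections import deque

def iter_window_deque(seq, n):
    # deque(maxlen=n) discards the oldest item automatically on append,
    # avoiding the O(n) cost of list.pop(0) in the version above.
    window = deque(seq[:n], maxlen=n)
    yield tuple(window)
    for item in itertools.islice(seq, n, len(seq)):
        window.append(item)
        yield tuple(window)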

@jnothman (Member) commented Oct 5, 2016

Okay, I'm somewhat persuaded. @Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

In terms of text classification speed, I've also been thinking of creating a wrapper for Spacy which may smoothly enable the creation of lemmatised, lemma + POS, named entity and dependency path features. I'm not yet sure of the ideal interface, and it would be designed as a separate repository for sklearn-contrib. If anyone's interested in coding this up, I'm happy to share my thoughts.

@Carreau (Contributor) commented Oct 5, 2016

@Carreau we'd probably get more gains in the nested loop speed by rewriting in Cython (which would also avoid binding issues if static typing were used correctly).

Sure, I was just trying to show that being careful about the algorithm implementation can also give a significant boost, and that micro-optimisations are not the only solution.

@amueller (Member) commented Oct 5, 2016

@jnothman have you tried pattern (http://www.clips.ua.ac.be/pattern)? I wasn't convinced by spaCy.

@amueller (Member) commented Oct 5, 2016

Also, isn't the right data structure for this a trie #2639?

@jnothman (Member) commented Oct 5, 2016

A trie would fix different problems.

SpaCy happens to have been made by a former colleague, so I might have my biases. Very valuable in its parsing; not sure about the rest.

@jnothman (Member) commented Oct 5, 2016

I think it depends how much weight you put on parsing, really... One advantage of spaCy is that it naturally produces vector spaces. Although it doesn't naturally produce vector spaces for conjunctions / n-grams, which is something I'm a little concerned about.

@jnothman (Member) commented Oct 5, 2016

Anyway, @Carreau, a PR is welcome. @jtdoepke, I shall review this properly soon.

@jtdoepke (Contributor, author) commented Oct 5, 2016

Here are some additional benchmarks (https://gist.github.com/jtdoepke/bb19edf8e4678246a15ba48b2c47ce3f) with everything benchmarked separately, including @Carreau's suggestions.

@jnothman (Member) commented Oct 6, 2016

Thanks, that's great. I'm not able to invest much thought into this atm, but I'm wondering: what is it about chars that makes them worse for the sliding window, unlike word n-grams? If there is such a disparity, does that suggest the result depends on properties of the documents?

@rth (Member) commented Jan 30, 2017

The benchmarks linked above by @jtdoepke can be summarized in the following table:

[image: best_performance, a summary table of the benchmark results; the arrows denote the best performing cases]

The "1. method binding" + "2. Unigram shortcut" (the current code in this PR) outperform other proposed method (generators with sliding window etc), for unigram words and character n-grams, but not for word n-grams.

Bottom line is that this shows that there is still room for performance improvement for the n-gram vectoriser either by rewriting some sections in cython or by using generators. However, maybe that would deserve a separate PR?
Meanwhile this PR does exactly what it claims to be doing: a consistent 10-20% speed improvement for word an character n-grams (with a few lines of code)...

@jtdoepke Would you mind renaming this PR to "[MRG] Ngram Performance" to attract attention to it and so it could be added to the backlog or PRs to be reviewed? Thanks.

Update: also related #7107

@jtdoepke changed the title from "Ngram Performance" to "[MRG] Ngram Performance" on Jan 30, 2017
@jnothman (Member) left a review comment:

Were there not more enticing changes to consider, I'd say this looks good for merge. Thanks for all the benchmarks.


# bind method outside of loop to reduce overhead
ngrams_append = ngrams.append

for n in xrange(min_n, min(max_n + 1, text_len + 1)):
@jnothman (Member) commented on the diff:

Could we please get a benchmark on writing this as a list comprehension?

ngrams.extend(text_document[i: i + n]
              for n in xrange(min_n, min(max_n + 1, text_len + 1))
              for i in xrange(text_len - n + 1))

@jnothman (Member) commented on the diff:

Also benchmark returning a generator, i.e. using itertools.chain over generator expressions, to see whether the counting can benefit from not materialising the list.
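A sketch of the generator variant being asked for (hypothetical; _char_ngrams_gen is an illustrative name, with text_document, min_n, max_n and text_len as in the surrounding code):

import itertools

def _char_ngrams_gen(text_document, min_n, max_n):
    text_len = len(text_document)
    # One generator expression per n-gram length, chained lazily:
    # nothing is materialised until the caller iterates (e.g. while
    # counting), so no intermediate list is built.
    return itertools.chain.from_iterable(
        (text_document[i: i + n] for i in range(text_len - n + 1))
        for n in range(min_n, min(max_n + 1, text_len + 1))
    )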

@jtdoepke (Contributor, author) replied:

Sure. I tried several variations of list comprehensions and generators (and a couple of other shortcuts). It looks like creating all the n-grams in a single list comprehension is faster than a generator.

@lesteve (Member) commented Mar 7, 2017

I have to admit I am not an expert on this, but in order to convince me you will need to:

  • use a more realistic benchmark; AFAICT your vocabulary is ~350 words long.
  • use %timeit for the timings; from a quick look, you seem to be implementing your own timeit by taking the mean of all the results, which is not the most robust way of timing.
  • profile (and possibly line-profile) to show where significant time is spent, to make sure we are optimizing the right lines of code.

Ideally I would have a benchmark script rather than a notebook.
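A minimal standalone script along those lines might look like this (a sketch, assuming the 20 newsgroups data that the later benchmark in this thread also uses):

import timeit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset="train").data

for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    # Average a few runs of a full fit_transform on real text.
    t = timeit.timeit(lambda: vec.fit_transform(docs), number=3)
    print(ngram_range, t / 3)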

@rth (Member) commented Jun 9, 2017

I benchmarked this PR anew with non-synthetic data (20 newsgroups) and it appears that some of @lesteve's concerns were justified.

In these benchmarks, this PR has no performance impact at all when ngram_range=(1, 1), but yields up to a 10% speedup for larger n-gram ranges, with both word and character n-grams. The full results are accessible here, obtained with this benchmark script. Profiling in the ngram_range=(1, 1) and ngram_range=(1, 2) cases shows that the optimized function _word_ngrams accounts for a relatively small fraction (<20%) of the total run time.
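That kind of profile can be reproduced with cProfile, e.g. (a sketch, not the exact script used for the linked results):

import cProfile
import pstats

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset="train").data
vec = CountVectorizer(ngram_range=(1, 2))

profiler = cProfile.Profile()
profiler.runcall(vec.fit_transform, docs)
# Sort by cumulative time and look for _word_ngrams in the output.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)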

Of the different optimizations considered (unigram shortcut, method binding, list comprehension), the first two, which are exactly what this PR implements, bring the best performance.

Across different vectorizers and options this PR brings a ~5% speedup, and performance is at worst the same as before. So the net effect of this PR is positive, but considering that the optimized functions generally account for a relatively small fraction of the total run time, it is probably not worth spending time on additional optimizations here.

Therefore I would suggest that this PR be merged in its current state. What do you think?
@lesteve @jnothman

@lesteve (Member) commented Jun 9, 2017

Thanks a lot @rth for the benchmark; is it worth adding it to the benchmarks folder?

All in all I think this is reasonable to merge: a small improvement, and the code changes are not that controversial.

@jnothman what do you think?

@lesteve changed the title from "[MRG] Ngram Performance" to "[MRG+1] Ngram Performance" on Jun 9, 2017
@jnothman (Member) commented Jun 10, 2017 via email

@jnothman merged commit d8e54d9 into scikit-learn:master on Jun 18, 2017
@jnothman (Member) commented:

Thanks @jtdoepke. We could add an entry in what's new if you'd like. Please suggest a text and I'll commit it in.

@jtdoepke (Contributor, author) suggested the following what's new entry:

Small performance improvement to n-gram creation in
:mod:`sklearn.feature_extraction.text` by binding methods outside of loops
and taking a fast path for unigrams.

@jtdoepke deleted the ngram_performance branch June 21, 2017
@jnothman (Member) commented Jun 21, 2017 via email

dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017