[MRG+1] Ngram Performance #7567


Merged
merged 3 commits on Jun 18, 2017
Changes from all commits
38 changes: 32 additions & 6 deletions sklearn/feature_extraction/text.py
@@ -133,12 +133,24 @@ def _word_ngrams(self, tokens, stop_words=None):
         min_n, max_n = self.ngram_range
         if max_n != 1:
             original_tokens = tokens
-            tokens = []
+            if min_n == 1:
+                # no need to do any slicing for unigrams
+                # just iterate through the original tokens
+                tokens = list(original_tokens)
+                min_n += 1
Member

Maybe just add a comment here to say that this does the same thing as the first iteration of the loop below (which is then skipped), as that's not entirely clear from reading it.
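For illustration only, the comment might read something like this (the wording is just a guess, not what was merged):

            if min_n == 1:
                # no need to do any slicing for unigrams
                # just iterate through the original tokens
                # (this is what the first, n == 1, pass of the loop below
                #  would build, so that pass is skipped via min_n += 1)
                tokens = list(original_tokens)
                min_n += 1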

+            else:
+                tokens = []
+
             n_original_tokens = len(original_tokens)
+
+            # bind method outside of loop to reduce overhead
+            tokens_append = tokens.append
+            space_join = " ".join
+
             for n in xrange(min_n,
                             min(max_n + 1, n_original_tokens + 1)):
                 for i in xrange(n_original_tokens - n + 1):
-                    tokens.append(" ".join(original_tokens[i: i + n]))
+                    tokens_append(space_join(original_tokens[i: i + n]))
Contributor

At a quick glance I would say that most of the time will be spent on the slicing original_tokens[i: i + n]; it might be possible to swap the two loops to get something like this (roughly):

for i in xrange(n_original_tokens):
    _current = original_tokens[i]
    tokens_append(_current)
    for _tok in original_tokens[i + 1: i + max_n]:
        _current = _current + ' ' + _tok
        tokens_append(_current)

That way each starting position is sliced only once, which should likely be faster as well.
Would need to be profiled though.
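A rough way to profile it could look like the sketch below (the token list and ngram range are made up, and the swapped version assumes min_n == 1, so treat the numbers as indicative only):

import timeit

setup = '''
tokens = ("the quick brown fox jumps over the lazy dog " * 50).split()
max_n = 4
'''

sliced = '''
ngrams = []
for n in range(1, max_n + 1):
    for i in range(len(tokens) - n + 1):
        ngrams.append(" ".join(tokens[i: i + n]))
'''

swapped = '''
ngrams = []
append = ngrams.append
for i in range(len(tokens)):
    current = tokens[i]
    append(current)
    for tok in tokens[i + 1: i + max_n]:
        current = current + " " + tok
        append(current)
'''

print("sliced :", min(timeit.repeat(sliced, setup, number=200)))
print("swapped:", min(timeit.repeat(swapped, setup, number=200)))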

Contributor

Actually there seems to be an even simpler way, using the following:

import itertools

def iter_window(seq, n):
    # yield a sliding window of length n over seq
    l = list(seq[:n])
    append = l.append
    for item in itertools.islice(seq, n, len(seq)):
        yield tuple(l)
        l.pop(0)
        append(item)
    yield tuple(l)

and then replacing the inner loop with:

for n in xrange(min_n, min(max_n + 1, n_original_tokens + 1)):
    for _tks in iter_window(original_tokens, n):
        tokens_append(space_join(_tks))

I get a ~2x or greater speedup (increasing with the length of the ngrams), and the sliding window implementation can likely be made much more efficient, maybe using a deque.
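For reference, a deque-based sliding window along the lines suggested above could look like this (the name iter_window_deque and the sketch are illustrative, not from the PR, and it has not been benchmarked):

from collections import deque
from itertools import islice

def iter_window_deque(seq, n):
    # maxlen=n makes append() drop the oldest item automatically
    window = deque(islice(seq, n), maxlen=n)
    yield tuple(window)
    for item in islice(seq, n, None):
        window.append(item)
        yield tuple(window)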


         return tokens

@@ -148,11 +160,21 @@ def _char_ngrams(self, text_document):
         text_document = self._white_spaces.sub(" ", text_document)
 
         text_len = len(text_document)
-        ngrams = []
         min_n, max_n = self.ngram_range
+        if min_n == 1:
+            # no need to do any slicing for unigrams
+            # iterate through the string
+            ngrams = list(text_document)
+            min_n += 1
Member

Same comment as above

+        else:
+            ngrams = []
+
+        # bind method outside of loop to reduce overhead
+        ngrams_append = ngrams.append
+
         for n in xrange(min_n, min(max_n + 1, text_len + 1)):
Member

Could we please get a benchmark on writing this as a list comprehension:

ngrams.extend(text_document[i: i + n]
              for n in xrange(min_n, min(max_n + 1, text_len + 1))
              for i in xrange(text_len - n + 1))

?

Member

Also benchmark returning a generator, i.e. use itertools.chain over generator expressions, to see if the counting can benefit from not materialising the list.
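For reference, the generator variant being suggested might look roughly like this (a sketch only; _char_ngrams_gen is a hypothetical standalone name, not the merged code):

from itertools import chain

def _char_ngrams_gen(text_document, min_n, max_n):
    # lazily chain one generator per ngram length instead of building a list
    text_len = len(text_document)
    return chain.from_iterable(
        (text_document[i: i + n] for i in range(text_len - n + 1))
        for n in range(min_n, min(max_n + 1, text_len + 1)))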

Contributor Author

Sure. I tried several variations of list comprehensions and generators (and a couple of other shortcuts). It looks like creating all the ngrams in a single list comprehension is faster than a generator.
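A comparison along these lines is one way such a check could be run (the input text and ngram range are made up; actual timings will vary):

import timeit
from itertools import chain

text = "antidisestablishmentarianism " * 200
text_len, min_n, max_n = len(text), 2, 5

def as_list():
    return [text[i: i + n]
            for n in range(min_n, min(max_n + 1, text_len + 1))
            for i in range(text_len - n + 1)]

def as_generator():
    # materialise at the end so both variants produce the same result
    return list(chain.from_iterable(
        (text[i: i + n] for i in range(text_len - n + 1))
        for n in range(min_n, min(max_n + 1, text_len + 1))))

print("list comprehension:", min(timeit.repeat(as_list, number=20)))
print("generator         :", min(timeit.repeat(as_generator, number=20)))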

             for i in xrange(text_len - n + 1):
-                ngrams.append(text_document[i: i + n])
+                ngrams_append(text_document[i: i + n])
         return ngrams
 
     def _char_wb_ngrams(self, text_document):
@@ -165,15 +187,19 @@ def _char_wb_ngrams(self, text_document):

         min_n, max_n = self.ngram_range
         ngrams = []
+
+        # bind method outside of loop to reduce overhead
+        ngrams_append = ngrams.append
+
         for w in text_document.split():
             w = ' ' + w + ' '
             w_len = len(w)
             for n in xrange(min_n, max_n + 1):
                 offset = 0
-                ngrams.append(w[offset:offset + n])
+                ngrams_append(w[offset:offset + n])
                 while offset + n < w_len:
                     offset += 1
-                    ngrams.append(w[offset:offset + n])
+                    ngrams_append(w[offset:offset + n])
                 if offset == 0:  # count a short word (w_len < n) only once
                     break
         return ngrams