[MRG+1] Text vectorizers memory usage improvement (v2) #7272


Merged
merged 3 commits into scikit-learn:master from rth:slim_count_vectorizer_v4 on Sep 15, 2016

Conversation

rth
Member

@rth rth commented Aug 28, 2016

This PR benchmarks and slightly refactors previously proposed improvements to the performance and memory usage of text vectorizers.

The memory optimisation of the HashingVectorizer from PR #5122 could be done in a separate PR.

The benchmark notebook uses the Enron email collection and considers the performance and memory usage impact of these changes on CountVectorizer, TfidfVectorizer and HashingVectorizer for different numbers of documents (dataset size).

[four benchmark plots: timing and memory usage for CountVectorizer, TfidfVectorizer and HashingVectorizer vs. dataset size]

Note that the benchmarks may not be fully reliable (there is significant I/O for reading the data from disk) and sometimes produce different timings for identical runs.

Memory-wise there is indeed a significant improvement:

  • CountVectorizer: 30-50% less memory used
  • TfidfVectorizer: ~25% less memory used
  • HashingVectorizer: mostly no impact

However, there are some performance drawbacks:

  • >50k document collection: ~~about 10% slower~~
  • 10k document collection: up to 2-3 times slower

I'm not sure whether the trade-off is worth it; this is definitely oriented towards large-scale document collections, but one way or the other this should make it possible to address those PRs and the open issue.

Edit: as of Sept 7, 2016, I'm no longer able to reproduce the performance deterioration (cf. comments below).
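
For context, a minimal sketch of the per-document dict counting pattern this kind of change relies on (a simplified illustration, not the actual PR diff; `analyze` stands in for the vectorizer's analyzer):

```python
from collections import defaultdict

import numpy as np
import scipy.sparse as sp


def count_vocab_sketch(docs, analyze):
    """Build a CSR term-document matrix, accumulating counts in a dict
    per document so that only one (column, count) pair is stored per
    unique term, rather than one entry per token occurrence."""
    vocabulary = defaultdict()
    vocabulary.default_factory = vocabulary.__len__  # unseen term -> next index

    data, indices, indptr = [], [], [0]
    for doc in docs:
        counts = {}
        for feature in analyze(doc):
            idx = vocabulary[feature]
            counts[idx] = counts.get(idx, 0) + 1
        indices.extend(counts.keys())
        data.extend(counts.values())
        indptr.append(len(indices))

    X = sp.csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=np.int64)
    X.sort_indices()
    return X, dict(vocabulary)


# e.g. count_vocab_sketch(["a b a b c", "b c c"], str.split)
```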

@jnothman
Member

Thanks for the thorough benchmarking. I'd like to look at this soon. At the moment we're working towards a release so this may slip into the next one as non-critical and potentially contentious. So if it's quiet here, please ping in a couple of weeks / once 0.18 is out.

@rth
Member Author

rth commented Aug 29, 2016

No problem, I'll ping you about this in a few weeks. Thanks.

@ogrisel ogrisel added this to the 1.0 milestone Aug 29, 2016
map_index[old_val] = new_val

# swap columns in place
indices = X.indices
Member

@jnothman jnothman Sep 8, 2016


I think

X.indices = map_index.take(X.indices, mode='clip')

might be identical to the code in the diff above and faster.
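
For illustration, a self-contained version of that vectorized remapping (toy data; `map_index` and `X` are assumed from the diff context above):

```python
import numpy as np
import scipy.sparse as sp

# toy CSR matrix with 4 columns
X = sp.csr_matrix(np.array([[1, 0, 2, 0],
                            [0, 3, 0, 4]]))

# map_index[old_col] == new_col; here columns are swapped 0<->2 and 1<->3
map_index = np.array([2, 3, 0, 1])

# one vectorized pass over the CSR index array instead of a Python loop;
# mode='clip' clamps any out-of-range old index instead of raising
X.indices = map_index.take(X.indices, mode='clip').astype(X.indices.dtype)
X.sort_indices()  # restore canonical sorted-indices CSR form

print(X.toarray())  # columns now appear in their remapped positions
```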

@rth rth changed the title [MRG] Text vectorizers memory usage improvement (v2) [WIP] Text vectorizers memory usage improvement (v2) Sep 8, 2016
@rth rth changed the title [WIP] Text vectorizers memory usage improvement (v2) [MRG] Text vectorizers memory usage improvement (v2) Sep 8, 2016
@rth
Member Author

rth commented Sep 8, 2016

I have rerun the benchmarks following the modifications from @jnothman's review, and while they show the expected memory usage improvements, I was not able to reproduce the previously reported performance deterioration (including for the previous commit). I'm using the same scikit-learn git version, laptop and conda virtualenv with the same dependencies, so I'm not sure what happened. Maybe these benchmarks are not fully reliable.

Anyway, it now looks like this PR decreases memory use in text vectorizers at approximately identical compute time.

@jnothman
Member

jnothman commented Sep 8, 2016

I am sure we can improve this further, but I think this is great if we want something merged even for 0.18 (@ogrisel? @amueller?).

One concern I have with the dicts is how this is affected by document length. I suspect for short documents, this will be relatively much slower. Could you please run that test (i.e. just take a portion of each doc)?


@rth
Member Author

rth commented Sep 9, 2016

@jnothman As you requested, I have re-checked the performance with short documents (the same collection truncated to 200-character documents, whereas the average document length is otherwise ~2700 characters), and did not see any performance deterioration with this PR, although the memory improvements were less significant.

As far as I can tell from the profiling results (at the very end of the same notebook) for TfidfVectorizer run on 10k documents with default parameters, 90% of the time is spent inside the CountVectorizer._count_vocab function, of which

  • ~50% is spent in the analyzer & tokenizer (mostly doing re.findall)
  • the other ~50% is spent on the actual counting in _count_vocab

Other optimizations that I tried for this PR but that didn't bring any improvement are

  • the other counting methods proposed in this SO post (using Counter, defaultdict, etc.)
  • using pyre2.findall instead of re.findall (but the default regexp is probably too simple to benefit there)

I would be quite tempted to see whether porting the second (inner) loop of CountVectorizer._count_vocab to Cython would perform any better (but probably in some future PR)...
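
For anyone who wants to reproduce that kind of measurement outside the notebook, a minimal profiling sketch (the corpus here is a placeholder, not the Enron data):

```python
import cProfile
import pstats

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox jumps over the lazy dog"] * 10000  # placeholder corpus

profiler = cProfile.Profile()
profiler.enable()
TfidfVectorizer().fit_transform(docs)
profiler.disable()

# print the hot spots; _count_vocab should dominate the cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```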

@amueller
Member

amueller commented Sep 9, 2016

I'd be happy with merging this for 0.18. 👍

@jnothman jnothman changed the title [MRG] Text vectorizers memory usage improvement (v2) [MRG+1] Text vectorizers memory usage improvement (v2) Sep 10, 2016
@jnothman
Member

Ideally, I'd like to see the benchmarks reproduced on other platforms, but this LGTM otherwise. (@amueller, was yours a LGTM?)

Faster tokenization I think we may need to leave as someone else's problem. The inner loop may even be made faster by using specialised dicts like the IntFloatDict in utils. Alternatively, I wonder whether the current sparse matrix and sum_duplicates approach works out faster if done in small batches rather than all at once.
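
For reference, a sketch of that sum_duplicates counting pattern (simplified, with a fixed vocabulary and a stand-in `analyze`):

```python
import numpy as np
import scipy.sparse as sp


def count_via_sum_duplicates(docs, vocabulary, analyze):
    """Emit one (row, col, 1) triple per token occurrence and let scipy
    collapse the duplicates. The intermediate arrays grow with the total
    token count rather than the number of unique (doc, term) pairs,
    which is where the extra memory goes."""
    rows, cols = [], []
    for i, doc in enumerate(docs):
        for feature in analyze(doc):
            if feature in vocabulary:
                rows.append(i)
                cols.append(vocabulary[feature])
    data = np.ones(len(rows), dtype=np.int64)
    X = sp.coo_matrix((data, (rows, cols)),
                      shape=(len(docs), len(vocabulary)))
    X.sum_duplicates()  # merge repeated (row, col) entries by summing
    return X.tocsr()


# e.g. count_via_sum_duplicates(["a b a", "b c"], {"a": 0, "b": 1, "c": 2}, str.split)
```

Batching would amount to running this over slices of `docs` and stacking the resulting matrices with `sp.vstack`.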

@nelson-liu
Contributor

What are the specs/platform of the machine you're running this on? I'd be happy to test this out; it looks useful.

@jnothman
Member

I haven't tested this at all personally.


@jnothman
Member

but see the ipynb posted above


@rth
Member Author

rth commented Sep 12, 2016

@jnothman Thanks for the review! Interesting, the IntFloatDict looks promising...

@nelson-liu If you want to re-run this benchmark on your machine, that would be great! The notebook is linked in this comment above. I was running it on Linux (Gentoo), with conda, using Python 3.5 and the latest versions of numpy/scipy, etc. On Windows without a bash shell, you may have to download the training dataset manually, and the switching between different commits in the benchmark notebook may not work (not sure). My specs are an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz with an SSD, but that doesn't really matter as long as we are looking at relative performance before/after this PR. Thanks!
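
If anyone wants a quick standalone check without the full notebook, a sketch using memory_profiler (an extra dependency assumed here, not required by the PR; the corpus is a placeholder):

```python
from memory_profiler import memory_usage

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox jumps over the lazy dog"] * 10000  # placeholder corpus


def run():
    CountVectorizer().fit_transform(docs)


# sample the process RSS while run() executes and report the peak, in MiB
peak = max(memory_usage((run, (), {}), interval=0.1))
print("peak memory: %.1f MiB" % peak)
```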

@amueller
Member

yeah lgtm.

@jnothman
Member

I'd still love someone to replicate the benchmarks.

@jnothman
Member

Sorry I haven't given this the time.

@nelson-liu
Contributor

nelson-liu commented Sep 13, 2016

I'm in the process of running this on my machine. This is a very well done benchmarking notebook, thanks @rth!

@nelson-liu
Contributor

Here are the results I'm getting.

Non-truncated documents:

[four benchmark screenshots: timing and memory usage]

Documents truncated to 200 characters:

[four benchmark screenshots: timing and memory usage]

Seems to be in line with what was reported earlier, with only slight (if any) performance degradation. This LGTM and should be a good enhancement 👍

@amueller
Member

sweet

@jnothman
Member

The tricky part is knowing who to credit in what's new!

@rth
Member Author

rth commented Sep 15, 2016

@jnothman Yes, not simple, as these changes come from 2-3 PRs and one issue. Congrats on the RC2, BTW!

@jnothman jnothman merged commit 941921f into scikit-learn:master Sep 15, 2016
@jnothman
Member

jnothman commented Sep 15, 2016

@ogrisel, @amueller, should we squeeze this into 0.18? Proposed what's new entry under Enhancements:

   - :ref:`Text vectorizers <text_feature_extraction>` now consume less
     memory when applied to large document collections
     (`#7272 <https://github.com/scikit-learn/scikit-learn/pull/7272>`_).
     By `Jochen Wersdörfer <https://github.com/ephes>`_,
     `Roman Yurchak <https://github.com/rth>`_,
     `Roy Blankman <https://github.com/aupiff>`_.

@amueller
Member

I'm +1. @ogrisel ?

@rth rth deleted the slim_count_vectorizer_v4 branch September 17, 2016 12:25
amueller pushed a commit that referenced this pull request Sep 25, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017