[MRG+1] Text vectorizers memory usage improvement (v2) #7272


Merged
merged 3 commits into scikit-learn:master from rth:slim_count_vectorizer_v4 on Sep 15, 2016

Conversation

rth
Member

@rth rth commented Aug 28, 2016

This PR benchmarks and slightly refactors previously proposed improvements to the performance and memory usage of text vectorizers.

The memory optimisation of the HashingVectorizer from PR #5122 could be done in a separate PR.

The benchmark notebook uses the Enron email collection and considers the performance and memory usage impact of these changes on CountVectorizer, TfidfVectorizer and HashingVectorizer for different numbers of documents (dataset size).

[four benchmark plots: timing and memory usage for CountVectorizer, TfidfVectorizer and HashingVectorizer vs. dataset size]

Note that the benchmarks may not be fully reliable (there is significant I/O for reading the data from disk) and sometimes produce different timings for identical runs.

Memory-wise there is indeed a significant improvement:

  • CountVectorizer: 30-50% less memory used
  • TfidfVectorizer: ~25% less memory used
  • HashingVectorizer: mostly no impact

However, there are some performance drawbacks:

  • >50k document collection: ~~about 10% slower~~
  • 10k document collection: up to 2-3 times slower

I'm not sure whether the trade-off is worth it; this is definitely oriented towards large-scale document collections, but one way or the other this should make it possible to address those PRs and the open issue.

Edit: as of Sept 7, 2016, I'm no longer able to reproduce the performance deterioration (cf. comments below).
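
For context, a minimal sketch of the per-document dict counting pattern this kind of change relies on (a simplified illustration, not the actual PR diff; `analyze` stands in for the vectorizer's analyzer):

```python
from collections import defaultdict

import numpy as np
import scipy.sparse as sp


def count_vocab_sketch(docs, analyze):
    """Build a CSR term-document matrix, accumulating counts in a dict
    per document so that only one (column, count) pair is stored per
    unique term, rather than one entry per token occurrence."""
    vocabulary = defaultdict()
    vocabulary.default_factory = vocabulary.__len__  # unseen term -> next index

    data, indices, indptr = [], [], [0]
    for doc in docs:
        counts = {}
        for feature in analyze(doc):
            idx = vocabulary[feature]
            counts[idx] = counts.get(idx, 0) + 1
        indices.extend(counts.keys())
        data.extend(counts.values())
        indptr.append(len(indices))

    X = sp.csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=np.int64)
    X.sort_indices()
    return X, dict(vocabulary)


# e.g. count_vocab_sketch(["a b a b c", "b c c"], str.split)
```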

@jnothman
Member

Thanks for the thorough benchmarking. I'd like to look at this soon. At the moment we're working towards a release so this may slip into the next one as non-critical and potentially contentious. So if it's quiet here, please ping in a couple of weeks / once 0.18 is out.

@rth
Member Author

rth commented Aug 29, 2016

No problem, I'll ping you about this in a few weeks. Thanks.

@ogrisel ogrisel added this to the 1.0 milestone Aug 29, 2016
map_index[old_val] = new_val

# swap columns in place
indices = X.indices
Member

@jnothman jnothman Sep 8, 2016


I think

X.indices = map_index.take(X.indices, mode='clip')

might be identical to the code in the diff above and faster.
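
For illustration, a self-contained version of that vectorized remapping (toy data; `map_index` and `X` are assumed from the diff context above):

```python
import numpy as np
import scipy.sparse as sp

# toy CSR matrix with 4 columns
X = sp.csr_matrix(np.array([[1, 0, 2, 0],
                            [0, 3, 0, 4]]))

# map_index[old_col] == new_col; here columns are swapped 0<->2 and 1<->3
map_index = np.array([2, 3, 0, 1])

# one vectorized pass over the CSR index array instead of a Python loop;
# mode='clip' clamps any out-of-range old index instead of raising
X.indices = map_index.take(X.indices, mode='clip').astype(X.indices.dtype)
X.sort_indices()  # restore canonical sorted-indices CSR form

print(X.toarray())  # columns now appear in their remapped positions
```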

@rth rth changed the title [MRG] Text vectorizers memory usage improvement (v2) [WIP] Text vectorizers memory usage improvement (v2) Sep 8, 2016
@rth rth changed the title [WIP] Text vectorizers memory usage improvement (v2) [MRG] Text vectorizers memory usage improvement (v2) Sep 8, 2016
@rth
Member Author

rth commented Sep 8, 2016

I have rerun the benchmarks following the modifications from @jnothman's review, and while they show the expected memory usage improvements, I was not able to reproduce the previously reported performance deterioration (including for the previous commit). I'm using the same scikit-learn git version, laptop and conda virtualenv with the same dependencies, so I'm not sure what happened. Maybe these benchmarks are not fully reliable.

Anyway, it now looks like this PR decreases memory use in text vectorizers at approximately identical compute time.

@jnothman
Member

jnothman commented Sep 8, 2016

I am sure we can improve this further, but I think this is great if we want something merged even for 0.18 (@ogrisel? @amueller?).

One concern I have with the dicts is how this is affected by document length. I suspect for short documents, this will be relatively much slower. Could you please run that test (i.e. just take a portion of each doc)?


@rth
Member Author

rth commented Sep 9, 2016

@jnothman As you requested, I have re-checked the performance with short documents (the same collection truncated to 200-character documents, whereas the average document length is otherwise ~2700 characters), and did not see any performance deterioration with this PR, although the memory improvements were less significant.

As far as I can tell from the profiling results (at the very end of the same notebook) for TfidfVectorizer run on 10k documents with default parameters, 90% of the time is spent inside the CountVectorizer._count_vocab function, of which

  • ~50% is spent in the analyzer & tokenizer (mostly doing re.findall)
  • the other ~50% is spent on the actual counting in _count_vocab

Other optimizations that I tried for this PR but that didn't bring any improvement are

  • the other counting methods proposed in this SO post (using Counter, defaultdict, etc.)
  • using pyre2.findall instead of re.findall (but the default regexp is probably too simple to benefit there)

I would be quite tempted to see whether porting the second (inner) loop of CountVectorizer._count_vocab to Cython would perform any better (but probably in some future PR)...
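
For anyone who wants to reproduce that kind of measurement outside the notebook, a minimal profiling sketch (the corpus here is a placeholder, not the Enron data):

```python
import cProfile
import pstats

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox jumps over the lazy dog"] * 10000  # placeholder corpus

profiler = cProfile.Profile()
profiler.enable()
TfidfVectorizer().fit_transform(docs)
profiler.disable()

# print the hot spots; _count_vocab should dominate the cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```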

@amueller
Member

amueller commented Sep 9, 2016

I'd be happy with merging this for 0.18. 👍

@jnothman jnothman changed the title [MRG] Text vectorizers memory usage improvement (v2) [MRG+1] Text vectorizers memory usage improvement (v2) Sep 10, 2016
@jnothman
Member

Ideally, I'd like to see the benchmarks reproduced on other platforms, but this LGTM otherwise. (@amueller, was yours a LGTM?)

Faster tokenization I think we may need to leave as someone else's problem. The inner loop may even be made faster by using specialised dicts like the IntFloatDict in utils. Alternatively, I wonder whether the current sparse matrix and sum_duplicates approach works out faster if done in small batches rather than all at once.
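
For reference, a sketch of that sum_duplicates counting pattern (simplified, with a fixed vocabulary and a stand-in `analyze`):

```python
import numpy as np
import scipy.sparse as sp


def count_via_sum_duplicates(docs, vocabulary, analyze):
    """Emit one (row, col, 1) triple per token occurrence and let scipy
    collapse the duplicates. The intermediate arrays grow with the total
    token count rather than the number of unique (doc, term) pairs,
    which is where the extra memory goes."""
    rows, cols = [], []
    for i, doc in enumerate(docs):
        for feature in analyze(doc):
            if feature in vocabulary:
                rows.append(i)
                cols.append(vocabulary[feature])
    data = np.ones(len(rows), dtype=np.int64)
    X = sp.coo_matrix((data, (rows, cols)),
                      shape=(len(docs), len(vocabulary)))
    X.sum_duplicates()  # merge repeated (row, col) entries by summing
    return X.tocsr()


# e.g. count_via_sum_duplicates(["a b a", "b c"], {"a": 0, "b": 1, "c": 2}, str.split)
```

Batching would amount to running this over slices of `docs` and stacking the resulting matrices with `sp.vstack`.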

@nelson-liu
Contributor

What are the specs/platform of the machine you're running this on? I'd be happy to test this out; it looks useful.

@jnothman
Member

I haven't tested this at all personally.


@jnothman
Member

but see the ipynb posted above


@rth
Member Author

rth commented Sep 12, 2016

@jnothman Thanks for the review! Interesting, the IntFloatDict looks promising...

@nelson-liu If you want to re-run this benchmark on your machine, that would be great! The notebook is linked in this comment above. I was running it on Linux (Gentoo), with conda, using Python 3.5 and the latest versions of numpy/scipy, etc. On Windows without a bash shell, you may have to download the training dataset manually, and the switching between different commits in the benchmark notebook may not work (not sure). My specs are an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz with an SSD, but that doesn't really matter as long as we are looking at relative performance before/after this PR. Thanks!
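
If anyone wants a quick standalone check without the full notebook, a sketch using memory_profiler (an extra dependency assumed here, not required by the PR; the corpus is a placeholder):

```python
from memory_profiler import memory_usage

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox jumps over the lazy dog"] * 10000  # placeholder corpus


def run():
    CountVectorizer().fit_transform(docs)


# sample the process RSS while run() executes and report the peak, in MiB
peak = max(memory_usage((run, (), {}), interval=0.1))
print("peak memory: %.1f MiB" % peak)
```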

@amueller
Member

yeah lgtm.

@jnothman
Member

I'd still love someone to replicate the benchmarks.

@jnothman
Member

Sorry I haven't given this the time.

@nelson-liu
Contributor

nelson-liu commented Sep 13, 2016

I'm in the process of running this on my machine. This is a very well done benchmarking notebook, thanks @rth!

@nelson-liu
Contributor

Here are the results I'm getting.

Non-truncated documents:

[four benchmark screenshots: timing and memory usage]

Documents truncated to 200 characters:

[four benchmark screenshots: timing and memory usage]

Seems to be in line with what was reported earlier, with only slight (if any) performance degradation. This LGTM and should be a good enhancement 👍

@amueller
Member

sweet

@jnothman
Member

The tricky part is knowing who to credit in what's new!

@rth
Member Author

rth commented Sep 15, 2016

@jnothman Yes, not simple, as these changes come from 2-3 PRs and one issue. Congrats on the RC2, BTW!

@jnothman jnothman merged commit 941921f into scikit-learn:master Sep 15, 2016
@jnothman
Member

jnothman commented Sep 15, 2016

@ogrisel, @amueller, should we squeeze this into 0.18? Proposed what's new entry under Enhancements:

   - :ref:`Text vectorizers <text_feature_extraction>` now consume less
     memory when applied to large document collections
     (`#7272 <https://github.com/scikit-learn/scikit-learn/pull/7272>`_).
     By `Jochen Wersdörfer <https://github.com/ephes>`_,
     `Roman Yurchak <https://github.com/rth>`_,
     `Roy Blankman <https://github.com/aupiff>`_.

@amueller
Member

I'm +1. @ogrisel ?

@rth rth deleted the slim_count_vectorizer_v4 branch September 17, 2016 12:25
amueller pushed a commit that referenced this pull request Sep 25, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017