[MRG+1] Text vectorizers memory usage improvement (v2) #7272
Conversation
This update corresponds to the FastAupiffCountVectorizer implementation from PR scikit-learn#5122
Thanks for the thorough benchmarking. I'd like to look at this soon. At the moment we're working towards a release so this may slip into the next one as non-critical and potentially contentious. So if it's quiet here, please ping in a couple of weeks / once 0.18 is out.
No problem, I'll ping you about this in a few weeks. Thanks.
map_index[old_val] = new_val

# swap columns in place
indices = X.indices
I think

X.indices = map_index.take(X.indices, mode='clip')

might be identical to the following and faster.
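For illustration, here is a minimal sketch (not the PR's actual code) of what that vectorised remap does, assuming X is a scipy CSR matrix and map_index maps each old column index to its new position:

```python
import numpy as np
import scipy.sparse as sp

# Toy CSR matrix whose stored column indices we want to remap.
X = sp.csr_matrix(np.array([[0, 1, 2],
                            [3, 0, 4]]))
# map_index[old_col] = new_col; here the columns are simply rotated by one.
map_index = np.array([2, 0, 1], dtype=X.indices.dtype)

# Loop version: remap each stored column index one by one.
remapped = np.empty_like(X.indices)
for i, old_val in enumerate(X.indices):
    remapped[i] = map_index[old_val]

# Vectorised version from the review comment; mode='clip' only guards
# against out-of-range values and leaves valid indices unchanged.
X.indices = map_index.take(X.indices, mode='clip')
assert np.array_equal(X.indices, remapped)
```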
I have rerun the benchmarks following modifications after @jnothman's review, and while they show the expected memory usage improvements, I was not able to reproduce the previously reported performance deterioration (including for the previous commit). I'm using the same scikit-learn git version, laptop and conda virtualenv with dependencies: not sure what happened. Maybe these benchmarks are not fully reliable. Anyway, it now looks like this PR decreases memory use in text vectorizers with approximately identical compute time.
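As a rough idea of how such numbers can be collected (this is not the linked benchmark notebook; it assumes the optional memory_profiler package and uses a toy corpus in place of the real data):

```python
import time
from memory_profiler import memory_usage
from sklearn.feature_extraction.text import CountVectorizer

def bench(docs):
    # Wall time and peak memory of a single fit_transform call.
    vec = CountVectorizer()
    start = time.time()
    peak = memory_usage((vec.fit_transform, (docs,)), max_usage=True)
    return time.time() - start, peak

# The linked notebook uses the Enron email collection; any list of
# strings works for a smoke test.
elapsed, peak = bench(["some example text", "more example text"] * 10000)
print("time: %.2f s, peak memory: %s MiB" % (elapsed, peak))
```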
I am sure we can improve this further, but I think this is great if we want it for 0.18. One concern I have with the dicts is how this is affected by document length.
@jnothman As you requested, I have re-checked the performance with short documents (same collection truncated to 200 char long documents, with the average doc. length otherwise being 2700 char), and did not see any performance deterioration with this PR, although the memory improvements were less significant. As far as I can tell, looking at the profiling results (same notebook at the very end), …
Other optimizations that I have tried for this PR but that didn't bring any improvement are …
I would be quite tempted to see if porting the second (inner) loop of …
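For context, a simplified sketch of the dict-per-document counting pattern discussed in this thread; the loop over tokens below is the kind of inner loop in question. This is not scikit-learn's actual `_count_vocab`, and the `analyze` callable here is just `str.split`:

```python
from collections import defaultdict
import numpy as np
import scipy.sparse as sp

def count_vocab_sketch(docs, analyze):
    """Build a CSR document-term count matrix with one small dict per
    document instead of storing every token occurrence separately."""
    vocabulary = {}
    indptr, indices, values = [0], [], []
    for doc in docs:
        counts = defaultdict(int)
        for token in analyze(doc):          # inner loop over tokens
            idx = vocabulary.setdefault(token, len(vocabulary))
            counts[idx] += 1
        indices.extend(counts.keys())
        values.extend(counts.values())
        indptr.append(len(indices))
    X = sp.csr_matrix((values, indices, indptr),
                      shape=(len(docs), len(vocabulary)), dtype=np.int64)
    X.sort_indices()
    return vocabulary, X

vocab, X = count_vocab_sketch(["a b a", "b c"], str.split)
```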
I'd be happy with merging this for 0.18. 👍
Ideally, I'd like to see the benchmarks reproduced on other platforms, but this LGTM otherwise. (@amueller, was yours a LGTM?)

Faster tokenization I think we may need to leave as someone else's problem. The inner loop may even be made faster by using specialised dicts like the IntFloatDict in utils. Alternatively, I wonder whether the current sparse matrix and …
What are the specs/platform of the machine you're running this on? I'd be happy to test this out, looks useful.
I haven't tested this at all personally.
But see the ipynb posted above.
@jnothman Thanks for the review! Interesting, the …

@nelson-liu If you want to re-run this benchmark on your machine that would be great! The notebook is linked in this comment above. I was running it on Linux (Gentoo), with conda using Python 3.5 and the latest versions of numpy/scipy, etc. On Windows without a bash shell, you may have to manually download the training dataset, and the switching between different commits in the benchmark notebook may not work (not sure). My specs are Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz with an SSD, but that doesn't really matter as long as we are looking at relative performance before/after this PR. Thanks!
yeah lgtm.
I'd still love someone to replicate the benchmarks.
Sorry I haven't given this the time.
I'm in the process of running this on my machine. This is a very well done benchmarking notebook, thanks @rth!
sweet
The tricky part is knowing who to credit in what's new!
@jnothman yes, not simple, as these changes come from 2-3 PRs and one issue. Congrats on the RC2 BTW!
@ogrisel, @amueller, should we squeeze this into 0.18? Proposed what's new entry under Enhancements:

- :ref:`Text vectorizers <text_feature_extraction>` now consume less
  memory when applied to large document collections
  (`#7272 <https://github.com/scikit-learn/scikit-learn/pull/7272>`_).
  By `Jochen Wersdörfer <https://github.com/ephes>`_,
  `Roman Yurchak <https://github.com/rth>`_,
  `Roy Blankman <https://github.com/aupiff>`_.
I'm +1. @ogrisel ?
This PR benchmarks and slightly refactors previously proposed improvements for the performance and memory usage of text vectorizers (CountVectorizer, TfidfVectorizer, HashingVectorizer). The memory optimisation of the HashingVectorizer from PR #5122 could be done in a separate PR.

The benchmark notebook uses the Enron email collection and considers the performance and memory usage impact of these changes on CountVectorizer, TfidfVectorizer and HashingVectorizer for different numbers of documents (Dataset size). Note that the benchmarks may not be fully reliable (significant I/O operations for reading the data from disk) and sometimes produce different timings for identical runs.

Memory-wise there is indeed a significant improvement; however, there are some performance drawbacks, up to 2-3 times slower. I'm not sure the trade-off is worth it; this is definitely oriented to large-scale document collections, but one way or another this should allow addressing those PRs and the opened issue.

Edit: as of Sept 7, 2016 I'm not able to reproduce the performance deterioration anymore (cf. comments below).
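For readers less familiar with the three estimators being compared, a minimal usage sketch (toy documents stand in for the Enron collection):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

docs = ["the quick brown fox", "jumped over the lazy dog"]

X_counts = CountVectorizer().fit_transform(docs)   # raw term counts + vocabulary
X_tfidf = TfidfVectorizer().fit_transform(docs)    # tf-idf weighted counts
X_hashed = HashingVectorizer().transform(docs)     # stateless, hashed features

print(X_counts.shape, X_tfidf.shape, X_hashed.shape)
```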