ENH ensure no copy if not requested and improve transform performance in TFIDFTransformer #18843
|
Interesting, can you please run a quick benchmark to see the impact of this change? |
|
Seeing as this no longer needs to construct the |
Some generic benchmarks for this operation are here but it would indeed be good to benchmark TfidfTransformer specifically. |
glemaitre left a comment
Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
sklearn/feature_extraction/text.py
Outdated
    if copy:
        X = X.copy()
    X.data *= self._idf_diag.data[X.indices]
It could be nice to have a small comment:
    # sparse matrix does not support broadcasting but
    # with CSR matrix we can safely pick-up the indices.
    X.data *= self._idf_diag.data[X.indices]
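To illustrate why indexing with X.indices works, here is a minimal sketch (not the PR's code). In CSR format, X.indices stores the column index of every stored value, so idf[X.indices] lines up one-to-one with X.data. The random matrix and the idf vector are made up for illustration; in the PR the weights come from self._idf_diag.data.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Made-up inputs: a random CSR matrix and one positive weight per column.
X = sp.random(50, 20, density=0.2, format="csr", random_state=0, dtype=np.float64)
idf = rng.uniform(1.0, 3.0, size=X.shape[1])

# Reference result: sparse product with a diagonal matrix of weights.
expected = X @ sp.diags(idf, format="csr")

# Direct result: scale each stored value by the weight of its column.
scaled = X.copy()
scaled.data *= idf[scaled.indices]

assert np.allclose(expected.toarray(), scaled.toarray())
```

Both paths give the same values; the second one skips the sparse matrix product and only touches the stored entries.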
|
We can later propose to move the parameter |
|
Hi @thebabush, are you still working on this PR? |
|
Hi, it kind of slipped my mind because of much-needed away-from-PC time during the holidays + not really using sklearn atm in my job, sorry. Just to recap everything that was requested:
I also need to check the linting CI thing, but the build is not available anymore :/ Any particular suggestions for the benchmark, or can I simply throw some random CSR matrices at TFIDF and measure the timing?
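One way to do that, as a rough sketch rather than the benchmark actually used in this PR: time the sparse product against the in-place scaling on random CSR matrices. The helpers scale_matmul, scale_inplace and bench are hypothetical names, and plain SciPy operations stand in for TfidfTransformer.transform.

```python
import timeit

import numpy as np
import scipy.sparse as sp


def scale_matmul(X, idf_diag):
    # Stand-in for the old code path: sparse product with a diagonal matrix.
    return X @ idf_diag


def scale_inplace(X, idf):
    # Stand-in for the new code path: scale the stored values directly.
    X.data *= idf[X.indices]
    return X


def bench(n_rows, n_cols, density=0.01, repeat=3):
    X = sp.random(n_rows, n_cols, density=density, format="csr",
                  random_state=0, dtype=np.float64)
    idf = np.linspace(1.0, 2.0, n_cols)
    idf_diag = sp.diags(idf, format="csr")

    t_old = min(timeit.repeat(lambda: scale_matmul(X, idf_diag),
                              number=1, repeat=repeat))

    # Give every repetition a fresh copy, made outside the timed call,
    # so the in-place path never rescales already-scaled data.
    copies = iter([X.copy() for _ in range(repeat)])
    t_new = min(timeit.repeat(lambda: scale_inplace(next(copies), idf),
                              number=1, repeat=repeat))
    return t_old, t_new


for shape in [(2_000, 200), (20_000, 2_000)]:
    t_old, t_new = bench(*shape)
    print(f"{shape}: matmul={t_old:.6f}s in-place={t_new:.6f}s")
```

Timings depend on the machine, the density and the SciPy version, so the numbers are only meaningful when both variants are run in the same environment.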
It would be better to merge master in to synchronize; that would also re-trigger CI.
|
I'll supersede this PR. |
|
I'm not 100% sure what that means wrt the points above. Should I work on it on top of your PR or this means you have more changes coming and that the points above are obsolete? |
|
Ah sorry @thebabush, I misread and thought you would not be working on it anymore. 🤦
I guess you can adapt the benchmarks of @rth referenced here for
|
|
k, will do. |
Force-pushed from 7f665e9 to 29637d6
|
Ok, should be good. Benchmarks: https://github.com/thebabush/tmp-benchmark-tfidf I tried a bunch of values, dunno if they are enough. I removed a Let me know what you think (: |
|
Hi @thebabush, thanks! You are off to a good start: we can compare both versions for various sizes and your fix seems to offer faster run-times. To better conclude on the performance of your solution I would suggest to:
|
|
Dear @jjerphan do you have any example benchmark script using pyperf that I could use? Using it programmatically is being a major PITA as it does too much magic for my taste and I don't really want to read all its source to understand what's going on under the hood (I can tell it uses multiprocessing but I really don't get why). |
Hi @thebabush, You can use the
My comment for |
Force-pushed from 29637d6 to 9e07743
|
Hi, finally got some more time to dedicate to this. In https://github.com/thebabush/tmp-benchmark-tfidf/blob/master/oldnew.csv you can read the timing results. I think the timing results show that the improvement is pretty solid. Let me know what you think. babush |
|
@jjerphan any more interest in reviewing this? |
|
And sorry for the slow reply, @thebabush but are you still interested in working on this? |
|
@lucyleeow: Yes, I can. Let's wait for @thebabush to respond? |
|
Hi guys. A long time has passed, and I'd rather not invest more time into this. The last thing I pushed was ready to go, improves the existing code, has some benchmarks, and was passing the tests. |
|
@thebabush no problem, thanks for your work. |
|
Feel free to, Lucy. 🙂 I will review it then (you can ping me). |
glemaitre left a comment
I pushed a couple of changes:
- added a test since we don't have anything to check the basic behaviour of the copy parameter.
- corrected the source because we were doing one copy too many in the case of copy=True.
- moved the entry of the changelog.
Now it looks good to me. I might try to repeat a benchmark to check that we still have an improvement.
glemaitre left a comment
I still get the 4x speed-up for copy=False, and the code makes an in-place operation, so we should save memory for sure.
|
@jjerphan @lucyleeow if you want to have a look at this one, I think this is kind of ready and pretty much straightforward.
|
@marenwestermann and @StefanieSenger: would you be interested in reviewing this PR? 🙂 |
StefanieSenger left a comment
@jjerphan Thanks for your trust.
I have worked out some suggestions with the support from @adrinjalali, who also wants to add some more review comments.
    X_csr_original = X_csr.copy()

    transformer = TfidfTransformer().fit(X_csr)
    # check that we transform on a copy
I learnt that we usually should not need this type of comment because the code should be self-explanatory. The fact that we pass copy=False/True already gives this information.
I understand, but comments like that one do help less experienced people to grasp the concepts faster and contribute to the project.
    X_transform = transformer.transform(X_csr, copy=True)
    assert_allclose_dense_sparse(X_csr, X_csr_original)
    assert X_transform is not X_csr
    # check that we transform in place
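The behaviour these assertions target can be sketched without scikit-learn. scale_columns below is a hypothetical stand-in for the transform step, assuming the copy handling shown in the diff for sklearn/feature_extraction/text.py; it is not the library's implementation, and the input values are made up.

```python
import numpy as np
import scipy.sparse as sp


def scale_columns(X, idf, copy=True):
    # Hypothetical helper: copy only when asked, then scale in place.
    if copy:
        X = X.copy()
    X.data *= idf[X.indices]
    return X


X_csr = sp.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
X_csr_original = X_csr.copy()
idf = np.array([1.0, 2.0, 3.0])

# copy=True: the input is left untouched and a new matrix is returned.
X_copy = scale_columns(X_csr, idf, copy=True)
assert X_copy is not X_csr
assert np.array_equal(X_csr.toarray(), X_csr_original.toarray())

# copy=False: the same object comes back, with its data modified.
X_inplace = scale_columns(X_csr, idf, copy=False)
assert X_inplace is X_csr
assert np.array_equal(X_csr.toarray(), X_copy.toarray())
```

Checking object identity together with the values covers both halves of the contract: whether the caller's matrix was modified, and whether the result is correct.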
adrinjalali left a comment
I agree with @StefanieSenger, we don't need idf_ to be a property anymore.
glemaitre left a comment
I made all the requested changes (apart from the comment in the tests). I'll give the dtype preservation a try in another PR.
adrinjalali left a comment
LGTM. But I was very confused by the fact that our tf is not really a frequency, as you'd expect from reading the Wikipedia page on tf-idf; it's rather a count, which apparently is the norm in IR.
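A small numeric sketch of that distinction, with made-up counts. The smoothed idf formula below is assumed to match scikit-learn's default (smooth_idf=True); treat it as an assumption rather than a quote from the library. Dividing each row by its document length is only a positive per-row rescaling, so once rows are L2-normalised the count-based and frequency-based variants coincide.

```python
import numpy as np

# Made-up term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[3.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 2.0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # assumed smoothed idf

# tf as a raw count versus tf as a within-document relative frequency.
tfidf_counts = counts * idf
tfidf_freq = (counts / counts.sum(axis=1, keepdims=True)) * idf


def l2_normalize(M):
    # Scale every row to unit Euclidean length.
    return M / np.linalg.norm(M, axis=1, keepdims=True)


assert not np.allclose(tfidf_counts, tfidf_freq)
assert np.allclose(l2_normalize(tfidf_counts), l2_normalize(tfidf_freq))
```

Without row normalisation the two definitions do give different values, which is where the confusion matters.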
|
I open #28136 where the |
… in TFIDFTransformer (scikit-learn#18843) Co-authored-by: Guillaume Lemaitre
Reference Issues/PRs
Partial fix for #18812
What does this implement/fix? Explain your changes.
In TfidfTransformer both X and _idf_diag are CSR matrices, so there's no need to actually do sparse matrix multiplication. Instead, we can work on their .data arrays directly.
Any other comments?
This patch still has the side-effect of allocating an array as big as X.data, which is unnecessary (again, see #18812).
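One possible way to bound that temporary, sketched under the assumption that processing X.data in fixed-size chunks is acceptable. This is not what the PR does, nor necessarily what #18812 proposes; scale_columns_chunked and chunk_size are hypothetical names.

```python
import numpy as np
import scipy.sparse as sp


def scale_columns_chunked(X, idf, chunk_size=1 << 16):
    # Scale the stored values of a CSR matrix in place, chunk by chunk.
    # Each idf[indices[start:stop]] temporary holds at most chunk_size items.
    data, indices = X.data, X.indices
    for start in range(0, data.shape[0], chunk_size):
        stop = start + chunk_size
        data[start:stop] *= idf[indices[start:stop]]
    return X


# Made-up inputs for a quick consistency check against the one-shot version.
X = sp.random(200, 40, density=0.1, format="csr", random_state=0, dtype=np.float64)
idf = np.linspace(1.0, 2.0, X.shape[1])

one_shot = X.copy()
one_shot.data *= idf[one_shot.indices]

chunked = scale_columns_chunked(X.copy(), idf, chunk_size=64)
assert np.allclose(one_shot.data, chunked.data)
```

The Python-level loop adds overhead for small chunks, so any real use would need to be benchmarked against the single fancy-indexing call.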