Conversation

@thebabush
Contributor

Reference Issues/PRs

Partial fix for #18812

What does this implement/fix? Explain your changes.

In TfidfTransformer both X and _idf_diag are CSR matrices, so there's no need to actually do sparse matrix multiplication.
Instead, we can work on their .data arrays directly.

Any other comments?

This patch still has the side effect of allocating an array as big as X.data, which is unnecessary (again, see #18812).
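
For readers unfamiliar with the trick, here is a minimal sketch (the matrix sizes and idf values are made up for illustration): in CSR format, X.indices holds the column index of each stored value, so indexing per-column idf weights with it scales every stored value by its column's weight, exactly as multiplying by the diagonal idf matrix would.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(5, 8, density=0.3, format="csr", random_state=0)
idf = rng.uniform(1.0, 3.0, size=X.shape[1])  # one weight per column/term

# Reference result: multiply by the diagonal idf matrix.
expected = X @ sp.diags(idf)

# CSR trick: X.indices holds the column index of each stored value,
# so idf[X.indices] lines each stored value up with its column's weight.
Y = X.copy()
Y.data *= idf[Y.indices]

assert np.allclose(Y.toarray(), expected.toarray())
```

Note that the fancy indexing idf[Y.indices] still materializes an array as big as X.data, which is the remaining allocation mentioned above.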

@ogrisel
Member

ogrisel commented Nov 16, 2020

Interesting, can you please run a quick benchmark to see the impact of this change?

@jnothman
Member

Seeing as this no longer needs to construct the dia_matrix, can you please remove that too?

@rth
Member

rth commented Nov 16, 2020

Interesting, can you please run a quick benchmark to see the impact of this change?

Some generic benchmarks for this operation are here but it would indeed be good to benchmark TfidfTransformer specifically.

@glemaitre glemaitre changed the title "faster tfidf transform" to "ENH ensure no copy if not requested and improve transform performance in TFIDFTransformer" Dec 19, 2020
Member

@glemaitre glemaitre left a comment

Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.


if copy:
    X = X.copy()
X.data *= self._idf_diag.data[X.indices]
Member

It could be nice to have a small comment

Suggested change
X.data *= self._idf_diag.data[X.indices]
# sparse matrix does not support broadcasting but
# with CSR matrix we can safely pick-up the indices.
X.data *= self._idf_diag.data[X.indices]

@glemaitre
Member

We can later propose to move the parameter copy into the constructor.

Base automatically changed from master to main January 22, 2021 10:53
@jjerphan
Member

jjerphan commented Feb 8, 2021

Hi @thebabush, are you still working on this PR?

@thebabush
Contributor Author

Hi,

it kind of slipped my mind because of some much-needed away-from-PC time during the holidays, plus I'm not really using sklearn atm in my job, sorry.

just to recap everything that was requested:

  • the small comment thing
  • entry in whatsnew
  • remove dia_matrix
  • a benchmark
  • (for future PRs) move copy to the constructor

I also need to check the linting CI thing, but the build is not available anymore :/

Any particular suggestions for the benchmark, or can I simply throw some random CSR matrices at TfidfTransformer and measure timing?

@rth
Member

rth commented Feb 10, 2021

I also need to check the linting CI thing, but the build is not available anymore :/

It would be better to merge master in to synchronize; that would also re-trigger the CI.

@jjerphan
Member

I'll supersede this PR.

@thebabush
Contributor Author

I'm not 100% sure what that means wrt the points above. Should I work on top of your PR, or does this mean you have more changes coming and the points above are obsolete?

@jjerphan
Member

Ah sorry @thebabush, I misread and thought you would not be working on it anymore. 🤦

Any particular suggestion for the benchmark or I can simply throw some random CSR matrices to TFIDF and measure timing?

I guess you can adapt the benchmarks of @rth referenced here for TfidfTransformer.
You can report the results for copy=True, various shapes (n, m) (taken on a log scale), and various values for density, on both main and your branch. 🙂

pyperf is a small utility similar to timeit which can help you run benchmarks and analyse their results easily.
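
As a rough starting point (the sizes and repeat counts here are made up, and this uses the stdlib timeit rather than pyperf), a throwaway script along these lines would do:

```python
import timeit

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Illustrative sizes; the suggestion above is to sweep (n, m) on a log
# scale and several density values, on both main and the branch.
n, m, density = 2_000, 500, 0.05
X = sp.random(n, m, density=density, format="csr", random_state=0)
transformer = TfidfTransformer().fit(X)

# Benchmark copy=True so every call sees the same input
# (copy=False would mutate X and skew the repeated timings).
best = min(timeit.repeat(
    lambda: transformer.transform(X, copy=True), number=20, repeat=3))
print(f"n={n} m={m} density={density}: {best:.4f}s per 20 calls")
```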

@thebabush
Contributor Author

k, will do.

@thebabush thebabush force-pushed the feature-faster-tfidf-transform branch from 7f665e9 to 29637d6 Compare February 10, 2021 20:57
@thebabush
Contributor Author

Ok, should be good.

Benchmarks: https://github.com/thebabush/tmp-benchmark-tfidf

I tried a bunch of values, dunno if they are enough.
Anyway it looks like there's a decent improvement (and for sure there's a memory improvement, though I didn't measure it).

I removed a np.ravel(). I'm not familiar with it so I dunno if it's necessary now, but it didn't look like it.

Let me know what you think (:

@jjerphan
Member

Hi @thebabush

Thanks! You are off to a good start: we can compare both versions for various sizes, and your fix seems to offer faster run-times.

To draw better conclusions on the performance of your solution, I would suggest to:

  • make it a script, so that it can easily be run directly without Jupyter
  • iterate on the combinations of parameter ranges (linear for density, log for n and m) to test more cases, including large values for density (e.g. {0.1, 0.2, 0.5})
  • report results in a table and/or graphically if possible
  • ideally, include memory usage alongside runtime in the results; this can be done using memory_profiler (see its Python API)
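
The comment above suggests memory_profiler; as a dependency-free alternative, the stdlib tracemalloc gives a comparable peak-allocation figure (NumPy registers its buffers with tracemalloc, so the sparse arrays are counted). The sizes below are illustrative:

```python
import tracemalloc

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

X = sp.random(5_000, 2_000, density=0.05, format="csr", random_state=0)
transformer = TfidfTransformer().fit(X)

# Measure peak allocations made during a single transform call.
tracemalloc.start()
transformer.transform(X, copy=True)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak allocations during transform: {peak / 1e6:.2f} MB")
```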

@thebabush
Contributor Author

Dear @jjerphan do you have any example benchmark script using pyperf that I could use?

Using it programmatically is being a major PITA as it does too much magic for my taste and I don't really want to read all its source to understand what's going on under the hood (I can tell it uses multiprocessing but I really don't get why).
At the same time, using it from CLI is not the best thing to do since creating the sparse matrix can take quite some time and I don't think it makes much sense to measure that.

@jjerphan
Member

Dear @jjerphan do you have any example benchmark script using pyperf that I could use?

Hi @thebabush,

You can use the Runner.bench_func in Python, see this small example.

Using it programmatically is being a major PITA as it does too much magic for my taste and I don't really want to read all its source to understand what's going on under the hood (I can tell it uses multiprocessing but I really don't get why).
At the same time, using it from CLI is not the best thing to do since creating the sparse matrix can take quite some time and I don't think it makes much sense to measure that.

My comment for pyperf was just indicative: feel free to use what you want to. 🙂

@thebabush thebabush force-pushed the feature-faster-tfidf-transform branch from 29637d6 to 9e07743 Compare April 25, 2021 22:10
@thebabush
Contributor Author

Hi,

finally got some more time to dedicate to this.
I ended up writing a simple ad-hoc benchmarking script.

In https://github.com/thebabush/tmp-benchmark-tfidf/blob/master/oldnew.csv you can read the timing results.
You can also load old.json and new.json with pandas to read the results of memory profiler.

I think the timing results show that the improvement is pretty solid.
It's trivial to adjust the parameter space in perf.py, so if you want to try a different range and let it run for a day or so, please do it.
I honestly wouldn't want to dedicate more time to this.

Let me know what you think.

babush

@lucyleeow
Member

@jjerphan any more interest in reviewing this?

@lucyleeow
Member

And sorry for the slow reply, @thebabush but are you still interested in working on this?

@jjerphan
Member

jjerphan commented Nov 1, 2023

@lucyleeow: Yes, I can.

Let's wait for @thebabush to respond?

@thebabush
Contributor Author

Hi guys. A long time has passed, and I'd rather not invest more time into this. The last thing I pushed was ready to go, improves the existing code, has some benchmarks, and was passing the tests.

@lucyleeow
Member

@thebabush no problem, thanks for your work.
@jjerphan I'm happy to make any remaining changes, if you're happy to review?

@jjerphan
Member

jjerphan commented Nov 5, 2023

Feel free to, Lucy. 🙂

I will review it then (you can ping me).

@glemaitre glemaitre self-requested a review November 6, 2023 07:43
@github-actions

github-actions bot commented Jan 11, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7181df1. Link to the linter CI: here

Member

@glemaitre glemaitre left a comment

I pushed a couple of changes:

  • added a test, since we didn't have anything checking the basic behaviour of the copy parameter
  • corrected the source, because we were doing one copy too many in the case of copy=True
  • moved the changelog entry

Now it looks good to me. I might try to repeat a benchmark to check that we still have an improvement.

@glemaitre glemaitre added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one!) and removed the "Stalled" and "Needs work" labels Jan 11, 2024
Member

@glemaitre glemaitre left a comment

I still get the x4 speed-up for copy=False, and the code now operates in place, so we should save memory for sure.

@glemaitre
Member

@jjerphan @lucyleeow if you want to have a look at this one, I think it is kind of ready and pretty much straightforward indeed.

@jjerphan
Member

@marenwestermann and @StefanieSenger: would you be interested in reviewing this PR? 🙂

Member

@StefanieSenger StefanieSenger left a comment

@jjerphan Thanks for your trust.
I have worked out some suggestions with the support from @adrinjalali, who also wants to add some more review comments.

X_csr_original = X_csr.copy()

transformer = TfidfTransformer().fit(X_csr)

Member

Suggested change
# check that we transform on a copy

Member

Usually, I learnt that we should not have to put this type of comment, because the code should be self-explanatory: the fact that we pass copy=False/True already gives this information.

Member

I understand, but comments like that one do help less experienced people to grasp the concepts faster and contribute to the project.

X_transform = transformer.transform(X_csr, copy=True)
assert_allclose_dense_sparse(X_csr, X_csr_original)
assert X_transform is not X_csr

Member

@StefanieSenger StefanieSenger Jan 15, 2024

Suggested change
# check that we transform in place
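
Piecing the quoted fragments together, the test under discussion might look roughly like this (the input matrix here is made up for illustration; assert_allclose_dense_sparse is scikit-learn's test helper):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.utils._testing import assert_allclose_dense_sparse

X_csr = sp.random(10, 20, density=0.3, format="csr", random_state=0)
X_csr_original = X_csr.copy()
transformer = TfidfTransformer().fit(X_csr)

# copy=True: the input is left untouched and a new matrix is returned
X_transform = transformer.transform(X_csr, copy=True)
assert_allclose_dense_sparse(X_csr, X_csr_original)
assert X_transform is not X_csr

# copy=False: the result must match, whether or not it reuses X's buffers
X_mutable = X_csr.copy()
X_transform_inplace = transformer.transform(X_mutable, copy=False)
assert_allclose_dense_sparse(X_transform_inplace, X_transform)
```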

Member

@adrinjalali adrinjalali left a comment

I agree with @StefanieSenger , we don't need idf_ to be a property anymore.

@glemaitre glemaitre self-requested a review January 15, 2024 18:00
Member

@glemaitre glemaitre left a comment

I made all the requested changes (apart from the comment in the tests). I'll give the dtype preservation a try in another PR.

Member

@adrinjalali adrinjalali left a comment

LGTM. But I was very confused by the fact that our tf is not really a frequency, as you'd expect from reading the wikipedia page on tf-idf; it's rather a raw count, which apparently is the norm in IR.

@glemaitre
Member

I opened #28136, where the dtype preservation seems quite straightforward. It would also impact TfidfVectorizer if the dtype is set to np.float32.

@adrinjalali adrinjalali merged commit 8a71b84 into scikit-learn:main Jan 15, 2024
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024