Conversation

@thebabush
Contributor

Reference Issues/PRs

Partial fix for #18812

What does this implement/fix? Explain your changes.

In TfidfTransformer both X and _idf_diag are CSR matrices, so there's no need to actually do sparse matrix multiplication.
Instead, we can work on their .data arrays directly.

Any other comments?

This patch still has the side effect of allocating an array as big as X.data, which is unnecessary (again, see #18812).
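
For readers unfamiliar with the trick, here is a minimal sketch (the matrix sizes and idf values are made up for illustration): in CSR format, X.indices holds the column index of each stored value, so indexing per-column idf weights with it scales every stored value by its column's weight, exactly as multiplying by the diagonal idf matrix would.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(5, 8, density=0.3, format="csr", random_state=0)
idf = rng.uniform(1.0, 3.0, size=X.shape[1])  # one weight per column/term

# Reference result: multiply by the diagonal idf matrix.
expected = X @ sp.diags(idf)

# CSR trick: X.indices holds the column index of each stored value,
# so idf[X.indices] lines each stored value up with its column's weight.
Y = X.copy()
Y.data *= idf[Y.indices]

assert np.allclose(Y.toarray(), expected.toarray())
```

Note that the fancy indexing idf[Y.indices] still materializes an array as big as X.data, which is the remaining allocation mentioned above.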

@ogrisel
Member

ogrisel commented Nov 16, 2020

Interesting, can you please run a quick benchmark to see the impact of this change?

@jnothman
Member

Seeing as this no longer needs to construct the dia_matrix, can you please remove that too?

@rth
Member

rth commented Nov 16, 2020

Interesting, can you please run a quick benchmark to see the impact of this change?

Some generic benchmarks for this operation are here but it would indeed be good to benchmark TfidfTransformer specifically.

@glemaitre glemaitre changed the title "faster tfidf transform" to "ENH ensure no copy if not requested and improve transform performance in TFIDFTransformer" Dec 19, 2020
Member

@glemaitre glemaitre left a comment

Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.


if copy:
    X = X.copy()
X.data *= self._idf_diag.data[X.indices]
Member

It could be nice to have a small comment

Suggested change
X.data *= self._idf_diag.data[X.indices]
# sparse matrix does not support broadcasting but
# with CSR matrix we can safely pick-up the indices.
X.data *= self._idf_diag.data[X.indices]

@glemaitre
Member

We can later propose to move the parameter copy into the constructor.

Base automatically changed from master to main January 22, 2021 10:53
@jjerphan
Member

jjerphan commented Feb 8, 2021

Hi @thebabush, are you still working on this PR?

@thebabush
Contributor Author

Hi,

it kind of slipped my mind because of some much-needed away-from-PC time during the holidays, plus I'm not really using sklearn atm in my job, sorry.

just to recap everything that was requested:

  • the small comment thing
  • entry in whatsnew
  • remove dia_matrix
  • a benchmark
  • (for future PRs) move copy to the constructor

I also need to check the linting CI thing, but the build is not available anymore :/

Any particular suggestions for the benchmark, or can I simply throw some random CSR matrices at TfidfTransformer and measure timing?

@rth
Member

rth commented Feb 10, 2021

I also need to check the linting CI thing, but the build is not available anymore :/

It would be better to merge master in to synchronize; that would also re-trigger the CI.

@jjerphan
Member

I'll supersede this PR.

@thebabush
Contributor Author

I'm not 100% sure what that means wrt the points above. Should I work on top of your PR, or does this mean you have more changes coming and the points above are obsolete?

@jjerphan
Member

Ah sorry @thebabush, I misread and thought you would not be working on it anymore. 🤦

Any particular suggestion for the benchmark or I can simply throw some random CSR matrices to TFIDF and measure timing?

I guess you can adapt the benchmarks of @rth referenced here for TfidfTransformer.
You can report the results for copy=True, various shapes (n, m) (taken on a log scale), and various values for density, on both main and your branch. 🙂

pyperf is a small utility similar to timeit which can help you run benchmarks and analyse their results easily.
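
As a rough starting point (the sizes and repeat counts here are made up, and this uses the stdlib timeit rather than pyperf), a throwaway script along these lines would do:

```python
import timeit

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Illustrative sizes; the suggestion above is to sweep (n, m) on a log
# scale and several density values, on both main and the branch.
n, m, density = 2_000, 500, 0.05
X = sp.random(n, m, density=density, format="csr", random_state=0)
transformer = TfidfTransformer().fit(X)

# Benchmark copy=True so every call sees the same input
# (copy=False would mutate X and skew the repeated timings).
best = min(timeit.repeat(
    lambda: transformer.transform(X, copy=True), number=20, repeat=3))
print(f"n={n} m={m} density={density}: {best:.4f}s per 20 calls")
```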

@thebabush
Contributor Author

k, will do.

@thebabush thebabush force-pushed the feature-faster-tfidf-transform branch from 7f665e9 to 29637d6 Compare February 10, 2021 20:57
@thebabush
Contributor Author

Ok, should be good.

Benchmarks: https://github.com/thebabush/tmp-benchmark-tfidf

I tried a bunch of values, dunno if they are enough.
Anyway it looks like there's a decent improvement (and for sure there's a memory improvement, though I didn't measure it).

I removed a np.ravel(). I'm not familiar with it so I dunno if it's necessary now, but it didn't look like it.

Let me know what you think (:

@jjerphan
Member

Hi @thebabush

Thanks! You are off to a good start: we can compare both versions for various sizes, and your fix seems to offer faster run-times.

To draw better conclusions on the performance of your solution, I would suggest to:

  • make it a script, so that it can easily be run directly without Jupyter
  • iterate on the combinations of parameter ranges (linear for density, log for n and m) to test more cases, including large values for density (e.g. {0.1, 0.2, 0.5})
  • report results in a table and/or graphically if possible
  • ideally, include memory usage alongside runtime in the results; this can be done using memory_profiler (see its Python API)
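
The comment above suggests memory_profiler; as a dependency-free alternative, the stdlib tracemalloc gives a comparable peak-allocation figure (NumPy registers its buffers with tracemalloc, so the sparse arrays are counted). The sizes below are illustrative:

```python
import tracemalloc

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

X = sp.random(5_000, 2_000, density=0.05, format="csr", random_state=0)
transformer = TfidfTransformer().fit(X)

# Measure peak allocations made during a single transform call.
tracemalloc.start()
transformer.transform(X, copy=True)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak allocations during transform: {peak / 1e6:.2f} MB")
```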

@thebabush
Contributor Author

Dear @jjerphan do you have any example benchmark script using pyperf that I could use?

Using it programmatically is being a major PITA as it does too much magic for my taste and I don't really want to read all its source to understand what's going on under the hood (I can tell it uses multiprocessing but I really don't get why).
At the same time, using it from CLI is not the best thing to do since creating the sparse matrix can take quite some time and I don't think it makes much sense to measure that.

@jjerphan
Member

Dear @jjerphan do you have any example benchmark script using pyperf that I could use?

Hi @thebabush,

You can use the Runner.bench_func in Python, see this small example.

Using it programmatically is being a major PITA as it does too much magic for my taste and I don't really want to read all its source to understand what's going on under the hood (I can tell it uses multiprocessing but I really don't get why).
At the same time, using it from CLI is not the best thing to do since creating the sparse matrix can take quite some time and I don't think it makes much sense to measure that.

My comment for pyperf was just indicative: feel free to use what you want to. 🙂

@thebabush thebabush force-pushed the feature-faster-tfidf-transform branch from 29637d6 to 9e07743 Compare April 25, 2021 22:10
@thebabush
Contributor Author

Hi,

finally got some more time to dedicate to this.
I ended up writing a simple ad-hoc benchmarking script.

In https://github.com/thebabush/tmp-benchmark-tfidf/blob/master/oldnew.csv you can read the timing results.
You can also load old.json and new.json with pandas to read the results of memory profiler.

I think the timing results show that the improvement is pretty solid.
It's trivial to adjust the parameter space in perf.py, so if you want to try a different range and let it run for a day or so, please do it.
I honestly wouldn't want to dedicate more time to this.

Let me know what you think.

babush

@lucyleeow
Member

@jjerphan any more interest in reviewing this?

@lucyleeow
Member

And sorry for the slow reply, @thebabush but are you still interested in working on this?

@jjerphan
Member

jjerphan commented Nov 1, 2023

@lucyleeow: Yes, I can.

Let's wait for @thebabush to respond?

@thebabush
Contributor Author

Hi guys. A long time has passed, and I'd rather not invest more time into this. The last thing I pushed was ready to go, improves the existing code, has some benchmarks, and was passing the tests.

@lucyleeow
Member

@thebabush no problem, thanks for your work.
@jjerphan I'm happy to make any remaining changes, if you're happy to review?

@jjerphan
Member

jjerphan commented Nov 5, 2023

Feel free to, Lucy. 🙂

I will review it then (you can ping me).

@glemaitre glemaitre self-requested a review November 6, 2023 07:43
@github-actions

github-actions bot commented Jan 11, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7181df1. Link to the linter CI: here

Member

@glemaitre glemaitre left a comment

I pushed a couple of changes:

  • added a test, since we didn't have anything checking the basic behaviour of the copy parameter
  • corrected the source, because we were doing one copy too many in the case of copy=True
  • moved the changelog entry

Now it looks good to me. I might try to repeat a benchmark to check that we still have an improvement.

@glemaitre glemaitre added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one!) and removed the "Stalled" and "Needs work" labels Jan 11, 2024
Member

@glemaitre glemaitre left a comment

I still get the x4 speed-up for copy=False, and the code now operates in place, so we should save memory for sure.

@glemaitre
Member

@jjerphan @lucyleeow if you want to have a look at this one, I think it is kind of ready and pretty much straightforward indeed.

@jjerphan
Member

@marenwestermann and @StefanieSenger: would you be interested in reviewing this PR? 🙂

Member

@StefanieSenger StefanieSenger left a comment

@jjerphan Thanks for your trust.
I have worked out some suggestions with the support from @adrinjalali, who also wants to add some more review comments.

X_csr_original = X_csr.copy()

transformer = TfidfTransformer().fit(X_csr)

Member

Suggested change
# check that we transform on a copy

Member

Usually, I learnt that we should not have to put this type of comment, because the code should be self-explanatory: the fact that we pass copy=False/True already gives this information.

Member

I understand, but comments like that one do help less experienced people to grasp the concepts faster and contribute to the project.

X_transform = transformer.transform(X_csr, copy=True)
assert_allclose_dense_sparse(X_csr, X_csr_original)
assert X_transform is not X_csr

Member

@StefanieSenger StefanieSenger Jan 15, 2024

Suggested change
# check that we transform in place
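
Piecing the quoted fragments together, the test under discussion might look roughly like this (the input matrix here is made up for illustration; assert_allclose_dense_sparse is scikit-learn's test helper):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.utils._testing import assert_allclose_dense_sparse

X_csr = sp.random(10, 20, density=0.3, format="csr", random_state=0)
X_csr_original = X_csr.copy()
transformer = TfidfTransformer().fit(X_csr)

# copy=True: the input is left untouched and a new matrix is returned
X_transform = transformer.transform(X_csr, copy=True)
assert_allclose_dense_sparse(X_csr, X_csr_original)
assert X_transform is not X_csr

# copy=False: the result must match, whether or not it reuses X's buffers
X_mutable = X_csr.copy()
X_transform_inplace = transformer.transform(X_mutable, copy=False)
assert_allclose_dense_sparse(X_transform_inplace, X_transform)
```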

Member

@adrinjalali adrinjalali left a comment

I agree with @StefanieSenger , we don't need idf_ to be a property anymore.

@glemaitre glemaitre self-requested a review January 15, 2024 18:00
Member

@glemaitre glemaitre left a comment

I made all the requested changes (apart from the comment in the tests). I'll give the dtype preservation a try in another PR.

Member

@adrinjalali adrinjalali left a comment

LGTM. But I was very confused by the fact that our tf is not really a frequency, as you'd expect from reading the wikipedia page on tf-idf; it's rather a raw count, which apparently is the norm in IR.

@glemaitre
Member

I opened #28136, where the dtype preservation seems quite straightforward. It would also impact TfidfVectorizer if the dtype is set to np.float32.

@adrinjalali adrinjalali merged commit 8a71b84 into scikit-learn:main Jan 15, 2024
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024