Deprecate TfidfVectorizer #14966

rth · 2019-09-12T13:08:17Z

Partially addresses #14951

Tentatively deprecates TfidfVectorizer in favor of using CountVectorizer in a pipeline with TfidfTransformer. The main argument in favour of the deprecation is the complexity of the API and the number of parameters in the TfidfVectorizer class. That would also make the use of CountVectorizer and HashingVectorizer more symmetric.
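For reference, a minimal sketch of the replacement this PR has in mind, assuming a scikit-learn install where all three classes are importable. The toy `docs` corpus and the variable names are illustrative only and not taken from the PR.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for this illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Single-class version that the PR proposes to deprecate.
X_single = TfidfVectorizer().fit_transform(docs)

# Two-step version: raw counts first, then TF-IDF weighting.
pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
X_pipe = pipe.fit_transform(docs)

# With default parameters both routes should give the same matrix.
same = np.allclose(X_single.toarray(), X_pipe.toarray())
print(same)
```

With non-default parameters, the count-related arguments would go to CountVectorizer and the weighting-related ones to TfidfTransformer.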

This will affect a lot of uses as TfidfVectorizer is frequently used. At least 3 approvals would probably be good as this is potentially controversial.

TODO:

make sure the documentation is fully consistent. I can do it once we are sure we want to deprecate it.

thomasjpfan · 2019-09-12T16:32:48Z

I like this idea in general, but I think there will be heavy resistance given how much it is used.

On a technical note, TfidfTransformer copies by default, and TfidfVectorizer gets around this by calling transform(..., copy=False). We may need to add a copy parameter to TfidfTransformer to have the same behavior:

vectorizer = Pipeline([
    ('vect', CountVectorizer(...)),
    ('idf', TfidfTransformer(..., copy=False))
])

jnothman · 2019-09-13T00:50:13Z

This is also in response to confusions like #8265

qinhanmin2014 · 2019-09-15T03:52:54Z

I like this idea in general, but I think there will be heavy resistance given how much it is used.

+1, I prefer to keep it, but won't oppose if we decide to deprecate it. TFIDF is a basic text preprocessing technique and I guess users will be unhappy if they need to use a pipeline?

The main argument in favour of the deprecation is the complexity of the API and the number of parameters in the TfidfVectorizer class.

I don't think the API is complex. We mention in the doc that "TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer." and parameters in TfidfVectorizer = parameters in CountVectorizer + parameters in TfidfTransformer.

qinhanmin2014 · 2019-09-15T03:56:16Z

We may need to add a copy parameter to TfidfTransformer to have the same behavior:

+1, maybe also remove the copy parameter in transform.

rth · 2019-10-01T09:15:26Z

TFIDF is a basic text preprocessing technique and I guess users will be unhappy if they need to use a pipeline?

That is a risk yes, particularly with all the things that we have been deprecating lately.

I don't think the API is complex.

What I mean by complex is that it has 21 parameters and performs numerous tasks (pre-processing, tokenization, n-grams concatenation, counting, IDF weighting, L2 normalization), some of which users are not aware of. As long as they use it as a black box it works fine; however, as soon as they need to tune any part of that pipeline (which in my experience happens often), problems start:

the way you have to subclass vectorizers to plug something into that pipeline, while re-using the rest, is messy. Deprecating TfidfVectorizer is only one step in the direction of isolating that to a single CountVectorizer class, to be able to do something with it later.
users don't realize the options they use (e.g. Added option to use standard idf term for TfidfTransformer and TfidfVectorizer #14748) because instead of considering each stage of the pipeline, they are presented with these 21 parameters at once.
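The stages listed above can be looked at one by one. A small sketch, assuming the `build_preprocessor`, `build_tokenizer` and `build_analyzer` helpers that CountVectorizer exposes; the sample sentence is made up. It covers only the text-side stages, since IDF weighting and L2 normalization happen afterwards in TfidfTransformer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams, everything else left at the defaults.
vect = CountVectorizer(ngram_range=(1, 2))

preprocess = vect.build_preprocessor()  # lowercasing (accent stripping is off by default)
tokenize = vect.build_tokenizer()       # splits text using the default token_pattern
analyze = vect.build_analyzer()         # preprocessing + tokenization + n-grams

text = "The Cat sat."
lowered = preprocess(text)
tokens = tokenize(lowered)
ngrams = analyze(text)

print(lowered)
print(tokens)
print(ngrams)
```

Subclassing is only needed when one of these callables has to be replaced while keeping the others, which is the messy part described above.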

qinhanmin2014 · 2019-10-02T03:10:19Z

That is a risk yes, particularly with all the things that we have been deprecating lately.

I guess this risk is not trivial :)

What I mean by complex is that it has 21 parameters

That's because there are 17 parameters in CountVectorizer; TfidfTransformer only introduces 4 extra parameters.
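The parameter arithmetic can be checked directly with `get_params()`. This is a sketch only: the exact counts depend on the installed scikit-learn version, so the 17 + 4 = 21 figure from the discussion is not hard-coded here.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Constructor parameter names of each estimator.
count_params = set(CountVectorizer().get_params())
tfidf_params = set(TfidfTransformer().get_params())
combined_params = set(TfidfVectorizer().get_params())

print(len(count_params), len(tfidf_params), len(combined_params))

# TfidfVectorizer's parameters should be the union of the other two sets.
print(combined_params == count_params | tfidf_params)
```

Note that `dtype` appears on both CountVectorizer and TfidfVectorizer but with a different default, so the names match even where the defaults do not.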

performs numerous tasks (pre-processing, tokenization, n-grams concatenation, counting, IDF weighting, L2 normalization), some of which users are not aware of.

I think this is a separate issue? (e.g., split the vectorizer into several classes?)

users don't realize the options they use (e.g. #14748) because instead of considering each stage of the pipeline, they are presented with these 21 parameters at once.

Not sure whether this is a good reason. I think the doc is clear enough. "TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer."

phtully · 2019-12-19T21:31:22Z

I like this idea in general, but I think there will be heavy resistance given how much it is used.

On a technical note, TfidfTransformer copies by default, and TfidfVectorizer gets around this by calling transform(..., copy=False). We may need to add a copy parameter to TfidfTransformer to have the same behavior:
vectorizer = Pipeline([
    ('vect', CountVectorizer(...)),
    ('idf', TfidfTransformer(..., copy=False))
])

Wouldn't adding a copy parameter to TfidfTransformer be worth doing regardless? As of now, if one wanted to call fit_transform on a TfidfTransformer within a pipeline, they wouldn't be able to set the copy parameter. It is only useful when calling fit() and transform() separately atm.
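To make the limitation concrete, here is a sketch of the only place where `copy` can currently be passed, namely a separate `transform()` call. It assumes `TfidfTransformer.transform(X, copy=...)` as described in this thread; the corpus is made up. Whether the input buffer is actually reused depends on the input dtype and the library version, so the sketch only shows the call pattern and that the values agree.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, made up for this illustration.
docs = ["red green blue", "green blue blue", "red red"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit(counts)

# Default behaviour: transform works on a copy of the input.
X_default = tfidf.transform(counts)

# copy can only be passed here, not through Pipeline.fit_transform.
X_nocopy = tfidf.transform(counts, copy=False)

values_match = np.allclose(X_default.toarray(), X_nocopy.toarray())
print(values_match)
```

A constructor-level `copy` parameter, as proposed above, would let a Pipeline carry the same setting; it is not assumed to exist in this sketch.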

jnothman · 2019-12-19T23:26:46Z

+1 for adding copy to TfidfTransformer. PR welcome

rth · 2020-02-21T16:07:14Z

Closing as it's unlikely to happen.

NicolasHug · 2020-04-28T17:47:28Z

Do you remember why this was deemed "unlikely to happen" @rth?

My understanding is that in general we would be OK to add a copy param to TfidfTransformer and then deprecate TfidfVectorizer?

rth · 2020-04-28T17:55:14Z

I was just closing my PRs that I didn't see being merged in the near future. But overall yes, I think it would still be good if this happened, and since there is indeed no specific issue for it and all discussion is here I'll re-open.

My understanding is that in general we would be OK to add a copy param to TfidfTransformer and then deprecate TfidfVectorizer?

That and having a reasonable way to get feature names in a make_pipeline(CountVectorizer(), TfidfTransformer()) without pipeline slicing.
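For context, a sketch of the kind of workaround this refers to: reaching into the pipeline to read the vocabulary from the CountVectorizer step. It relies on the `vocabulary_` attribute and on the step name that `make_pipeline` generates from the class name; the two-document corpus is made up. Newer releases may offer a more direct route, which is not assumed here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for this illustration.
docs = ["apple banana", "banana cherry"]

pipe = make_pipeline(CountVectorizer(), TfidfTransformer()).fit(docs)

# make_pipeline names each step after its lowercased class name.
vocab = pipe.named_steps["countvectorizer"].vocabulary_

# vocabulary_ maps term -> column index; sort terms by column to get names in order.
feature_names = sorted(vocab, key=vocab.get)
print(feature_names)
```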

rth added 4 commits September 12, 2019 11:29

Deprecating TfidfVectorizer

d8fa3db

Handle deprecation warnings in tests

2677668

Fix examples

052b60e

Lint

13dfb66

rth mentioned this pull request Dec 25, 2019

TffvVectorizer Enconding #15970

Closed

rth closed this Feb 21, 2020

rth reopened this Apr 28, 2020

github-actions bot added module:datasets module:feature_extraction labels Apr 28, 2020

Base automatically changed from master to main January 22, 2021 10:51

cmarmo added the Needs Decision Requires decision label Feb 14, 2022
