Deprecate TfidfVectorizer #14966

Open
wants to merge 4 commits into
base: main
from

Conversation

rth
Member

@rth rth commented Sep 12, 2019

Partially addresses #14951

Tentatively deprecates TfidfVectorizer in favor of using CountVectorizer in a pipeline with TfidfTransformer. The main argument in favour of the deprecation is the complexity of the API and the number of parameters in the TfidfVectorizer class. That would also make the use of CountVectorizer and HashingVectorizer more symmetric.
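
For illustration, a minimal sketch of the replacement described above, using a toy corpus made up for this example. It only checks that the two-step pipeline yields the same matrix as TfidfVectorizer with default parameters; it is not taken from the PR diff.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Single-class version that the PR proposes to deprecate.
X_single = TfidfVectorizer().fit_transform(docs)

# Two-step replacement: raw token counts, then IDF weighting and L2 normalization.
pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
X_pipe = pipe.fit_transform(docs)

# With default parameters both routes should give the same TF-IDF matrix.
same = np.allclose(X_single.toarray(), X_pipe.toarray())
print(same)
```

Non-default options would have to be split between the two steps by hand: tokenization and vocabulary options go to CountVectorizer, weighting and normalization options go to TfidfTransformer.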

This will affect a lot of uses as TfidfVectorizer is frequently used. At least 3 approvals would probably be good as this is potentially controversial.

TODO:

  • make sure the documentation is fully consistent. I can do it once we are sure we want to deprecate it.

@thomasjpfan
Member

I like this idea in general, but I think there will be heavy resistance given how much it is used.

On a technical note, TfidfTransformer copies by default, and TfidfVectorizer gets around this by calling transform(..., copy=False). We may need to add a copy parameter to TfidfTransformer to have the same behavior:

vectorizer = Pipeline([
    ('vect', CountVectorizer(...)),
    ('idf', TfidfTransformer(..., copy=False))
])
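
The snippet above uses a constructor argument that is only being proposed. As a rough sketch of the behavior that exists today, the `copy` flag can be passed to `TfidfTransformer.transform` directly. The corpus is invented, and whether `copy=False` actually overwrites the input may depend on the scikit-learn version, so the example only checks the parts that should hold regardless.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, invented for this example.
docs = ["the cat sat", "the dog sat", "cats and dogs"]

# Ask for float counts up front; with integer counts the dtype conversion
# inside transform() would force a copy anyway.
counts = CountVectorizer(dtype=np.float64).fit_transform(docs)
original = counts.copy()

tfidf = TfidfTransformer().fit(counts)

# Default (copy=True): the input count matrix is left untouched.
X_copy = tfidf.transform(counts)
input_preserved = np.array_equal(counts.toarray(), original.toarray())

# copy=False: the transformer is allowed to reuse the input's buffers,
# so `counts` should not be relied on after this call.
X_inplace = tfidf.transform(counts, copy=False)
same_values = np.allclose(X_copy.toarray(), X_inplace.toarray())
```

Inside a Pipeline, `fit_transform` is called without this argument, which is why a constructor-level parameter is being discussed.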

@jnothman
Member

jnothman commented Sep 13, 2019 via email

@qinhanmin2014
Member

I like this idea in general, but I think there will be heavy resistance given how much it is used.

+1, I prefer to keep it, but won't oppose if we decide to deprecate it. TFIDF is a basic text preprocessing technique and I guess users will be unhappy if they need to use a pipeline?

The main argument in favour of the deprecation is the complexity of the API and the number of parameters in the TfidfVectorizer class.

I don't think the API is complex. We mention in the doc that "TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer." and parameters in TfidfVectorizer = parameters in CountVectorizer + parameters in TfidfTransformer.

@qinhanmin2014
Member

We may need to add a copy parameter to TfidfTransformer to have the same behavior:

+1, maybe also remove the copy parameter in transform.

@rth
Member Author

rth commented Oct 1, 2019

TFIDF is a basic text preprocessing technique and I guess users will be unhappy if they need to use a pipeline?

That is a risk yes, particularly with all the things that we have been deprecating lately.

I don't think the API is complex.

What I mean by complex is that it has 21 parameters and performs numerous tasks (pre-processing, tokenization, n-grams concatenation, counting, IDF weighting, L2 normalization), some of which users are not aware of. As long as they use it as a black box it works fine; however, as soon as they need to tune any part of that pipeline (which in my experience happens often), problems start:

@qinhanmin2014
Member

That is a risk yes, particularly with all the things that we have been deprecating lately.

I guess this risk is not trivial :)

What I mean by complex is that it has 21 parameters

That's because there are 17 parameters in CountVectorizer; TfidfTransformer only introduces 4 extra parameters.
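
A quick way to check this split, sketched with `get_params()`. The exact counts depend on the installed scikit-learn version, so the sketch prints them rather than assuming 17, 4 and 21.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Constructor parameter names of each class, as reported by get_params().
count_params = set(CountVectorizer().get_params())
idf_params = set(TfidfTransformer().get_params())
combined_params = set(TfidfVectorizer().get_params())

print(len(count_params), len(idf_params), len(combined_params))

# Parameters of TfidfVectorizer that come from neither of the two classes.
only_in_combined = combined_params - (count_params | idf_params)
print(sorted(only_in_combined))
```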

performs numerous tasks (pre-processing, tokenization, n-grams concatenation, counting, IDF weighting, L2 normalization), some of which users are not aware of.

I think this is a separate issue? (e.g., split the vectorizer into several classes?)

users don't realize the options they use (e.g. #14748) because instead of considering each stage of the pipeline, they are presented with these 21 parameters at once.

Not sure whether this is a good reason. I think the doc is clear enough. "TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer."

@phtully

phtully commented Dec 19, 2019

I like this idea in general, but I think there will be heavy resistance given how much it is used.

On a technical note, TfidfTransformer copies by default, and TfidfVectorizer gets around this by calling transform(..., copy=False). We may need to add a copy parameter to TfidfTransformer to have the same behavior:

vectorizer = Pipeline([
    ('vect', CountVectorizer(...)),
    ('idf', TfidfTransformer(..., copy=False))
])

Shouldn't adding a copy parameter to TfidfTransformer be a good issue to address regardless? As of now, if one wanted to call fit_transform on a TfidfTransformer within a pipeline, they wouldn't be able to set the copy parameter. It is only useful when calling fit() and transform() separately atm.

@jnothman
Member

jnothman commented Dec 19, 2019 via email

@rth rth mentioned this pull request Dec 25, 2019
@rth
Member Author

rth commented Feb 21, 2020

Closing as it's unlikely to happen.

@rth rth closed this Feb 21, 2020
@NicolasHug
Member

Do you remember why this was deemed "unlikely to happen" @rth?

My understanding is that in general we would be OK to add a copy param to TfidfTransformer and then deprecate TfidfVectorizer?

@rth
Member Author

rth commented Apr 28, 2020

I was just closing my PRs that I didn't see being merged in the near future. But overall yes, I think it would still be good if this happened, and since there is indeed no specific issue for it and all discussion is here I'll re-open.

My understanding is that in general we would be OK to add a copy param to TfidfTransformer and then deprecate TfidfVectorizer?

That and having a reasonable way to get feature names in a make_pipeline(CountVectorizer(), TfidfTransformer()) without pipeline slicing.

@rth rth reopened this Apr 28, 2020
Base automatically changed from master to main January 22, 2021 10:51
@cmarmo cmarmo added the Needs Decision Requires decision label Feb 14, 2022