Deprecate TfidfVectorizer #14966
I like this idea in general, but I think there will be heavy resistance given how much it is used. On a technical note,
vectorizer = Pipeline([
    ('vect', CountVectorizer(...)),
    ('idf', TfidfTransformer(..., copy=False))
]) |
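A minimal sketch of the equivalence discussed here, assuming default parameters on both sides. The toy `docs` list is invented for illustration, and the `copy=False` argument from the snippet above is left out because this sketch does not assume the constructor accepts it.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Two-step version: raw counts first, then IDF weighting and normalization.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("idf", TfidfTransformer()),
])
X_pipeline = pipeline.fit_transform(docs)

# Single-class version with the same (default) settings.
tfidf = TfidfVectorizer()
X_single = tfidf.fit_transform(docs)

# Both routes should learn the same vocabulary and produce the same weights.
same_vocabulary = pipeline.named_steps["vect"].vocabulary_ == tfidf.vocabulary_
same_weights = np.allclose(X_pipeline.toarray(), X_single.toarray())
print(same_vocabulary, same_weights)
```

With the pipeline form, each step is tuned through its own parameters (for example `vect__ngram_range` or `idf__sublinear_tf` in a grid search), which is the separation the proposal is after.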
This is also in response to confusions like
#8265
|
+1, I prefer to keep it, but won't oppose if we decide to deprecate it. TFIDF is a basic text preprocessing technique and I guess users will be unhappy if they need to use a pipeline?
I don't think the API is complex. We mention in the doc that "TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer." and parameters in TfidfVectorizer = parameters in CountVectorizer + parameters in TfidfTransformer. |
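One way to check the "parameters in TfidfVectorizer = parameters in CountVectorizer + parameters in TfidfTransformer" claim against an installed scikit-learn is to compare the constructor parameters reported by `get_params()`. This is only a sketch; the exact counts may differ between versions, so no expected output is shown.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Constructor parameter names of each estimator.
count_params = set(CountVectorizer().get_params())
idf_params = set(TfidfTransformer().get_params())
tfidf_params = set(TfidfVectorizer().get_params())

# Number of parameters on each class.
print(len(count_params), len(idf_params), len(tfidf_params))

# Parameters TfidfVectorizer adds on top of CountVectorizer.
print(sorted(tfidf_params - count_params))

# Anything left here is not explained by either component class.
print(sorted(tfidf_params - count_params - idf_params))
```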
+1, maybe also remove the copy parameter in transform. |
That is a risk yes, particularly with all the things that we have been deprecating lately.
What I mean by complex is that it has 21 parameters and performs numerous tasks (pre-processing, tokenization, n-gram concatenation, counting, IDF weighting, L2 normalization), some of which users are not aware of. As long as they use it as a black box it works fine; however, as soon as they need to tune any part of that pipeline (which in my experience happens often), problems start:
|
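To make the stages listed above concrete, here is a rough pure-Python sketch of what such a vectorizer does end to end. It is not scikit-learn's implementation: the helper names (`tokenize`, `ngrams`, `tfidf`) are hypothetical, and it assumes lowercasing, the token pattern `(?u)\b\w\w+\b`, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization.

```python
import math
import re
from collections import Counter

# Assumed token pattern: words of two or more characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Pre-processing (lowercasing) followed by tokenization."""
    return TOKEN_RE.findall(doc.lower())


def ngrams(tokens, n_min=1, n_max=1):
    """Concatenate consecutive tokens into n-grams for n in [n_min, n_max]."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def tfidf(docs, ngram_range=(1, 1)):
    """Counting, smoothed IDF weighting, then L2 normalization per document."""
    counts = [Counter(ngrams(tokenize(d), *ngram_range)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows


docs = ["the cat sat", "the dog sat down"]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[0])
```

Even in this stripped-down form there are six distinct steps hidden behind one call, which illustrates why tuning a single step through one class with many parameters can be confusing.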
I guess this risk is not trivial :)
That's because there're 17 parameters in
I think this is a separate issue? (e.g., split the vectorizer into several classes?)
Not sure whether this is a good reason. I think the doc is clear enough. "TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer." |
Shouldn't adding a |
+1 for adding copy to TfidfTransformer. PR welcome
|
Closing as it's unlikely to happen. |
Do you remember why this was deemed "unlikely to happen" @rth? My understanding is that in general we would be OK to add a |
I was just closing my PRs that I didn't see being merged in the near future. But overall yes, I think it would still be good if this happened, and since there is indeed no specific issue for it and all discussion is here I'll re-open.
That and having a reasonable way to get feature names in a |
Partially addresses #14951
Tentatively deprecates TfidfVectorizer in favor of using CountVectorizer in a pipeline with TfidfTransformer. The main argument in favour of the deprecation is the complexity of the API and the number of parameters in the TfidfVectorizer class. That would also make the use of CountVectorizer and HashingVectorizer more symmetric.

This will affect a lot of uses as TfidfVectorizer is frequently used. At least 3 approvals would probably be good as this is potentially controversial.

TODO: