TfidfVectorizer sorts by term frequency instead of the tfidf score when using max_features
#8265
Comments
Well, selecting a subset of features cannot depend on tf... but are you arguing about whether it should select terms with the minimum vs. maximum DF?
Hmm, maybe I'm missing something. I'm imagining a use case where the filtering is done after fitting for the purpose of, say, document classification. So I: (1) Split my data into training and testing. Maybe this approach isn't what this featurizer is designed for?
Do you mean keep only
No, I mean I want one big dictionary that is extracted from the training set. So the total dimensionality is
I still don't really see why it's useful for tf to play a part here. Tell me what kinds of words you would want to keep with your strategy as opposed to that implemented.
Oh, hey you're right. I made a mistake when I was thinking about this. Sorry for the trouble!
No worries! There's been no shortage of concerns about our TFIDF implementation, so I'm glad it's right on this front!
I'm sorry to intervene on a closed subject, but I have the same doubts as sergeyf, and this discussion couldn't help me.
Totally agree with @flaviozamponi: sorting by term frequency would make digits, stop words, etc. dominate the, say, top 100 features, which are not informative. They should be the top 100 features sorted by TF-IDF score. I suggest reopening this issue for discussion.
That doesn't mean anything. TF-IDF provides a value per document, not a value per dataset. At the dataset level, all you have is IDF (or some currently unspecified aggregate of TF.IDF values across the dataset), which either selects the most frequent or the most infrequent terms to be kept. I would welcome a PR deprecating
@jnothman Thanks for correcting me. I agree that using
One thing which I did was to
Can we apply GenericUnivariateSelect directly after TfidfVectorizer, rather than using TfidfTransformer on top of CountVectorizer?
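For what it's worth, a hedged sketch of that idea (the toy corpus, labels, and `param=2` are made up here): univariate selection can be chained directly after `TfidfVectorizer` in a `Pipeline`, since it operates on the tf-idf matrix regardless of which transformer produced it.

```python
# Sketch: GenericUnivariateSelect applied directly to TfidfVectorizer output.
# chi2 requires non-negative features, which tf-idf values are.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good film", "terrible film"]
y = [1, 0, 1, 0]  # hypothetical labels

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", GenericUnivariateSelect(chi2, mode="k_best", param=2)),
])
X = pipe.fit_transform(docs, y)
print(X.shape)  # two selected features remain
```

Note that this selection is supervised (it needs `y`), unlike `max_features`, which is unsupervised.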
Do you think it would be feasible, given the current architecture, to allow selecting terms based on such an aggregation of TF-IDF values across the dataset? The basic motivation/intuition is the same as what others have suggested in various forms above:
Since this is probably a common goal/use case, if the current code is amenable to such an aggregation -> term selection approach, it could be a really nice feature to have.
Description

When using the `TfidfVectorizer` with `max_features=N` (where `N` is not `None`), I would expect the algorithm to sort by the tf-idf score and then take the top `N` features. Instead, it sorts by document frequency. I think this is not the expected behavior, even though it's properly documented.