TfidfVectorizer sorts by term frequency instead of the tfidf score when using max_features #8265

Closed
sergeyf opened this issue Feb 1, 2017 · 14 comments

Comments

@sergeyf
Contributor

sergeyf commented Feb 1, 2017

Description

When using the TfidfVectorizer with max_features=N (where N is not None), I would expect the algorithm to sort by the tfidf score and then take the top N features. Instead, it sorts by document frequency. I think this is not the expected behavior, even though it's properly documented.
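For illustration, a minimal sketch of this behaviour on a made-up toy corpus (not from the original report): with max_features set, the vocabulary is cut down to the most frequent terms, not the terms with the highest tf-idf weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Without a stop-word list, "the" has the highest frequency across the corpus,
# so it survives the max_features cut even though its idf (and hence its
# tf-idf weight) is the lowest of all terms.
vec = TfidfVectorizer(max_features=3)
vec.fit(corpus)
print(vec.get_feature_names_out())  # get_feature_names() on older scikit-learn
```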

@jnothman
Member

jnothman commented Feb 1, 2017

Well, selecting a subset of features cannot depend on tf... but are you arguing whether it should be selecting those with the minimum vs maximum DF?

@sergeyf
Contributor Author

sergeyf commented Feb 2, 2017

Hmm, maybe I'm missing something. I'm imagining a use case where the filtering is done after fitting for the purpose of, say, document classification. So I:

(1) Split my data into training and testing.
(2) Instantiate TfidfVectorizer.
(3) Train it on the training data.
(4) Keep only the max_features tokens with the top N tf-idf scores (a rough sketch of this workflow follows after this comment).
(5) Transform and train logistic multinomial regression.

Maybe this approach isn't what this featurizer is designed for?
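For concreteness, a rough sketch of the workflow in steps (1)-(5) above, using made-up toy data and a maximum-per-term aggregation of the tf-idf scores (the aggregation choice is an assumption; it is not something TfidfVectorizer offers):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder documents and labels, just to make the sketch runnable.
docs = [
    "cheap pills buy now", "meeting agenda attached", "win a free prize now",
    "project status update", "free cash offer now", "quarterly report attached",
]
labels = [1, 0, 1, 0, 1, 0]

# (1) split into training and test sets
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels
)

# (2)-(3) fit the vectorizer on the training data only
vec = TfidfVectorizer()
X_train = vec.fit_transform(docs_train)

# (4) keep the N terms with the highest (here: maximum per-term) tf-idf score;
#     this aggregation step is exactly what max_features does *not* do
N = 5
scores = np.asarray(X_train.max(axis=0).todense()).ravel()
top = np.argsort(scores)[::-1][:N]

# (5) train a logistic regression (multinomial softmax for multiclass targets
#     by default) on the reduced feature set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train[:, top], y_train)
print(clf.score(vec.transform(docs_test)[:, top], y_test))
```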

@jnothman
Member

jnothman commented Feb 2, 2017

Do you mean keep only max_features per document rather than max_features for all documents?

@sergeyf
Contributor Author

sergeyf commented Feb 2, 2017

No, I mean I want one big dictionary that is extracted from the training set. So the total dimensionality is N.

@jnothman
Member

jnothman commented Feb 2, 2017

I still don't really see why it's useful for tf to play a part here. Tell me what kinds of words you would want to keep with your strategy as opposed to that implemented.

@sergeyf
Contributor Author

sergeyf commented Feb 2, 2017

Oh, hey you're right. I made a mistake when I was thinking about this. Sorry for the trouble!

@sergeyf sergeyf closed this as completed Feb 2, 2017
@jnothman
Member

jnothman commented Feb 2, 2017 via email

@flaviozamponi

I'm sorry to intervene on a closed subject, but I have the same doubts as sergeyf, and this discussion couldn't help me.
Let me try to explain: if I use the TfidfVectorizer, I'm interested in the results coming out of it, so I would expect max_features=N (with N not None) to give me the terms with the highest tfidf value, not the terms "ordered by term frequency across the corpus" (citing the docs).
Put another way: in an English text I will have plenty of "the", so its term frequency across the corpus will be rather high (probably one of the highest), even though "the" is a completely uninformative term. If I now use TfidfVectorizer with max_features=N, with N=100 (and without the stop_words parameter), "the" will be in the results with a rather high value, because its term frequency across the corpus is high. On the contrary, if the max_features were ranked according to the tfidf value, "the" would score extremely low, since its idf value is very low, and it would not appear in the list at all.
Am I missing something?

@chwonghk01

Totally agree with @flaviozamponi; sorting by term frequency makes digits, stop words, etc. dominate the, say, top 100 features, which are not informative. They should be the top 100 features sorted by TF-IDF score. I suggest reopening this issue for discussion.

@jnothman
Member

They should be the top 100 features sorted by TF-IDF score.

That doesn't mean anything. TF-IDF provides a value per document, not a value per dataset. At the dataset level, all you have is IDF (or some currently unspecified aggregate of TF.IDF values across the dataset), which either selects the most frequent or the most infrequent terms to be kept.

I would welcome a PR deprecating max_features for removal as it is ambiguous. Diverse functionality can be achieved with a pipeline of CountVectorizer, TfidfTransformer, GenericUnivariateSelect.
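A minimal sketch of that pipeline, assuming a supervised setting (the chi2 score function and the toy data are illustrative choices, not part of the suggestion):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import GenericUnivariateSelect, chi2

docs = ["cheap pills now", "meeting agenda attached", "free prize now", "status report attached"]
y = [1, 0, 1, 0]  # placeholder labels; univariate selection needs a target

pipe = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    GenericUnivariateSelect(score_func=chi2, mode="k_best", param=3),
)
X = pipe.fit_transform(docs, y)
print(X.shape)  # (4, 3): three selected tf-idf features
```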

@chwonghk01

@jnothman Thanks for correcting me. I agree that using CountVectorizer, TfidfTransformer, GenericUnivariateSelect is a better choice.

@Ashz11

Ashz11 commented Feb 25, 2020

One thing I did was to:

  1. fit the TfIdfVectorizer() on the data and get the feature names and TF-IDF values,
  2. use np.argsort() on the TF-IDF values to get the indices in decreasing order (highest first),
  3. extract the words from get_feature_names() and filter them with the indices obtained above,
  4. fit the TfIdfVectorizer() again on the training data, setting the vocabulary param to the words obtained above (a sketch follows below).
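A sketch of this workaround; the maximum tf-idf value per term is used as the per-term score here, which is one possible reading, since the comment above does not say how the per-document values are combined:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs_train = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
N = 4  # number of terms to keep

# 1.-2. fit once, score each term and sort the indices in decreasing order
vec = TfidfVectorizer()
X = vec.fit_transform(docs_train)
scores = np.asarray(X.max(axis=0).todense()).ravel()
order = np.argsort(scores)[::-1]

# 3. map the top-N indices back to the corresponding words
terms = vec.get_feature_names_out()  # get_feature_names() on older scikit-learn
top_terms = terms[order[:N]]

# 4. fit again with the restricted vocabulary
vec_top = TfidfVectorizer(vocabulary=top_terms)
X_top = vec_top.fit_transform(docs_train)
print(top_terms, X_top.shape)
```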

@LydiaKoil

@jnothman Thanks for correcting me. I agree that using CountVectorizer, TfidfTransformer, GenericUnivariateSelect is a better choice.

Can we do GenericUnivariateSelect directly after TfidfVectorizer, rather than using TfidfTransformer on top of CountVectorizer?
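Since TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer, that variant should behave the same way; a minimal sketch under the same illustrative assumptions as above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2

docs = ["cheap pills now", "meeting agenda attached", "free prize now", "status report attached"]
y = [1, 0, 1, 0]  # placeholder labels

pipe = make_pipeline(
    TfidfVectorizer(),
    GenericUnivariateSelect(score_func=chi2, mode="k_best", param=3),
)
print(pipe.fit_transform(docs, y).shape)  # (4, 3)
```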

@khughitt

At the dataset level, all you have is IDF (or some currently unspecified aggregate of TF.IDF values across the dataset)

Do you think it would be feasible, given the current architecture, to allow selecting terms based on such an aggregation of TF-IDF values across the dataset (e.g. sum, mean, or median)?

It seems like max_features is not intended to do this, and as such, it should probably be implemented separately, if at all, but I mention it here since there has already been an active discussion about the topic.

The basic motivation/intuition is the same as what others have suggested in various forms above:

  • It would be nice to be able to choose a subset of ~"informative" terms
  • TF-IDF helps to differentiate between more or less informative words at the document level
  • Just using word frequency (max_features) tends to just pick common words (~missed stop words), which is not very helpful for understanding the similarities/differences between documents.

Since this is probably a common goal/use case, if the current code is amenable to such an aggregation -> term selection approach, it could be a really nice feature to have.
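A sketch of the kind of aggregation described above, using the mean tf-idf per term across the documents (sum or median would work the same way); this is not an existing TfidfVectorizer option:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # shape: (n_documents, n_terms)

# Aggregate tf-idf per term across the corpus, then keep the highest-scoring terms.
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
terms = vec.get_feature_names_out()
top = np.argsort(mean_tfidf)[::-1][:5]
print(list(zip(terms[top], mean_tfidf[top].round(3))))
```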
