TfidfVectorizer sorts by term frequency instead of the tfidf score when using max_features #8265

Closed
sergeyf opened this issue Feb 1, 2017 · 14 comments

Comments

@sergeyf
Contributor

sergeyf commented Feb 1, 2017

Description

When using the TfidfVectorizer with max_features=N (where N is not None), I would expect the algorithm to sort by the tfidf score and then take the top N features. Instead, it sorts by document frequency. I think this is not the expected behavior, even though it's properly documented.
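For illustration, a minimal sketch of this behaviour on a made-up toy corpus (not from the original report): with max_features set, the vocabulary is cut down to the most frequent terms, not the terms with the highest tf-idf weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Without a stop-word list, "the" has the highest frequency across the corpus,
# so it survives the max_features cut even though its idf (and hence its
# tf-idf weight) is the lowest of all terms.
vec = TfidfVectorizer(max_features=3)
vec.fit(corpus)
print(vec.get_feature_names_out())  # get_feature_names() on older scikit-learn
```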

@jnothman
Member

jnothman commented Feb 1, 2017

Well, selecting a subset of features cannot depend on tf... but are you arguing whether it should be selecting those with the minimum vs maximum DF?

@sergeyf
Contributor Author

sergeyf commented Feb 2, 2017

Hmm, maybe I'm missing something. I'm imagining a use case where the filtering is done after fitting for the purpose of, say, document classification. So I:

(1) Split my data into training and testing.
(2) Instantiate TfidfVectorizer.
(3) Train it on the training data.
(4) Keep only the max_features tokens with the top N tf-idf scores (a rough sketch of this workflow follows after this comment).
(5) Transform and train logistic multinomial regression.

Maybe this approach isn't what this featurizer is designed for?
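For concreteness, a rough sketch of the workflow in steps (1)-(5) above, using made-up toy data and a maximum-per-term aggregation of the tf-idf scores (the aggregation choice is an assumption; it is not something TfidfVectorizer offers):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder documents and labels, just to make the sketch runnable.
docs = [
    "cheap pills buy now", "meeting agenda attached", "win a free prize now",
    "project status update", "free cash offer now", "quarterly report attached",
]
labels = [1, 0, 1, 0, 1, 0]

# (1) split into training and test sets
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels
)

# (2)-(3) fit the vectorizer on the training data only
vec = TfidfVectorizer()
X_train = vec.fit_transform(docs_train)

# (4) keep the N terms with the highest (here: maximum per-term) tf-idf score;
#     this aggregation step is exactly what max_features does *not* do
N = 5
scores = np.asarray(X_train.max(axis=0).todense()).ravel()
top = np.argsort(scores)[::-1][:N]

# (5) train a logistic regression (multinomial softmax for multiclass targets
#     by default) on the reduced feature set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train[:, top], y_train)
print(clf.score(vec.transform(docs_test)[:, top], y_test))
```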

@jnothman
Member

jnothman commented Feb 2, 2017

Do you mean keep only max_features per document rather than max_features for all documents?

@sergeyf
Contributor Author

sergeyf commented Feb 2, 2017

No, I mean I want one big dictionary that is extracted from the training set. So the total dimensionality is N.

@jnothman
Member

jnothman commented Feb 2, 2017

I still don't really see why it's useful for tf to play a part here. Tell me what kinds of words you would want to keep with your strategy as opposed to that implemented.

@sergeyf
Contributor Author

sergeyf commented Feb 2, 2017

Oh, hey you're right. I made a mistake when I was thinking about this. Sorry for the trouble!

@sergeyf sergeyf closed this as completed Feb 2, 2017
@jnothman
Member

jnothman commented Feb 2, 2017 via email

@flaviozamponi

I'm sorry to intervene on a closed subject, but I have the same doubts as sergeyf, and this discussion couldn't help me.
Let me try to explain: if I use the TfidfVectorizer, I'm interested in the results coming out of it, so I would expect max_features=N (with N not None) to give me the terms with the highest tfidf value, not the terms "ordered by term frequency across the corpus" (citing the docs).
Put another way: in an English text I will have plenty of "the", so its term frequency across the corpus will be rather high (probably one of the highest), even though "the" is a completely uninformative term. If I now use TfidfVectorizer with max_features=N, with N=100 (and without the stop_words parameter), "the" will be in the results with a rather high value, because its term frequency across the corpus is high. On the contrary, if the max_features were ranked according to the tfidf value, "the" would score extremely low, since its idf value is very low, and it would not appear in the list at all.
Am I missing something?

@chwonghk01

Totally agree with @flaviozamponi; sorting by term frequency makes digits, stop words, etc. dominate the, say, top 100 features, which are not informative. They should be the top 100 features sorted by TF-IDF score. I suggest reopening this issue for discussion.

@jnothman
Member

They should be the top 100 features sorted by TF-IDF score.

That doesn't mean anything. TF-IDF provides a value per document, not a value per dataset. At the dataset level, all you have is IDF (or some currently unspecified aggregate of TF.IDF values across the dataset), which either selects the most frequent or the most infrequent terms to be kept.

I would welcome a PR deprecating max_features for removal as it is ambiguous. Diverse functionality can be achieved with a pipeline of CountVectorizer, TfidfTransformer, GenericUnivariateSelect.
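A minimal sketch of that pipeline, assuming a supervised setting (the chi2 score function and the toy data are illustrative choices, not part of the suggestion):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import GenericUnivariateSelect, chi2

docs = ["cheap pills now", "meeting agenda attached", "free prize now", "status report attached"]
y = [1, 0, 1, 0]  # placeholder labels; univariate selection needs a target

pipe = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    GenericUnivariateSelect(score_func=chi2, mode="k_best", param=3),
)
X = pipe.fit_transform(docs, y)
print(X.shape)  # (4, 3): three selected tf-idf features
```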

@chwonghk01

@jnothman Thanks for correcting me. I agree that using CountVectorizer, TfidfTransformer, GenericUnivariateSelect is a better choice.

@Ashz11

Ashz11 commented Feb 25, 2020

One thing I did was to:

  1. fit the TfIdfVectorizer() on the data and get the feature names and TF-IDF values,
  2. use np.argsort() on the TF-IDF values to get the indices in decreasing order (highest first),
  3. extract the words from get_feature_names() and filter them with the indices obtained above,
  4. fit the TfIdfVectorizer() again on the training data, setting the vocabulary param to the words obtained above (a sketch follows below).
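A sketch of this workaround; the maximum tf-idf value per term is used as the per-term score here, which is one possible reading, since the comment above does not say how the per-document values are combined:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs_train = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
N = 4  # number of terms to keep

# 1.-2. fit once, score each term and sort the indices in decreasing order
vec = TfidfVectorizer()
X = vec.fit_transform(docs_train)
scores = np.asarray(X.max(axis=0).todense()).ravel()
order = np.argsort(scores)[::-1]

# 3. map the top-N indices back to the corresponding words
terms = vec.get_feature_names_out()  # get_feature_names() on older scikit-learn
top_terms = terms[order[:N]]

# 4. fit again with the restricted vocabulary
vec_top = TfidfVectorizer(vocabulary=top_terms)
X_top = vec_top.fit_transform(docs_train)
print(top_terms, X_top.shape)
```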

@LydiaKoil

@jnothman Thanks for correcting me. I agree that using CountVectorizer, TfidfTransformer, GenericUnivariateSelect is a better choice.

Can we do GenericUnivariateSelect directly after TfidfVectorizer, rather than using TfidfTransformer on top of CountVectorizer?
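Since TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer, that variant should behave the same way; a minimal sketch under the same illustrative assumptions as above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2

docs = ["cheap pills now", "meeting agenda attached", "free prize now", "status report attached"]
y = [1, 0, 1, 0]  # placeholder labels

pipe = make_pipeline(
    TfidfVectorizer(),
    GenericUnivariateSelect(score_func=chi2, mode="k_best", param=3),
)
print(pipe.fit_transform(docs, y).shape)  # (4, 3)
```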

@khughitt

At the dataset level, all you have is IDF (or some currently unspecified aggregate of TF.IDF values across the dataset)

Do you think it would be feasible, given the current architecture, to allow selecting terms based on such an aggregation of TF-IDF values across the dataset (e.g. sum, mean, or median)?

It seems like max_features is not intended to do this, and as such, it should probably be implemented separately, if at all, but I mention it here since there has already been an active discussion about the topic.

The basic motivation/intuition is the same as what others have suggested in various forms above:

  • It would be nice to be able to choose a subset of ~"informative" terms
  • TF-IDF helps to differentiate between more or less informative words at the document level
  • Just using word frequency (max_features) tends to just pick common words (~missed stop words), which is not very helpful for understanding the similarities/differences between documents.

Since this is probably a common goal/use case, if the current code is amenable to such an aggregation -> term selection approach, it could be a really nice feature to have.
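A sketch of the kind of aggregation described above, using the mean tf-idf per term across the documents (sum or median would work the same way); this is not an existing TfidfVectorizer option:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # shape: (n_documents, n_terms)

# Aggregate tf-idf per term across the corpus, then keep the highest-scoring terms.
mean_tfidf = np.asarray(X.mean(axis=0)).ravel()
terms = vec.get_feature_names_out()
top = np.argsort(mean_tfidf)[::-1][:5]
print(list(zip(terms[top], mean_tfidf[top].round(3))))
```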
