In feature extraction from text, what is the purpose of the stop_words_ attribute? #4032


Closed
lmichelbacher opened this issue Dec 30, 2014 · 11 comments

@lmichelbacher
Contributor

I have a feature extractor that uses max_features.

With a reasonably large collection, serializing the feature extractor can produce unnecessarily large files because of the stop_words_ attribute [1]. It contains the features that don't make it into the vocabulary (for which max_features is one possible reason).

As far as I can see, it's never used anywhere, so I was wondering if I could get rid of it (i.e. set it to None) before serialization. This shouldn't change the feature extractor's behavior after deserialization.

[1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L804
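A minimal sketch of the workaround being asked about (the corpus and parameter values are illustrative): clearing stop_words_ before pickling should leave transform unchanged, since transform only needs vocabulary_.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox",
    "lazy dogs sleep all day",
]

# max_features keeps only the most frequent terms; every other term
# seen during fitting lands in stop_words_.
vec = CountVectorizer(max_features=3)
vec.fit(corpus)
before = vec.transform(corpus).toarray()

# Clearing the attribute shrinks the pickle; transform() only consults
# vocabulary_, so behavior is unchanged after a round trip.
vec.stop_words_ = None
restored = pickle.loads(pickle.dumps(vec))
assert (restored.transform(corpus).toarray() == before).all()
```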

@raghavrv
Member

Perhaps defining __getstate__/__setstate__ methods and adding an option to exclude stop_words_ while pickling could be useful.
This should be done without breaking estimators that have already been pickled... @ogrisel thoughts?

@jnothman
Member

It's provided for those users that want it (and some may want it after serialization), but no, it's not used in transform. It's fine to do delattr(myvectorizer, 'stop_words_') before serialization to lighten the load.


@jnothman
Member

Or set it to None, equally. If you think this or similar is a worthwhile note to add to the documentation, feel free to add a PR (and even a test to ensure that this remains possible under changes to the vectorizer implementation).


@amueller
Member

We could add an option for not storing stop_words_... I would not mess with __setstate__ and __getstate__ currently.

@raghavrv
Member

> It's fine to do delattr(myvectorizer, 'stop_words_') before serialization to lighten the load.

> Or set it to None, equally. ............ a test to ensure that this remains possible under changes to the vectorizer implementation

> we could add an option for not storing stop_words_.... I would not mess with the setstate and getstate currently.

Have sent a PR #4037 for the above.

> If you think this or similar is a worthwhile note to add to the documentation

I feel this needs to be documented for all modules, commonly under model persistence...?

@amueller
Member

What would you document for all models? Most attributes are necessary for prediction, and it is unclear from the documentation which are needed and which are not.

@raghavrv
Member

> it is unclear from the documentation which are needed and which not

Ah... So the respective docs need to be updated with that information?

@jnothman
Member

No need to go overboard. If there's a real concern and we know users will benefit from the knowledge, then we can document it on a case-by-case basis. Most attributes that are unused at transform/predict time are not memory hogs, or are only memory hogs in certain cases. Properly documenting that would be arduous and mostly bloat.


@lmichelbacher
Contributor Author

Do people agree that the documentation of the attribute should be amended to include max_features? The docs suggest that only min_df/max_df influence it.

    stop_words_ : set
        Terms that were ignored because
        they occurred in either too many
        (`max_df`) or in too few (`min_df`) documents.
        This is only available if no vocabulary was given.

In general, I'm curious about the use case for this attribute. It essentially tells you about the features extracted from training data that you didn't use. In the context of transforming unseen documents, what's the benefit of knowing what you're not going to use? Or is the use case more in the training context where you might want to do some analysis on the features that were discarded?
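As a quick sanity check (illustrative corpus), max_features alone does populate stop_words_, even with min_df and max_df left at their defaults:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple banana cherry",
    "apple banana",
    "apple",
]

# min_df/max_df stay at their defaults; only max_features prunes here.
vec = CountVectorizer(max_features=2)
vec.fit(corpus)

# "apple" (3 occurrences) and "banana" (2) are kept; "cherry" is cut
# by max_features alone and shows up in stop_words_.
print(sorted(vec.vocabulary_))   # ['apple', 'banana']
print(sorted(vec.stop_words_))   # ['cherry']
```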

@jnothman
Member

jnothman commented Jan 1, 2015

It's not especially relevant to transforming documents, but allows you to perform manual inspection to determine if the parameters are reasonable, or to explain a failure to model the data well.

And yes, it should be modified to refer to dependence on max_features. Please submit a PR.

@jnothman
Member

jnothman commented Feb 1, 2015

Addressed in #4042
