In feature extraction from text, what is the purpose of the stop_words_ attribute? #4032


Closed
lmichelbacher opened this issue Dec 30, 2014 · 11 comments

@lmichelbacher
Contributor

I have a feature extractor that uses max_features.

With a reasonably large collection, serializing the feature extractor can produce unnecessarily large files because of the stop_words_ attribute [1]. It contains the features that don't make it into the vocabulary (for which max_features is one possible reason).

As far as I can see, it's never used anywhere, so I was wondering if I could get rid of it (i.e. set it to None) before serialization. This shouldn't change the feature extractor's behavior after deserialization.

[1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L804
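A minimal sketch of the workaround being asked about (the corpus and parameter values are illustrative): clearing stop_words_ before pickling should leave transform unchanged, since transform only needs vocabulary_.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox",
    "lazy dogs sleep all day",
]

# max_features keeps only the most frequent terms; every other term
# seen during fitting lands in stop_words_.
vec = CountVectorizer(max_features=3)
vec.fit(corpus)
before = vec.transform(corpus).toarray()

# Clearing the attribute shrinks the pickle; transform() only consults
# vocabulary_, so behavior is unchanged after a round trip.
vec.stop_words_ = None
restored = pickle.loads(pickle.dumps(vec))
assert (restored.transform(corpus).toarray() == before).all()
```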

@raghavrv
Member

Perhaps defining __getstate__/__setstate__ methods and adding an option to exclude stop_words_ while pickling could be useful.
This should be done without breaking estimators that have already been pickled... @ogrisel thoughts?

@jnothman
Member

It's provided for those users that want it (and some may want it after serialization), but no, it's not used in transform. It's fine to do delattr(myvectorizer, 'stop_words_') before serialization to lighten the load.


@jnothman
Member

Or set it to None, equally. If you think this or similar is a worthwhile note to add to the documentation, feel free to add a PR (and even a test to ensure that this remains possible under changes to the vectorizer implementation).


@amueller
Member

We could add an option for not storing stop_words_... I would not mess with __setstate__ and __getstate__ currently.

@raghavrv
Member

> It's fine to do delattr(myvectorizer, 'stop_words_') before serialization to lighten the load.

> Or set it to None, equally. ............ a test to ensure that this remains possible under changes to the vectorizer implementation

> we could add an option for not storing stop_words_.... I would not mess with the setstate and getstate currently.

Have sent a PR #4037 for the above.

> If you think this or similar is a worthwhile note to add to the documentation

I feel this needs to be documented for all modules, commonly under model persistence...?

@amueller
Member

What would you document for all models? Most attributes are necessary for prediction, and it is unclear from the documentation which are needed and which are not.

@raghavrv
Member

> it is unclear from the documentation which are needed and which not

Ah... So the respective docs need to be updated with that information?

@jnothman
Member

No need to go overboard. If there's a real concern and we know users will benefit from the knowledge, then we can document it on a case-by-case basis. Most attributes that are unused at transform/predict time are not memory hogs, or are only memory hogs in certain cases. Properly documenting that would be arduous and mostly bloat.


@lmichelbacher
Contributor Author

Do people agree that the documentation of the attribute should be amended to include max_features? The docs suggest that only min_df/max_df influence it.

    stop_words_ : set
        Terms that were ignored because
        they occurred in either too many
        (`max_df`) or in too few (`min_df`) documents.
        This is only available if no vocabulary was given.

In general, I'm curious about the use case for this attribute. It essentially tells you about the features extracted from training data that you didn't use. In the context of transforming unseen documents, what's the benefit of knowing what you're not going to use? Or is the use case more in the training context where you might want to do some analysis on the features that were discarded?
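As a quick sanity check (illustrative corpus), max_features alone does populate stop_words_, even with min_df and max_df left at their defaults:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple banana cherry",
    "apple banana",
    "apple",
]

# min_df/max_df stay at their defaults; only max_features prunes here.
vec = CountVectorizer(max_features=2)
vec.fit(corpus)

# "apple" (3 occurrences) and "banana" (2) are kept; "cherry" is cut
# by max_features alone and shows up in stop_words_.
print(sorted(vec.vocabulary_))   # ['apple', 'banana']
print(sorted(vec.stop_words_))   # ['cherry']
```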

@jnothman
Member

jnothman commented Jan 1, 2015

It's not especially relevant to transforming documents, but allows you to perform manual inspection to determine if the parameters are reasonable, or to explain a failure to model the data well.

And yes, it should be modified to refer to dependence on max_features. Please submit a PR.

@jnothman
Member

jnothman commented Feb 1, 2015

Addressed in #4042
