In feature extraction from text, what is the purpose of the stop_words_ attribute? #4032
Comments
Perhaps defining |
It's provided for those users that want it (and some may want it after On 31 December 2014 at 03:46, ragv [email protected] wrote:
|
Or set it to None, equally. If you think this or similar is a worthwhile On 31 December 2014 at 10:47, Joel Nothman [email protected] wrote:
|
we could add an option for not storing |
Have sent a PR #4037 for the above.
I feel this needs to be documented for all modules commonly under model persistence...? |
What would you document for all models? Most attributes are necessary for prediction and it is unclear from the documentation which are needed and which not. |
Ah.... So respective doc needs to be updated with that information? |
No need to go overboard. If there's a real concern and we know users will On 31 December 2014 at 14:07, ragv [email protected] wrote:
|
Do people agree that the documentation of the attribute should be amended to include max_features? The docs suggest that only min/max_df influence it.
In general, I'm curious about the use case for this attribute. It essentially tells you about the features extracted from training data that you didn't use. In the context of transforming unseen documents, what's the benefit of knowing what you're not going to use? Or is the use case more in the training context where you might want to do some analysis on the features that were discarded? |
It's not especially relevant to transforming documents, but allows you to perform manual inspection to determine if the parameters are reasonable, or to explain a failure to model the data well. And yes, it should be modified to refer to dependence on |
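To make the inspection use case concrete, here is a rough, standard-library-only sketch. It is a simplified stand-in and not scikit-learn's code: the function name split_vocabulary, the whitespace tokenization, and the tie-breaking rule are all assumptions made for illustration. It shows how min_df and max_features cut-offs split the observed terms into a kept vocabulary and a discarded set, which is the role stop_words_ plays.

```python
from collections import Counter


def split_vocabulary(docs, min_df=1, max_features=None):
    """Simplified stand-in for vocabulary pruning (not scikit-learn's code).

    Returns (kept, discarded): the terms that survive the cut-offs and
    the terms pruned by min_df or max_features.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Total term frequency, used to rank terms for max_features.
    term_freq = Counter(term for tokens in tokenized for term in tokens)

    kept = {term for term, df in doc_freq.items() if df >= min_df}
    if max_features is not None and len(kept) > max_features:
        # Keep the most frequent terms; ties are broken alphabetically
        # so that the result is deterministic (an assumption of this sketch).
        ranked = sorted(kept, key=lambda term: (-term_freq[term], term))
        kept = set(ranked[:max_features])
    discarded = set(doc_freq) - kept
    return kept, discarded


docs = ["the cat sat", "the dog sat", "the bird flew"]
kept, discarded = split_vocabulary(docs, min_df=1, max_features=2)
print(sorted(kept))
print(sorted(discarded))
```

If the discarded set contains terms you expected to matter, that is a hint that the cut-offs are too aggressive for the collection.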
Addressed in #4042 |
I have a feature extractor that uses max_features. With a reasonably large collection, serializing the feature extractor can lead to unnecessarily large files because of the attribute stop_words_ [1]. It contains the features that don't make it into the vocabulary (for which max_features is a possible reason).

As far as I can see, it's never used anywhere, so I was wondering if I could get rid of it (i.e. set it to None) before serialization. This shouldn't change the feature extractor's behavior after serialization.

[1] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L804