[MRG+1] Test to make sure deletion of stop_words_ does not affect transformation. #4037


Merged

Conversation

raghavrv
Member

Addresses #4032

  • Tests to make sure excluding stop_words_ does not affect transforming.

@jnothman @amueller Please take a look...

@jnothman
Member

I'm -1 on this feature. The additional parameter just adds noise to what is already a huge and confusing parameter list on CountVectorizer, let alone TfidfVectorizer (which I think you've not yet modified). We similarly don't have a parameter to automatically run sparsify on a linear model. The only case I can think of where this provides a great benefit over deleting the attribute manually is where the entire model is cached. But if the user really cares that much, they can easily extend the class via inheritance.

@raghavrv
Member Author

So these tests would suffice for now?

@@ -857,6 +874,27 @@ def test_pickling_vectorizer():
assert_array_equal(
copy.fit_transform(JUNK_FOOD_DOCS).toarray(),
orig.fit_transform(JUNK_FOOD_DOCS).toarray())

# Ensure that deleting the stop_words_ attribute doesn't affect pickling
Member

Shouldn't really need a pickle in there. Just need to test that deletion doesn't affect transformation.
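Such a test could look roughly like this (a minimal sketch, assuming scikit-learn's CountVectorizer; the `docs` corpus here is invented for illustration, whereas the PR's actual test uses JUNK_FOOD_DOCS):

```python
from numpy.testing import assert_array_equal
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus for illustration only.
docs = ["the pizza burger beer copyright", "the coke copyright burger"]

vect = CountVectorizer(stop_words="english")
vect.fit(docs)
expected = vect.transform(docs).toarray()

# Deleting the fitted stop_words_ attribute should not change transform(),
# since transform() relies only on vocabulary_.
del vect.stop_words_
assert_array_equal(vect.transform(docs).toarray(), expected)
```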

Member

If we recommend to do so to users, that is.

Member Author

Okay! Also, I just realized that transform calls fit_transform, which is the method that sets the stop_words_ attribute... so do we need a test at all?

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch 2 times, most recently from 791366c to b9a2a54 Compare December 31, 2014 09:56
@raghavrv
Member Author

@jnothman Does this look okay...?

Also I would like to note that transform in both TfidfVectorizer and CountVectorizer calls fit_transform, which is the method that sets the stop_words_ attribute. Hence I am not sure this test is necessary... (i.e. even if stop_words_ is deleted (or set to None), it is restored when calling transform). Please take a look...
If you find it unnecessary, feel free to close this PR (and also sorry for the additional noise from raising this PR without investigating properly :p)

@raghavrv raghavrv changed the title [MRG] Make storing stop words optional [MRG] Test to make sure deletion of stop_words_ does not affect transformation. Dec 31, 2014
@jnothman
Member

Also I would like to note that transform in both TfidfVectorizer and CountVectorizer calls fit_transform which is the method that sets the stop_words_ attributes

Not as far as I can see!

@raghavrv
Member Author

Ah, sorry for the blunder, it was fit which called fit_transform, not transform... :p

@raghavrv
Member Author

sorry for the confusion :)

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch 3 times, most recently from b2e899a to a8fa121 Compare January 4, 2015 00:21
@@ -857,6 +860,19 @@ def test_pickling_vectorizer():
assert_array_equal(
copy.fit_transform(JUNK_FOOD_DOCS).toarray(),
orig.fit_transform(JUNK_FOOD_DOCS).toarray())

# Ensure that deleting the stop_words_ attribute doesn't affect transform
Member

Just make a separate test. This is no longer about pickling, and the current approach makes the above loop through instances more confusing.

@jnothman
Member

jnothman commented Jan 7, 2015

Please add a Notes section to the {Count,Tfidf}Vectorizer where you mention that stop_words_ can get large, but can be removed before pickling.

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch from a8fa121 to 7c02fe0 Compare January 7, 2015 14:17
@raghavrv
Member Author

raghavrv commented Jan 7, 2015

@jnothman Done! BTW, why doesn't the Notes section show up in the HTML?

@amueller
Member

amueller commented Jan 7, 2015

+0 on adding the tests....

The notes should show up in the HTML; it's probably a malformed docstring.

@raghavrv
Member Author

raghavrv commented Jan 7, 2015

@amueller Do you feel the test is unnecessary?

@amueller
Member

amueller commented Jan 7, 2015

It is not even documented that you can remove it, right? I would rather document it than test for it. Or maybe document and also test. Then the test would ensure the documentation is correct.

@raghavrv
Member Author

raghavrv commented Jan 7, 2015

@amueller Does this change look okay?


Notes
-----
The ``stop_words_`` attribute can get large, which may not be desirable when
Member

I would document it in the attribute documentation, saying "only for introspection and can safely be removed before pickling" or something like that.

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch 3 times, most recently from 75ef655 to 427fb5f Compare January 7, 2015 21:19
@raghavrv raghavrv changed the title [MRG] Test to make sure deletion of stop_words_ does not affect transformation. [MRG+1] Test to make sure deletion of stop_words_ does not affect transformation. Jan 7, 2015
@jnothman
Member

jnothman commented Jan 7, 2015

I thought documenting it at the model level made more sense because users aren't going to go looking for things that can be removed. They're going to ask "why is this model so big?". But I'm not too concerned either way.

@amueller
Member

amueller commented Jan 7, 2015

I don't have a strong opinion either way, sorry if I created work.

@raghavrv
Member Author

raghavrv commented Jan 7, 2015

Should I revert to the old form (Notes)?

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch from 427fb5f to 0875593 Compare January 11, 2015 10:28

This attribute is provided only for introspection and can be safely
removed using delattr or set to None before pickling.
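The documented workflow might look like this (a hedged sketch, assuming scikit-learn is installed; the corpus and variable names are invented for illustration):

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus for illustration only.
docs = ["the quick brown fox", "a lazy dog and the fox"]

vect = CountVectorizer(stop_words="english").fit(docs)
original = vect.transform(docs).toarray()

# Shrink the model before serialization; stop_words_ is only for
# introspection and is not used by transform().
vect.stop_words_ = None  # equivalently: del vect.stop_words_

restored = pickle.loads(pickle.dumps(vect))
assert (restored.transform(docs).toarray() == original).all()
```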

Member Author

@jnothman Should I revert this to Notes?

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch 2 times, most recently from 070f0b1 to 416fea3 Compare January 11, 2015 11:43
@raghavrv
Member Author

I'll squash the 2nd and 3rd commits once the 3rd is approved!

@raghavrv
Member Author

Do you think the tests are moot? Perhaps we could have the documentation alone?

(In which case I'll delete the 1st commit and squash 2 and 3?)

@amueller
Member

As I said, I don't have a strong opinion.

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch from 5c81e60 to 155d575 Compare January 14, 2015 20:49
@raghavrv
Member Author

@jnothman your view on whether the test feels unnecessary?

@jnothman
Member

I think that if it's documented, the test is worthwhile. Don't promise your users something without checking you can fulfil it.


@raghavrv
Member Author

Okay! Thanks for the quick reply! :)

@amueller
Member

OK, then let's merge.

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch from 155d575 to f71248a Compare January 16, 2015 18:19
@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch 2 times, most recently from 70c7435 to 43ad91a Compare January 28, 2015 19:16
@raghavrv
Member Author

raghavrv commented Feb 1, 2015

@jnothman Could this be closed/merged? (If it's good to merge, please let me know and I'll squash all the commits into one.) I squashed it anyway...

@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch 2 times, most recently from ff36cb5 to cd8bd93 Compare February 1, 2015 18:58
DOC Add a line to {Count, Tfidf}Vectorizer about removal of stop_words_
DOC Add documentation of stop_words_ attr in TfidfVectorizer
@raghavrv raghavrv force-pushed the make_storing_stop_words_optional branch from cd8bd93 to 21369dd Compare February 1, 2015 20:50
@coveralls

Coverage Status

Coverage increased (+0.0%) to 94.79% when pulling 21369dd on ragv:make_storing_stop_words_optional into 94157fa on scikit-learn:master.

jnothman added a commit that referenced this pull request Feb 1, 2015
[MRG] Test to make sure deletion of `stop_words_` does not affect transformation.
@jnothman jnothman merged commit eb52bb3 into scikit-learn:master Feb 1, 2015
@jnothman
Member

jnothman commented Feb 1, 2015

Thanks @ragv
