[MRG+1] API make sure vectorizers read data from file before analyzing #13641

Merged: 19 commits, Apr 23, 2019

Conversation

adrinjalali
Member

Fixes #5482

If the given analyzer is a callable, it seems reasonable to assume that when input='file' or input='filename', the data should first be read from the file and then passed to the analyzer, the same way it is done for non-callable analyzers.

This PR clarifies this in the docstrings, and passes the "decoded" input to the analyzer. The bytes vs str distinction for the input should be less of a concern now that we don't support Python 2 anymore.

I'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.

After this PR, the following would result in a FileNotFoundError exception:

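from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')
cv.fit(['hello world']).vocabulary_  # 'hello world' is not an existing file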
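To make the proposal concrete, here is a minimal sketch of the "read first, then analyze" order in plain Python. It is not scikit-learn's code; `read_document`, `analyze`, and `whitespace_analyzer` are hypothetical names used only for illustration.

```python
import io
import os
import tempfile


def read_document(doc, input_kind="content", encoding="utf-8"):
    """Return the text of `doc` according to `input_kind` (hypothetical helper)."""
    if input_kind == "filename":
        # `doc` is a path: read the raw bytes from disk.
        with open(doc, "rb") as fh:
            doc = fh.read()
    elif input_kind == "file":
        # `doc` is a file object: read whatever it holds.
        doc = doc.read()
    if isinstance(doc, bytes):
        doc = doc.decode(encoding)
    return doc


def analyze(doc, analyzer, input_kind="content"):
    """Read and decode the document first, then hand the text to the analyzer."""
    return analyzer(read_document(doc, input_kind))


def whitespace_analyzer(text):
    return text.split()


# input='filename': the analyzer sees the file's contents, not its path.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello world")
tokens_from_filename = analyze(tmp.name, whitespace_analyzer, "filename")
os.remove(tmp.name)

# input='file': the analyzer sees the text read from the file object.
tokens_from_file = analyze(io.StringIO("hello world"), whitespace_analyzer, "file")
```

With this ordering, passing the string 'hello world' together with input='filename' fails in `open`, which matches the FileNotFoundError mentioned above.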

Member

@jnothman jnothman left a comment

Do you think we need a deprecation period before changing this? Previously there was no benefit in specifying input with a custom analyzer, but users still might have done it??? Or do we not care if they relied on buggy (not as documented) behaviour? Otherwise I think this is good.

@adrinjalali
Member Author

I think regarding custom analyzers there are two issues here which we should separate from one another:

  • The user sets input='file', but passes a string
  • The user expects the file or the filename in their custom analyzer

To fix the first one, we can simply check if the input is a file object, or a valid existing file name, and not do more.
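A rough sketch of what such a check could look like, assuming a file object is anything with a `read` method. `check_input` is a hypothetical helper, not something this PR adds.

```python
import io
import os


def check_input(doc, input_kind):
    """Hypothetical check: does `doc` match the declared `input_kind`?"""
    if input_kind == "file":
        # A file object is taken to be anything exposing a read() method.
        if not hasattr(doc, "read"):
            raise ValueError(
                "input='file' expects a file object, got %r" % type(doc).__name__
            )
    elif input_kind == "filename":
        # A file name must point at an existing file.
        is_path = isinstance(doc, (str, bytes, os.PathLike))
        if not (is_path and os.path.isfile(doc)):
            raise ValueError(
                "input='filename' expects an existing file name, got %r" % (doc,)
            )
    return doc


# A plain string passed with input='file' is rejected up front.
try:
    check_input("hello world", "file")
    rejected = False
except ValueError:
    rejected = True

# A real file object passes through unchanged.
buffer = io.StringIO("hello world")
accepted = check_input(buffer, "file") is buffer
```

This only covers the first issue: it says nothing about what the custom analyzer itself expects to receive.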

For the second issue, this PR is changing the behavior and I don't see a backward compatible way w/ deprecation cycle of doing it w/o an extra parameter.

I personally think this PR brings the handling of a custom analyzer to what people would expect(?), but I also understand the other side saying this is a change in behavior and not merely fixing a bug.

@jnothman
Member

jnothman commented Apr 15, 2019 via email

@adrinjalali
Member Author

> I think this is a bug fix too. Is there a way we can issue a ChangedBehaviorWarning in appropriate circumstances (at least if the analyzer breaks)?

I think there are some edge cases where we may not detect the issue, but this probably covers most cases. Of course if the given custom analyzer actually catches those exceptions, we can't do much.

Member

@jnothman jnothman left a comment

Otherwise lgtm. What's new?

@adrinjalali
Member Author

> I think the warning also needs to be raised if input='file' and the analyzer is passed a file which it tries to treat as a string

I think with this PR, the analyzer would never be given a file object. I'm not sure if I understand you correctly here.

Member

@jnothman jnothman left a comment

You're right. It won't be given the file object. Tired. Thankful you're tackling these ancient issues. Probably not in a state to be reviewing!

@jnothman jnothman self-requested a review April 15, 2019 14:33
Member

@jnothman jnothman left a comment

Okay, I'm fine with this.

What's new? Emphasise that it may break current usage?

@adrinjalali adrinjalali added this to the 0.21 milestone Apr 16, 2019
@adrinjalali adrinjalali changed the title make sure vectorizers read data from file before analyzing [MRG+1] API make sure vectorizers read data from file before analyzing Apr 16, 2019
"and not the file names or the file objects. This warning "
"will be removed in v0.23.")
try:
    self.analyzer(fname)
Member

If input="file" why are we checking that analyser works with filenames? fname is a string, right?

Member Author

We're just making sure that the analyzer neither tries to read from a file object nor tries to open a non-existing file.
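For readers following along, the idea behind that probe can be sketched roughly as below. This is an illustration under assumptions, not the merged code: `warn_if_analyzer_expects_file` is a hypothetical name, the message is paraphrased, and a plain UserWarning stands in for the warning class discussed in this thread.

```python
import tempfile
import warnings


def warn_if_analyzer_expects_file(analyzer):
    """Return True (and warn) if `analyzer` seems to expect a path or file object."""
    # Create and immediately delete a temp file to get a path that does not exist.
    with tempfile.NamedTemporaryFile() as tmp:
        fname = tmp.name

    msg = ("The custom analyzer now receives the decoded data, "
           "not file names or file objects.")
    try:
        analyzer(fname)
    except FileNotFoundError:
        # The analyzer tried to open its argument as a path.
        warnings.warn(msg, UserWarning)
        return True
    except AttributeError as exc:
        # The analyzer called .read() on what is now a plain string.
        if "has no attribute 'read'" in str(exc):
            warnings.warn(msg, UserWarning)
            return True
    except Exception:
        # Any other failure is unrelated to this check.
        pass
    return False


# An analyzer that works on text is not flagged.
text_analyzer_flagged = warn_if_analyzer_expects_file(lambda text: text.split())
```

As noted earlier in the thread, an analyzer that catches these exceptions itself would slip past a probe like this.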

@glemaitre glemaitre self-requested a review April 19, 2019 14:54
Member

@glemaitre glemaitre left a comment

A style change in the tests should solve the issue pointed out by @rth.
Otherwise LGTM.

Member

@NicolasHug NicolasHug left a comment

looks like comments have been addressed, lgtm

@jnothman jnothman dismissed glemaitre’s stale review April 23, 2019 03:48

Requests have been fulfilled

@jnothman jnothman merged commit 70fd42e into scikit-learn:master Apr 23, 2019
@jnothman
Member

Yay! Thanks @adrinjalali for digging this one up :)

@adrinjalali adrinjalali deleted the countvectorizer/fileinput branch April 23, 2019 06:52
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues.

CountVectorizer with custom analyzer ignores input argument
5 participants