-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+1] API make sure vectorizers read data from file before analyzing #13641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] API make sure vectorizers read data from file before analyzing #13641
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we need a deprecation period before changing this? Previously there was no benefit in specifying input
with a custom analyzer, but users still might have done it??? Or do we not care if they relied on buggy (not as documented) behaviour? Otherwise I think this is good.
I think regarding custom analyzers there are two issues here which we should separate from one another:
To fix the first one, we can simply check if the input is a file object, or a valid existing file name, and not do more. For the second issue, this PR is changing the behavior and I don't see a backward compatible way w/ deprecation cycle of doing it w/o an extra parameter. I personally think this PR brings the handling of a custom analyzer to what people would expect(?), but I also understand the other side saying this is a change in behavior and not merely fixing a bug. |
I think this is a bug fix too. Is there a way we can issue a
ChangedBehaviorWarning in appropriate circumstances (at least if the
analyzer breaks)?
|
I think there are some edge cases where we may not detect the issue, but this probably covers most cases. Of course if the given custom analyzer actually catches those exceptions, we can't do much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm. What's new?
I think with this PR, the analyzer would never be given a file object. I'm not sure if I understand you correctly here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. If won't be given the file object. Tired. Thankful you're tackling these ancient issues. Probably not in a state to be reviewing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'm fine with this.
What's new? Emphasise that it may break current usage?
Co-Authored-By: adrinjalali <[email protected]>
…ikit-learn into countvectorizer/fileinput
"and not the file names or the file objects. This warning " | ||
"will be removed in v0.23.") | ||
try: | ||
self.analyzer(fname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If input="file"
why are we checking that analyser works with filenames? fname
is a string, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're just making sure that the analyzer neither tries to read from a file object, nor try to open the non-existing file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I style change in the tests which should solve the issue pointed out by @rth.
Otherwise LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like comments have been addressed, lgtm
Yay! Thanks @adrinjalali for digging this one up :) |
scikit-learn#13641)" This reverts commit 9010d9f.
scikit-learn#13641)" This reverts commit 9010d9f.
Fixes #5482
If the given analyzer is a calable, it seems reasonable to assume if
input='file'
orinput='filename'
, the data should be read from the file first, and then passed to the analyzer, the same way as it's done for non-callable analyzers.This PR clarifies this in the docstrings, and passes the "decoded" input to the analyzer. It should be less of a concern regarding the input on the bytes vs str since we don't support python2 anymore.
I'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.
After this PR, the following would result in a
FileNotFoundError
exception: