[MRG+1] API make sure vectorizers read data from file before analyzing #13641

adrinjalali · 2019-04-14T21:20:41Z

If the given analyzer is a callable, it seems reasonable to assume that when input='file' or input='filename', the data should first be read from the file and then passed to the analyzer, the same way as it's done for non-callable analyzers.

This PR clarifies this in the docstrings, and passes the "decoded" input to the analyzer. The bytes vs. str distinction in the input should be less of a concern now that we no longer support Python 2.

I'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.

After this PR, the following would result in a FileNotFoundError exception:

cv = CountVectorizer(analyzer=lambda x: x.split(), input='filename')
cv.fit(['hello world']).vocabulary_
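
As a rough illustration of the ordering described above, here is a minimal stand-alone sketch in plain Python. It is not scikit-learn's actual code: the helper names `read_document`, `analyze` and `tokens_from_temp_file` are hypothetical, and the sketch only assumes that 'filename' inputs are opened, 'file' inputs are read, and bytes are decoded before the custom analyzer sees the text.

```python
import io
import os
import tempfile


def read_document(doc, input_mode="content", encoding="utf-8"):
    """Hypothetical helper: return the text of ``doc`` for the given input mode."""
    if input_mode == "filename":
        # A string that is not an existing path fails here with FileNotFoundError.
        with open(doc, "rb") as fh:
            doc = fh.read()
    elif input_mode == "file":
        doc = doc.read()
    if isinstance(doc, bytes):
        doc = doc.decode(encoding)
    return doc


def analyze(doc, analyzer, input_mode="content"):
    """Read the data first, then pass the decoded text to the custom analyzer."""
    return analyzer(read_document(doc, input_mode))


def tokens_from_temp_file(text):
    """Write ``text`` to a temporary file and analyze it by file name."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        return analyze(path, str.split, input_mode="filename")
    finally:
        os.remove(path)


# A path inside a fresh, empty directory is guaranteed not to exist.
missing_path = os.path.join(tempfile.mkdtemp(), "hello world")
```

In this sketch the analyzer (`str.split`) only ever receives text, so calling `analyze(missing_path, str.split, input_mode="filename")` fails while opening the file, before the analyzer runs.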

jnothman

Do you think we need a deprecation period before changing this? Previously there was no benefit in specifying input with a custom analyzer, but users still might have done it??? Or do we not care if they relied on buggy (not as documented) behaviour? Otherwise I think this is good.

adrinjalali · 2019-04-15T09:06:59Z

I think regarding custom analyzers there are two issues here which we should separate from one another:

1. The user sets input='file', but passes a string.
2. The user expects the file or the filename in their custom analyzer.

To fix the first one, we can simply check if the input is a file object, or a valid existing file name, and not do more.
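
A check of that kind could look roughly like the sketch below. The function name `matches_input_mode` is hypothetical and nothing like it is claimed to exist in scikit-learn; it only assumes that a "file object" is anything with a `read` attribute and that a valid file name is a path for which `os.path.isfile` returns true.

```python
import io
import os
import tempfile


def matches_input_mode(doc, input_mode):
    """Hypothetical check: does ``doc`` look like what ``input_mode`` promises?"""
    if input_mode == "file":
        # Duck-typed file object: anything exposing .read()
        return hasattr(doc, "read")
    if input_mode == "filename":
        return isinstance(doc, (str, bytes, os.PathLike)) and os.path.isfile(doc)
    return True  # input='content': nothing to verify


# A raw string passed with input='file' is rejected.
string_as_file = matches_input_mode("hello world", "file")
# A real file object is accepted.
stream_as_file = matches_input_mode(io.StringIO("hello world"), "file")
# An existing temporary file is accepted as a file name.
with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
    existing_name = matches_input_mode(tmp.name, "filename")
# A path inside a fresh, empty directory does not exist and is rejected.
missing_name = matches_input_mode(
    os.path.join(tempfile.mkdtemp(), "missing.txt"), "filename"
)
```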

For the second issue, this PR changes the behavior, and I don't see a backward-compatible way of doing it with a deprecation cycle and without an extra parameter.

I personally think this PR brings the handling of a custom analyzer to what people would expect(?), but I also understand the other side saying this is a change in behavior and not merely fixing a bug.

jnothman · 2019-04-15T09:38:41Z

I think this is a bug fix too. Is there a way we can issue a ChangedBehaviorWarning in appropriate circumstances (at least if the analyzer breaks)?

adrinjalali · 2019-04-15T11:50:47Z

> I think this is a bug fix too. Is there a way we can issue a ChangedBehaviorWarning in appropriate circumstances (at least if the analyzer breaks)?

I think there are some edge cases where we may not detect the issue, but this probably covers most cases. Of course if the given custom analyzer actually catches those exceptions, we can't do much.
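
One way to picture such a detection step is the stand-alone sketch below. It is an illustration under assumptions, not the code in this PR: `probe_analyzer` is a hypothetical name, `ChangedBehaviorWarning` is a local stand-in class defined here, and the heuristic simply calls the analyzer with a path that does not exist and looks at which exception, if any, comes back.

```python
import os
import tempfile
import warnings


class ChangedBehaviorWarning(UserWarning):
    """Local stand-in for the warning class mentioned in the thread."""


def probe_analyzer(analyzer):
    """Return True if ``analyzer`` seems to expect a file name or file object."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Nothing was created in tmpdir, so this path cannot exist.
        missing = os.path.join(tmpdir, "does-not-exist.txt")
        try:
            analyzer(missing)
        except FileNotFoundError:
            legacy = True  # the analyzer tried to open the path itself
        except AttributeError as exc:
            # e.g. "'str' object has no attribute 'read'"
            legacy = "'read'" in str(exc)
        except Exception:
            legacy = False  # unrelated failure, nothing to conclude
        else:
            legacy = False
    if legacy:
        warnings.warn(
            "The analyzer now receives the data, not a file name or file object.",
            ChangedBehaviorWarning,
        )
    return legacy


def swallowing(path):
    """An analyzer that hides its own I/O errors; the probe cannot see it."""
    try:
        with open(path) as fh:
            return fh.read().split()
    except OSError:
        return []


candidates = {
    "opens_path": lambda p: open(p).read().split(),
    "reads_file": lambda f: f.read().split(),
    "plain_text": str.split,
    "swallowing": swallowing,
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = {name: probe_analyzer(fn) for name, fn in candidates.items()}
```

The `swallowing` analyzer is the edge case mentioned above: because it catches the exception itself, the probe reports nothing for it.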

jnothman

Otherwise lgtm. What's new?

sklearn/feature_extraction/text.py

adrinjalali · 2019-04-15T14:22:20Z

> I think the warning also needs to be raised if input='file' and the analyzer is passed a file which it tries to treat as a string

I think with this PR, the analyzer would never be given a file object. I'm not sure if I understand you correctly here.

jnothman

You're right. It won't be given the file object. Tired. Thankful you're tackling these ancient issues. Probably not in a state to be reviewing!

jnothman

Okay, I'm fine with this.

What's new? Emphasise that it may break current usage?

sklearn/feature_extraction/text.py

sklearn/feature_extraction/tests/test_text.py

sklearn/feature_extraction/text.py

rth · 2019-04-17T15:44:06Z

sklearn/feature_extraction/text.py

+               "and not the file names or the file objects. This warning "
+               "will be removed in v0.23.")
+        try:
+            self.analyzer(fname)


If input="file" why are we checking that analyser works with filenames? fname is a string, right?

We're just making sure that the analyzer neither tries to read from a file object nor tries to open the non-existent file.

glemaitre

A style change in the tests, which should solve the issue pointed out by @rth.
Otherwise LGTM.

doc/whats_new/v0.21.rst

sklearn/feature_extraction/text.py

sklearn/feature_extraction/tests/test_text.py

NicolasHug

looks like comments have been addressed, lgtm

Requests have been fulfilled

jnothman · 2019-04-23T03:50:31Z

Yay! Thanks @adrinjalali for digging this one up :)

make sure vectorizers read data from file before analyzing

a0341a8

jnothman reviewed Apr 15, 2019

raise a ChangedBehaviorWarning when appropriate

a1de2ae

adrinjalali added 2 commits April 15, 2019 13:52

pep8

d923b21

improve coverage

1c37556

jnothman approved these changes Apr 15, 2019

add version

22e3f04

jnothman requested changes Apr 15, 2019

validate only if input is not content

8a7ade9

jnothman reviewed Apr 15, 2019

jnothman self-requested a review April 15, 2019 14:33

jnothman approved these changes Apr 16, 2019

jnothman and others added 4 commits April 16, 2019 09:38

Update sklearn/feature_extraction/text.py

c655e04

Co-Authored-By: adrinjalali <[email protected]>

whats_new

2ab5465

modify the warning message and hint the removal version

7e857cf

Merge branch 'countvectorizer/fileinput' of github.com:adrinjalali/sc…

5c0b9b1

…ikit-learn into countvectorizer/fileinput

adrinjalali added this to the 0.21 milestone Apr 16, 2019

adrinjalali changed the title ~~make sure vectorizers read data from file before analyzing~~ [MRG+1] API make sure vectorizers read data from file before analyzing Apr 16, 2019

improve coverage

10f91f3

rth reviewed Apr 17, 2019

adrinjalali added 2 commits April 18, 2019 16:51

apply comments

89f1290

pep8

dfc5578

glemaitre self-requested a review April 19, 2019 14:54

glemaitre previously requested changes Apr 19, 2019

adrinjalali added 2 commits April 22, 2019 10:28

fix whats_new, add to changed models

028af52

Merge remote-tracking branch 'upstream/master' into countvectorizer/f…

70b01f3

…ileinput

adrinjalali added 2 commits April 22, 2019 10:39

fix tests

05d0f60

add versionchanged

dbe2bff

NicolasHug approved these changes Apr 22, 2019

adrinjalali added 2 commits April 22, 2019 21:52

fix test

77ee198

Merge remote-tracking branch 'upstream/master' into countvectorizer/f…

0d540f3

…ileinput

jnothman merged commit 70fd42e into scikit-learn:master Apr 23, 2019

adrinjalali deleted the countvectorizer/fileinput branch April 23, 2019 06:52

adrinjalali mentioned this pull request Apr 23, 2019

DOC minor fix to whats_new #13695

Merged

jnothman mentioned this pull request Apr 23, 2019

DOC Describe what's new categories #13697

Merged

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019

FIX make sure vectorizers read data from file before analyzing (sciki…

7005ab4

…t-learn#13641)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

FIX make sure vectorizers read data from file before analyzing (sciki…

9010d9f

…t-learn#13641)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX make sure vectorizers read data from file before analyzing (

2621d67

scikit-learn#13641)" This reverts commit 9010d9f.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX make sure vectorizers read data from file before analyzing (

9470ebd

scikit-learn#13641)" This reverts commit 9010d9f.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

FIX make sure vectorizers read data from file before analyzing (sciki…

c2e3f80

…t-learn#13641)

NicolasHug mentioned this pull request Dec 11, 2019

[MRG] More deprecations for 0.23 #15860

Merged

adrinjalali mentioned this pull request Jan 13, 2020

MNT remove check for deprecated behavior in test.py #16109

Merged
