[MRG+1] API make sure vectorizers read data from file before analyzing #13641
@@ -31,6 +31,7 @@
 from ..utils.validation import check_is_fitted, check_array, FLOAT_DTYPES
 from ..utils import _IS_32BIT
 from ..utils.fixes import _astype_copy_false
+from ..exceptions import ChangedBehaviorWarning


 __all__ = ['HashingVectorizer',
@@ -304,10 +305,34 @@ def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
             self._stop_words_id = id(self.stop_words)
             return 'error'

+    def _validate_custom_analyzer(self):
+        # This is to check if the given custom analyzer expects file or a
+        # filename instead of data.
+        # Behavior changed in v0.21, function could be removed in v0.23
+        import tempfile
+        with tempfile.NamedTemporaryFile() as f:
+            fname = f.name
+        # now we're sure fname doesn't exist
+
+        msg = ("Since v0.21, vectorizers pass the data to the custom analyzer "
+               "and not the file names or the file objects. This warning "
+               "will be removed in v0.23.")
+        try:
+            self.analyzer(fname)
Review comment: If [...]
Reply: we're just making sure that the analyzer neither tries to read from a file object nor tries to open the non-existing file.
+        except FileNotFoundError:
jnothman marked this conversation as resolved.
+            warnings.warn(msg, ChangedBehaviorWarning)
+        except AttributeError as e:
+            if str(e) == "'str' object has no attribute 'read'":
+                warnings.warn(msg, ChangedBehaviorWarning)
+        except Exception:
+            pass
jnothman marked this conversation as resolved.
+
     def build_analyzer(self):
         """Return a callable that handles preprocessing and tokenization"""
         if callable(self.analyzer):
-            return self.analyzer
+            if self.input in ['file', 'filename']:
+                self._validate_custom_analyzer()
+            return lambda doc: self.analyzer(self.decode(doc))

         preprocess = self.build_preprocessor()

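The probing logic in `_validate_custom_analyzer` can be tried outside the vectorizer. The following is a minimal standalone sketch, not the library's code: `probe_custom_analyzer`, `warns`, the three example analyzers, and the local `ChangedBehaviorWarning` class are hypothetical stand-ins introduced here for illustration. It calls the analyzer with a path that is known not to exist and treats a `FileNotFoundError`, or an `AttributeError` about a missing `read` attribute, as a sign that the analyzer still expects a filename or a file object.

```python
import tempfile
import warnings


class ChangedBehaviorWarning(UserWarning):
    """Local stand-in for sklearn.exceptions.ChangedBehaviorWarning (assumption)."""


def probe_custom_analyzer(analyzer):
    """Warn if `analyzer` appears to expect a filename or a file object."""
    # The temporary file is deleted when the context manager exits, so
    # `fname` points to a path that no longer exists.
    with tempfile.NamedTemporaryFile() as f:
        fname = f.name

    msg = ("Since v0.21, vectorizers pass the data to the custom analyzer "
           "and not the file names or the file objects. This warning "
           "will be removed in v0.23.")
    try:
        analyzer(fname)
    except FileNotFoundError:
        # The analyzer tried to open the path itself.
        warnings.warn(msg, ChangedBehaviorWarning)
    except AttributeError as e:
        # The analyzer tried to call .read() on what is now a plain string.
        if str(e) == "'str' object has no attribute 'read'":
            warnings.warn(msg, ChangedBehaviorWarning)
    except Exception:
        # Any other failure is unrelated to the behavior change.
        pass


def warns(analyzer):
    """Return True if probing `analyzer` emits ChangedBehaviorWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        probe_custom_analyzer(analyzer)
    return any(issubclass(w.category, ChangedBehaviorWarning) for w in caught)


def filename_analyzer(path):
    # Old-style analyzer: opens the file itself.
    with open(path) as fh:
        return fh.read().split()


def file_analyzer(fobj):
    # Old-style analyzer: reads from a file object.
    return fobj.read().split()


def data_analyzer(text):
    # New-style analyzer: works on the already decoded text.
    return text.split()
```

Under these assumptions, `warns(filename_analyzer)` and `warns(file_analyzer)` are true, while `warns(data_analyzer)` is false because splitting the path string raises nothing.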
@@ -490,6 +515,11 @@ class HashingVectorizer(BaseEstimator, VectorizerMixin, TransformerMixin):
         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.

+        .. versionchanged:: 0.21
+        Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
+        first read from the file and then passed to the given callable
+        analyzer.
+
adrinjalali marked this conversation as resolved.
     n_features : integer, default=(2 ** 20)
         The number of features (columns) in the output matrices. Small numbers
         of features are likely to cause hash collisions, but large numbers
@@ -745,6 +775,11 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.

+        .. versionchanged:: 0.21
+        Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
+        first read from the file and then passed to the given callable
+        analyzer.
+
     max_df : float in range [0.0, 1.0] or int, default=1.0
         When building the vocabulary ignore terms that have a document
         frequency strictly higher than the given threshold (corpus-specific
@@ -1369,6 +1404,11 @@ class TfidfVectorizer(CountVectorizer):
         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.

+        .. versionchanged:: 0.21
+        Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
+        first read from the file and then passed to the given callable
+        analyzer.
+
     stop_words : string {'english'}, list, or None (default=None)
         If a string, it is passed to _check_stop_list and the appropriate stop
         list is returned. 'english' is currently the only supported string
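The documented behavior (read the file first, then hand the text to the callable analyzer) can be illustrated with a small standard-library sketch. This is a simplified illustration, not scikit-learn's implementation: `decode`, `build_analyzer`, `whitespace_analyzer`, and `demo` are hypothetical free functions written for this example, and the real `decode` method also handles decoding errors.

```python
import os
import tempfile


def decode(doc, input_kind="content", encoding="utf-8"):
    """Simplified stand-in for the vectorizer's decode step (assumption)."""
    if input_kind == "filename":
        # `doc` is a path: read the raw bytes from disk.
        with open(doc, "rb") as fh:
            doc = fh.read()
    elif input_kind == "file":
        # `doc` is a file-like object: read its content.
        doc = doc.read()
    if isinstance(doc, bytes):
        doc = doc.decode(encoding)
    return doc


def build_analyzer(custom_analyzer, input_kind):
    """Wrap a custom analyzer so it always receives decoded text."""
    return lambda doc: custom_analyzer(decode(doc, input_kind))


def whitespace_analyzer(text):
    # A custom analyzer written for the v0.21 contract: it gets text,
    # never a filename or a file object.
    return text.lower().split()


def demo():
    """Write a small file, analyze it by filename, and return the tokens."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write("Hello vectorizer world")
        analyze = build_analyzer(whitespace_analyzer, "filename")
        return analyze(path)
    finally:
        os.remove(path)
```

With this wrapper the analyzer no longer needs to know whether the input came from a string, a path, or an open file; an analyzer that still opens the path itself would have to be rewritten to accept text.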