[MRG+2] ENH Passthrough DataFrame in FunctionTransformer #11043
@jnothman @amueller @rth I would like to have some feedback on this. One of my main concerns now is how to modify the documentation. Shall we modify some examples to show the use with pandas (dependency issue)? The parameter is not yet the default one either. WDYT?
if self._validate:
    if hasattr(X, 'loc') and self._validate == 'array-or-frame':
        if self.force_all_finite:
            _assert_all_finite(X.values, allow_nan=False
wasn't there some case when X.values is not the same as np.array(X)? Also, this is possibly expensive, as we're converting to numpy array and then discard it.
Also, this is possibly expensive, as we're converting to numpy array and then discard it.
It should be expensive memory wise, but apparently this is faster:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
0 10000 non-null int64
1 10000 non-null float64
0 10000 non-null int64
1 10000 non-null float64
0 10000 non-null int64
1 10000 non-null float64
0 10000 non-null int64
1 10000 non-null float64
0 10000 non-null int64
1 10000 non-null float64
dtypes: float64(5), int64(5)
memory usage: 781.3 KB
>>> %timeit df.isna().any().any()
1.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit _assert_all_finite(df.values, allow_nan=False)
163 µs ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
@jorisvandenbossche do you have any advice regarding the internals of pandas on how to check for inf and nan in a DataFrame most efficiently?
It is mainly the .any().any() that makes it slow. Once you have done .isna() you have a full frame of booleans, so getting the values then is cheap:
In [11]: %timeit df.isna().any().any()
1.87 ms ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [12]: %timeit df.isna()
411 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: %timeit df.isna().values.any()
431 µs ± 7.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Still not as fast as the _assert_all_finite though, but much closer.
wasn't there some case when X.values is not the same as np.array(X)?
Normally for a dataframe, values will always be an array. For series this is indeed not always true, but in the _assert_all_finite function np.asanyarray is still called, so this shouldn't matter.
Regarding asarray not working on frames, that was a bug with SparseSeries and perhaps SparseDataFrame.
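The trade-off in this exchange can be illustrated with plain NumPy. The sketch below is not scikit-learn's private `_assert_all_finite`; `has_non_finite` is a hypothetical helper that assumes the input can be viewed as a float array, and it only mirrors the `allow_nan` switch discussed above.

```python
import numpy as np


def has_non_finite(values, allow_nan=False):
    """Return True if ``values`` holds entries that should be rejected.

    Hypothetical helper for illustration only; it is not the private
    scikit-learn function discussed in this thread.
    """
    arr = np.asanyarray(values, dtype=float)
    if allow_nan:
        # NaN is tolerated, so only +/-inf counts as invalid.
        return bool(np.isinf(arr).any())
    # Neither NaN nor inf is tolerated.
    return not bool(np.isfinite(arr).all())


nan_rejected = has_non_finite([[1.0, np.nan]])
nan_tolerated = has_non_finite([[1.0, np.nan]], allow_nan=True)
inf_rejected = has_non_finite([[1.0, np.inf]], allow_nan=True)
```

For a DataFrame, passing `df.values` to such a helper incurs the array conversion cost mentioned in the comments above.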
- If 'allow-nan', accept only np.nan values in X. Values cannot be
  infinite.

Applied only when ``validate=True``.
As I understand, in the future the default for validate will be 'array-or-frame'. That means that this force_all_finite will not be used anymore once that change is made?
That seems a bit strange as a) that will be a change in behaviour and b) then this default in the documentation will be misleading.
That means that this force_all_finite will not be used anymore once that change is made
The aim is to keep a DataFrame as a DataFrame without converting it (i.e. without calling check_array). So with 'array-or-frame' this behavior is enforced, but I still check for finiteness in the array or frame.
But the documentation is wrong :)
- If True, then X will be converted to a 2-dimensional NumPy array or
  sparse matrix. If the conversion is not possible an exception is
  raised.
You left out the "or X contains NaN or infinity" part, I suppose because it is now controlled by force_all_finite?
I think it might still be useful to at least reference the other keyword, since it is also about "validation"
doc/whats_new/v0.20.rst
Outdated
- :class:`preprocessing.FunctionTransformer` is accepting pandas DataFrame in
  ``func`` without converting to a NumPy array when
  ``validate='array-or-frame``. :issue:`10655` by :user:`Guillaume Lemaitre
  <glemaitre>`.
Also add a note about the new force_all_finite keyword?
  sparse matrix. If the conversion is not possible an exception is
  raised.
- If False, then there is no input validation
- If 'array-or-frame', X will be pass-through if this is a pandas
pass-through -> passed through ?
if self.validate is None:
    self._validate = True
    warnings.warn("The default validate=True will be replaced by "
                  "validate='array-or-frame' in 0.22.", FutureWarning)
Should we let this depend on the type of X? E.g. only do this if the input is actually a dataframe? (Related to my other question whether the behaviour for arrays will change with regard to NaNs.)
We can only do it for DataFrame. The behavior for arrays will not change: we will enforce finiteness by default (no NaN and no infinity), which is the equivalent of validate=True.
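The deprecation pattern under review (a `None` sentinel that resolves to the old default while emitting a `FutureWarning`) can be sketched in isolation. `SketchTransformer` and `_resolve_validate` are hypothetical names used only for this illustration; only the warning text is taken from the diff above.

```python
import warnings


class SketchTransformer:
    """Hypothetical class showing a ``None`` sentinel for a changing default."""

    def __init__(self, validate=None):
        self.validate = validate

    def _resolve_validate(self):
        if self.validate is None:
            # The user did not choose: keep the old behaviour but warn.
            warnings.warn("The default validate=True will be replaced by "
                          "validate='array-or-frame' in 0.22.", FutureWarning)
            return True
        # An explicit choice is respected silently.
        return self.validate


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved_default = SketchTransformer()._resolve_validate()
    resolved_explicit = SketchTransformer(validate=False)._resolve_validate()

n_warnings = len(caught)
```

Only the call that relies on the default triggers the warning, so users who already pass `validate` explicitly see no change.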
@jorisvandenbossche Could you have another look?
- If 'allow-nan', accept only np.nan values in X. Values cannot be
  infinite.

This parameter is discarded when ``validate=False``.
"If validate is False, this has no effect" for more consistency with the accept_sparse
docstring above.
This parameter is discarded when ``validate=False``.

.. versionadded:: 0.20
   ``force_all_finite`` was added to let pass NaN.
The behavior is more complex than just letting NaN pass.
Maybe just keep .. versionadded:: 0.20 alone: the description of what it does is above.
    .format(self.force_all_finite))

if self._validate:
    if hasattr(X, 'loc') and self._validate == 'array-or-frame':
xarray.DataArray will pass here while it shouldn't.
Instead of indirectly detecting DataFrames by attribute names (here and elsewhere in scikit-learn) maybe it would make sense to make an actual isinstance check when possible? e.g.
def is_dataframe(obj):
    try:
        import pandas as pd
        return isinstance(obj, pd.DataFrame)
    except ImportError:
        return False
Actually that might allow people to pass an xarray DataArray and it should even work. The validation for NaN and inf currently works for those arrays. Then, we have the following solutions:
- Let the ducktyping and do not document that xarray is supported;
- Let the ducktyping and say that we can pass xarray.
- Make a strict isinstance.
@rth WDYT?
Interesting.
Let the ducktyping and do not document that xarray is supported;
Let's keep things as they are now, then. The second solution would require some more unit tests (and optionally a CI build with xarray) and since pandas support is only in progress, I don't think it would be reasonable to officially support yet another data format at this time.
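The difference between the two detection strategies debated here can be shown without pandas or xarray installed. In the sketch below, `FakeLabeledArray` is a hypothetical stand-in for any non-pandas container that happens to expose `.loc`, and `looks_like_frame` is an illustrative name for the attribute-based check; `is_dataframe` follows the suggestion quoted earlier in the thread.

```python
class FakeLabeledArray:
    """Hypothetical non-pandas container that exposes a ``loc`` attribute."""
    loc = None


def looks_like_frame(obj):
    # Duck-typed check, as in the diff: anything with ``.loc`` qualifies.
    return hasattr(obj, "loc")


def is_dataframe(obj):
    # Strict check: only a real pandas DataFrame qualifies, and the
    # function degrades gracefully when pandas is not installed.
    try:
        import pandas as pd
    except ImportError:
        return False
    return isinstance(obj, pd.DataFrame)


candidate = FakeLabeledArray()
duck_result = looks_like_frame(candidate)
strict_result = is_dataframe(candidate)
```

The duck-typed check accepts the stand-in object while the strict check rejects it, which is the behaviour difference the reviewers weigh against the cost of officially supporting another data format.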
``validate=True`` as default will be replaced by
``validate='array-or-frame'`` in 0.22.

.. versionadded:: 0.20
versionchanged
X = np.random.randn(100, 10)
transformer = FunctionTransformer()
transformer.set_params(**params)
with pytest.raises(ValueError, match=msg_err):
This is very nice and so much better than the approach proposed in the second example of the pytest docs, but it only works for pytest >= 3.1.0 (released less than a year ago); earlier versions ignore the match argument (tested with 3.0.7), producing false positives.
As long as CI uses an up-to-date pytest this should not be an issue, but if we use it, it would be worth specifying minimum pytest requirements in the dev documentation.
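One way to avoid depending on the `match` argument is to check the exception message explicitly. The helper below is a hypothetical, standard-library-only sketch (neither `assert_raises_with_message` nor `reject_negative` comes from pytest or scikit-learn); it fails loudly both when nothing is raised and when the message does not match.

```python
import re


def assert_raises_with_message(exc_type, pattern, func, *args, **kwargs):
    """Hypothetical helper: fail unless ``func`` raises ``exc_type`` with a
    message matching the regular expression ``pattern``."""
    try:
        func(*args, **kwargs)
    except exc_type as exc:
        if re.search(pattern, str(exc)) is None:
            raise AssertionError(
                "message {!r} does not match {!r}".format(str(exc), pattern))
        return
    raise AssertionError("{} was not raised".format(exc_type.__name__))


def reject_negative(x):
    # Toy function used only to trigger a ValueError with a known message.
    if x < 0:
        raise ValueError("x must be non-negative, got {}".format(x))
    return x


# Passes silently: the right exception with a matching message is raised.
assert_raises_with_message(ValueError, "non-negative", reject_negative, -1)
```

Because the message comparison is done by hand, a silently ignored keyword cannot turn a failing check into a false positive.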
Can we step back a moment and ask what practical benefit validate=True provides users? It means the underlying function will work regardless of whether input was list or array, but that is true for most numpy or pandas functions regardless. It helps the user with some sanity checking (finiteness, 2d) but maybe that's better handled with a warning, by validating the output perhaps. Should we just be changing validate's default to False?
I should admit that I have difficulty to follow your thoughts. If I understand well, Personally, I was thinking that it could be great to have:
In addition, when working on this PR, I was thinking that
Each time I used I still think As to the rest of @glemaitre's #11043 (comment) I don't really have a strong opinion about it.
@pytest.mark.parametrize(
    "is_dataframe",
    [True, False]
nitpick: I know that's the code style in pytest examples, but do you think it's really worth spreading something that can be written in 1 line over 4 lines? I'm not convinced it helps readability..
doc/whats_new/v0.20.rst
Outdated
@@ -260,6 +260,12 @@ Miscellaneous
  :issue:`9101` by :user:`alex-33 <alex-33>`
  and :user:`Maskani Filali Mohamed <maskani-moh>`.

- :class:`preprocessing.FunctionTransformer` is accepting pandas DataFrame in
"now accepts" ?
doc/whats_new/v0.20.rst
Outdated
@@ -572,6 +578,10 @@ Misc
  acts as an upper bound on iterations.
  :issue:`#10982` by :user:`Juliet Lawton <julietcl>`

- In :class:`preprocessing.FunctionTransformer`, the default of ``validate``
  will changed from ``True`` to ``'array-or-frame'`` in 0.22. :issue:`10655` by
"will be"
- If True, then X will be converted to a 2-dimensional NumPy array or
  sparse matrix. If the conversion is not possible an exception is
  raised.
- If False, then there is no input validation.
remove the "then" here and in the bullet point above
- If False, then there is no input validation.

When X is validated, the parameters ``accept_sparse`` and
``force_all_finite`` will control the validation for the sparsity and
"will control" -> "control"
self._validate = self.validate

if ((not isinstance(self._validate, bool)) and
        self._validate != 'array-or-frame'):
just _validate not in (True, False, 'array-or-frame')?
                 "'array-or-frame'. Got {!r} instead."
                 .format(self._validate))
if ((not isinstance(self.force_all_finite, bool)) and
        self.force_all_finite != 'allow-nan'):
Same: self.force_all_finite not in (True, False, 'allow-nan')
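One subtlety of the tuple-membership form suggested here is that `in` compares by equality, and in Python `1 == True` and `0 == False`. The sketch below contrasts the two styles; the function names are hypothetical and chosen only for this illustration.

```python
ALLOWED = (True, False, 'array-or-frame')


def is_valid_by_membership(value):
    # Equality-based: integers 1 and 0 compare equal to True and False.
    return value in ALLOWED


def is_valid_strict(value):
    # Type-based: only real booleans or the exact string are accepted.
    return isinstance(value, bool) or value == 'array-or-frame'


membership_accepts_one = is_valid_by_membership(1)
strict_accepts_one = is_valid_strict(1)
```

Whether accepting `1` and `0` matters is a judgment call; the shorter form is easier to read, while the `isinstance` form in the diff is stricter.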
if self.force_all_finite:
    _assert_all_finite(X.values, allow_nan=False
                       if self.force_all_finite is True
                       else True)
We could replace,
allow_nan = (False if self.force_all_finite is True else True)
with
allow_nan = not (self.force_all_finite is True)
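A quick check that the two spellings agree for the values that reach this branch (`True` and `'allow-nan'`, since a falsy `force_all_finite` skips the check entirely). The function names are hypothetical; `is not True` is simply an equivalent way of writing `not (... is True)`.

```python
def allow_nan_verbose(force_all_finite):
    # Form used in the diff under review.
    return False if force_all_finite is True else True


def allow_nan_compact(force_all_finite):
    # Equivalent to ``not (force_all_finite is True)``.
    return force_all_finite is not True


results = {value: (allow_nan_verbose(value), allow_nan_compact(value))
           for value in (True, 'allow-nan')}
```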
I addressed @rth's comments. @jnothman, could you give some insight regarding #11043 (comment)?
Yes, I like the proposed API, as long as the simple cases are straightforward for users to understand.
I feel that's exposing a lot of internals of check_array that I don't really want to maintain / keep backward compatible.
Where's DataArray from? Xray? Can we easily detect things implementing numpy protocols? DataFrame doesn't, right?
DataFrame exposes `__array__` but, for instance, does indexing differently, so I'm not sure what you mean by numpy protocol.
Ok the question is what should be the criterion. exposing
I don't think there is an easy way to duck type this.
Probably not, which is why I was a bit confused by @glemaitre's suggestion.
Would it not be possible to make something like:
if hasattr(X, 'loc'):
    if convert_to_array:
        array = np.asarray(X)
    else:
        return X
but it doesn't support iloc, so I don't know what that says for downstream support
Yep, but what happens later on in the function and how to handle the array is on the user, isn't it?
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
|
accept_sparse : boolean, optional
    Indicate that func accepts a sparse matrix as input. If validate is
    False, this has no effect. Otherwise, if accept_sparse is false,
    sparse matrix inputs will cause an exception to be raised.

force_all_finite : boolean or 'allow-nan', optional default=True
I would go for 'allow-nan' as the default? As we are going in that direction for other transformers as well?
+1, I also wouldn't directly expose that. To echo what @jnothman said above: is there actually a good reason we validate at all (by default)? I think it would make sense to have
I start to be convinced by
@jnothman I made the change. Is the what's new entry explicit enough regarding the conversion to array of
  sparse matrix. If the conversion is not possible an exception is
  raised.
- If False, then there is no input validation
- If 'array-or-frame', X will be pass-through if this is a pandas
"pass-through" -> "passed through unchanged"
How about "passed through unchanged if it is a 2d array, sparse matrix or DataFrame"
- If False, then there is no input validation
- If 'array-or-frame', X will be pass-through if this is a pandas
  DataFrame or converted to a 2-dimensional array or sparse matrix. In
  this latest case, an exception will be raised if the conversion
I don't know what this means
if self._validate:
    return check_array(X, accept_sparse=self.accept_sparse)
else:
    # convert X to NumPy array when this is a list
no, I don't think this is an acceptable change in terms of backwards compatibility.
you may get memory issues etc if it's a list of long strings, for instance, since numpy will copy it into contiguous memory.
Also, you've suggested above that it will be converted into a 2d array, but there's no 2d here.
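Both objections can be seen with a small NumPy experiment: converting a list of strings allocates a fixed-width array sized for the longest element, and the result is 1-D rather than 2-D. This is only an illustrative sketch; the list contents are made up.

```python
import numpy as np

# One short and one long string (illustrative data only).
strings = ["a", "b" * 1000]
arr = np.asarray(strings)

# NumPy picks a fixed-width unicode dtype wide enough for the longest item,
# so every element occupies the same number of bytes (4 bytes per character).
item_bytes = arr.dtype.itemsize
total_bytes = arr.nbytes

# The conversion does not add a second dimension.
n_dims = arr.ndim
```

Here the one-character string is stored in a slot as wide as the 1000-character one, which is the kind of memory blow-up the comment warns about.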
What I was suggesting is that if a new option becomes default, it should probably have this property about lists, not that it should happen now.
So you mean to only raise a FutureWarning for the moment?
OK so this is good to be reviewed once again.
Yes, I think this is good, for now at least.
with pytest.warns(expected_warning) as results:
    transformer.fit_transform(X)
if expected_warning is None:
    assert len(results) == 0
is this assert not results?
I vote +1 for using validate=False as default since I don't think it's the duty of FunctionTransformer to check the input. What's more, I agree that DataFrame in & 2d-array out is not friendly.
doc/modules/preprocessing.rst
Outdated
@@ -663,7 +663,7 @@ error with a ``filterwarnings``::
  >>> import warnings
  >>> warnings.filterwarnings("error", message=".*check_inverse*.",
  ...                         category=UserWarning, append=False)

unrelated change
doc/whats_new/v0.20.rst
Outdated
@@ -301,6 +301,7 @@ Miscellaneous
  :issue:`9101` by :user:`alex-33 <alex-33>`
  and :user:`Maskani Filali Mohamed <maskani-moh>`.

unrelated change
doc/whats_new/v0.20.rst
Outdated
@@ -643,6 +644,10 @@ Misc
- Invalid input for :class:`model_selection.ParameterGrid` now raises TypeError.
  :issue:`10928` by :user:`Solutus Immensus <solutusimmensus>`

- In :class:`preprocessing.FunctionTransformer`, the default of ``validate``
Why in the Misc section? Maybe move to Preprocessing section?
self.func = func
self.inverse_func = inverse_func
self.validate = validate
self.accept_sparse = accept_sparse
self.force_all_finite = force_all_finite
What's the purpose of it? Seems that force_all_finite is not used?
good catch, it is a leftover from another option
ping @qinhanmin2014 ready to be merged :)
             accept_sparse=False, pass_y='deprecated', check_inverse=True,
             kw_args=None, inv_kw_args=None):
def __init__(self, func=None, inverse_func=None, validate=None,
             accept_sparse=False, force_all_finite=True,
I think we also need to remove force_all_finite here
ping @jnothman @glemaitre and other people around
In practice changing all the examples and tests can be a nuisance because it makes it unclear whether it's a meaningful part of the test, and then we fail to change it back. But we probably should be scaling features in svm examples. Basically, either way is annoying.
I'll merge this one. This PR only modifies tests when they fail, which seems reasonable from my side. @jnothman I think someone needs to make a decision so that we can check merged PRs and tell contributors what to do in open PRs. Several PRs which change the default value of a parameter are marked as blocker.
Reference Issues/PRs
closes #10655
What does this implement/fix? Explain your changes.
Added the following option
validate=False will be the default in the future.
validate=False.
Any other comments?