With the recent improvements to `OneHotEncoder` and `ColumnTransformer`, we now have estimators in scikit-learn that explicitly accept object-dtyped data or heterogeneous data / DataFrames. In several cases, we have noted that we need to update utility functions like `check_array` to better handle those new use cases.
Some cases encountered:
- In the `OneHotEncoder`, we have some special code to deal with lists of strings / mixed numeric and string values (`sklearn/preprocessing/_encoders.py`, lines 49 to 53 in 2d66034). This part specifically targets lists, where `np.array([mixed_list])` converts everything to string once there is at least one string in the list (see the snippet after this list).
- Also in the `OneHotEncoder`, there is a second part that checks for NaN values in the case of object data (`sklearn/preprocessing/_encoders.py`, lines 55 to 58 in 2d66034), which could also be moved to `check_array` (see BUG: always raise on NaN in OneHotEncoder for object dtype data #12033 (comment)).
- In the `ColumnTransformer`, we want to convert lists to arrays, but preserve DataFrames / sparse matrices (active PR: [MRG] Convert ColumnTransformer input list to numpy array #12104), for which we are currently adding a workaround (`sklearn/compose/_column_transformer.py`, lines 515 to 519 in 864c2cc).
- In the `OneHotEncoder`, we do the encoding column by column, so it would not be necessary to first convert a DataFrame to an array before accessing its columns; we could check the data only after accessing a single column (to avoid dtype conversion). (I opened a separate issue about this: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147.)
- In the `FunctionTransformer` PR to pass through DataFrames, there was discussion about this as well: [MRG+2] ENH Passthrough DataFrame in FunctionTransformer #11043 (currently, the `FunctionTransformer` cannot do both validation and pass-through as a DataFrame).
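For reference, a quick NumPy-only illustration of the two behaviours mentioned in the first two bullets (nothing scikit-learn-specific involved):

```python
import numpy as np

# A mixed list is coerced to a string dtype as soon as one string is
# present, silently turning the numeric values into strings:
np.array([1, 2, 'a'])
# -> array(['1', '2', 'a'], dtype='<U21')

# Forcing object dtype preserves the original Python values instead:
np.array([1, 2, 'a'], dtype=object)
# -> array([1, 2, 'a'], dtype=object)

# np.isnan raises a TypeError on object arrays, so NaN detection needs
# a workaround; elementwise "x != x" is True only for NaN:
X = np.array([1, np.nan, 'a'], dtype=object)
X != X
# -> array([False,  True, False])
```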
Related issue: #11447, about avoiding a `np.asarray` call in `check_array` for duck-typed arrays (e.g. to preserve dask arrays instead of converting them to numpy arrays). There is certainly overlap with this issue, but since heterogeneous data also occur in some other use cases, I am opening a separate issue.
But @jnothman mentioned in that issue the idea of having something like `accept={'array', 'array-like', 'frame-like'}` in `check_array`.
So from the above use cases, some possible improvements for `check_array` to better handle object / heterogeneous data:
- Ability to pass through DataFrames as-is (as DataFrames, but potentially still check dtypes, missing values, etc.).
  - Have an `accept_frame=True/False` keyword, like the existing `accept_sparse`? Or a more general `accept=` keyword that can be extended later (e.g. `accept=['array', 'frame']`)?
- A way to indicate that the dtype should be "numeric-or-object". Meaning: if the data is numeric, keep it numeric; if it is object, keep it object (up to here this matches the `dtype=None` behaviour); but if there is no dtype yet, the coerced array should be either numeric or object. Basically, ensure a list gets converted to an object array unless it is fully numerical (see the sketch after this list).
  - A new value that can be passed to `dtype`, e.g. `check_array(X, dtype="numeric-or-object")`? Or making sure that something like `dtype=['numeric', 'object']` would work (although this would not preserve a string dtype)?
- Follow the `force_all_finite` keyword also for object-dtype data (until now, `check_array` did not need to handle object-dtype data, so it could assume that only float data could ever contain NaNs).
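To make the above a bit more concrete, here is a minimal sketch of what the combined behaviour could look like. This is purely hypothetical: `check_array_sketch` is a made-up name, and only the `accept` and `"numeric-or-object"` spellings come from the proposals above; it is not the actual `check_array` implementation.

```python
import numpy as np
import pandas as pd

def check_array_sketch(X, accept=('array',), dtype='numeric-or-object',
                       force_all_finite=True):
    """Hypothetical sketch of the proposals above; not the real API."""
    if isinstance(X, pd.DataFrame):
        if 'frame' in accept:
            # Pass the DataFrame through as-is; dtype / missing-value
            # checks could still be applied per column here.
            return X
        X = X.values  # current behaviour: coerce to an ndarray

    if isinstance(X, list) and dtype == 'numeric-or-object':
        # Coerce to a numeric array if possible, otherwise fall back to
        # object dtype, so a mixed list never becomes a string array.
        try:
            X = np.asarray(X, dtype=np.float64)
        except (TypeError, ValueError):
            X = np.asarray(X, dtype=object)
    else:
        X = np.asarray(X)

    if force_all_finite:
        if X.dtype == object:
            # np.isnan raises on object arrays; use the NaN != NaN trick.
            has_nan = bool(np.any(X != X))
        elif np.issubdtype(X.dtype, np.floating):
            has_nan = bool(np.isnan(X).any())
        else:
            has_nan = False
        if has_nan:
            raise ValueError("Input contains NaN.")
    return X
```

With this sketch, `check_array_sketch([1, 'a', 3])` would return `array([1, 'a', 3], dtype=object)` rather than a string array, and `check_array_sketch(df, accept=('array', 'frame'))` would return the DataFrame untouched.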