Description
With the recent improvements to `OneHotEncoder` and `ColumnTransformer`, we now have estimators in scikit-learn that explicitly accept object-dtyped data or heterogeneous data / DataFrames. In several cases, we have noted that we need to update utility functions like `check_array` to better handle those new use cases.
Some cases encountered:
- In the `OneHotEncoder`, we have some special code to deal with lists of strings / mixed numeric and string values (`sklearn/preprocessing/_encoders.py`, lines 49 to 53 at 2d66034). This part specifically targets lists, where numpy's `np.array([mixed list])` converts everything to string once there is at least a single string in the list (a short demonstration follows this list).
- Also in the `OneHotEncoder`, there is a second part that checks for NaN values in the case of object data (`sklearn/preprocessing/_encoders.py`, lines 55 to 58 at 2d66034), which could also be moved to `check_array` (see BUG: always raise on NaN in OneHotEncoder for object dtype data #12033 (comment)); the second snippet after this list illustrates why object data needs its own NaN check.
- In the `ColumnTransformer`, we want to convert lists to arrays but preserve DataFrames / sparse matrices (active PR: [MRG] Convert ColumnTransformer input list to numpy array #12104), for which we are currently adding a workaround (`sklearn/compose/_column_transformer.py`, lines 515 to 519 at 864c2cc); a sketch of that kind of check is included after this list.
- In the `OneHotEncoder`, we encode column by column, so it would not be necessary to first convert a DataFrame to an array before accessing columns; we could check the data only after accessing a single column, to avoid dtype conversion (opened a separate issue about this: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147). The last snippet after this list sketches this idea.
- In the `FunctionTransformer` PR to pass through DataFrames, there was discussion about this as well: [MRG+2] ENH Passthrough DataFrame in FunctionTransformer #11043 (currently, the `FunctionTransformer` cannot do both validation and pass-through as a DataFrame).
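
To make the first case concrete, this is the numpy list-coercion behaviour the `OneHotEncoder` code works around:

```python
import numpy as np

# A single string in the list makes numpy stringify every element:
print(np.array([1, 2, 'a']).dtype)  # <U21 -- the integers became strings
# Asking for object dtype keeps the original values intact:
print(np.array([1, 2, 'a'], dtype=object))  # [1 2 'a'], dtype=object
```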
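For the second case: `np.isnan` does not work on object arrays, so the NaN check for object data needs separate handling. A minimal illustration (the element-wise scan is just one possible approach, not necessarily the code at the referenced lines):

```python
import numpy as np

X = np.array(['a', np.nan, 'b'], dtype=object)
try:
    np.isnan(X)  # raises: isnan is not supported for object arrays
except TypeError as exc:
    print(exc)
# A Python-level scan works instead; NaN is the only value unequal to itself:
print(any(x != x for x in X))  # True
```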
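For the `ColumnTransformer` case, the workaround amounts to converting only plain lists while leaving anything array-like or sparse untouched. A rough sketch of that idea (not the exact code at the referenced lines):

```python
import numpy as np
from scipy import sparse


def _coerce_list_input(X):
    # DataFrames and ndarrays expose __array__; plain Python lists do not,
    # so only lists (and similar) get converted to a numpy array here.
    if not (hasattr(X, '__array__') or sparse.issparse(X)):
        X = np.asarray(X)
    return X
```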
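Finally, for the column-by-column case, the idea is to validate each column separately so that each keeps its own dtype, instead of first collapsing the whole frame into a single object array. A minimal sketch, assuming pandas input:

```python
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'num': [1.0, 2.0], 'cat': ['a', 'b']})
for name in df.columns:
    # Each column is validated on its own: 'num' stays float64,
    # 'cat' stays object, and no full-frame conversion happens.
    col = check_array(df[name].values.reshape(-1, 1), dtype=None)
    print(name, col.dtype)
```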
Related issue: #11447 about avoiding a `np.asarray` call in `check_array` for duck-typed arrays (e.g. to preserve dask arrays instead of converting them to numpy arrays). There is certainly overlap with this issue, but since heterogeneous data also occurs in some other use cases, I am opening a separate issue. @jnothman mentioned in that issue the idea of having something like `accept={'array', 'array-like', 'frame-like'}` in `check_array`.
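
For context, `check_array` currently converts a DataFrame to a numpy array, losing the per-column dtypes along the way:

```python
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
checked = check_array(df, dtype=None)
print(type(checked))  # <class 'numpy.ndarray'>
print(checked.dtype)  # object: heterogeneous columns collapse to one dtype
```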
So from the above use cases, some possible improvements for `check_array` to better handle object / heterogeneous data (a rough sketch combining these ideas follows this list):
- Ability to pass through DataFrames as-is (as DataFrames, but potentially still check dtypes, missing values, etc.)
  - Have an `accept_frame=True/False` keyword like there is an `accept_sparse`? Or have a more general `accept=` keyword that can be extended later (e.g. `accept=['array', 'frame']`)?
- A way to indicate that the dtype should be "numeric-or-object". Meaning: if it is numeric, keep it numeric; if it is object, keep it object (up to here this would be the `dtype=None` behaviour); but if there is no dtype yet, the coerced array should be either numeric or object. Basically, ensuring a list gets converted to an object array unless it is fully numerical.
  - A new value that can be passed to `dtype`? `check_array(X, dtype="numeric-or-object")`? Or making sure that something like `dtype=['numeric', 'object']` would work (although this would not preserve a string dtype).
- Follow the `force_all_finite` keyword also for object dtype data (until now `check_array` did not need to handle object dtype data, so it could assume that only float data could ever contain NaNs).
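
To make the direction more tangible, here is a very rough sketch of how such an extended `check_array` could behave. The `accept_frame` keyword, the `"numeric-or-object"` dtype value, and the `hasattr(X, "iloc")` duck-check are all hypothetical, not an existing API:

```python
import numpy as np
from sklearn.utils import check_array


def check_array_extended(X, accept_frame=False, dtype="numeric-or-object",
                         **kwargs):
    if accept_frame and hasattr(X, "iloc"):
        # Hypothetical: pass the DataFrame through as-is (dtype and
        # missing-value checks could still happen here).
        return X
    if dtype == "numeric-or-object":
        arr = np.asarray(X)
        if arr.dtype.kind in "OUS":
            # Coerce string / mixed input to an object array ...
            return check_array(arr, dtype=object, **kwargs)
        # ... and keep fully numeric input numeric.
        return check_array(arr, dtype="numeric", **kwargs)
    return check_array(X, dtype=dtype, **kwargs)


print(check_array_extended([[1, 'a'], [2, 'b']]).dtype)  # object
print(check_array_extended([[1, 2], [3, 4]]).dtype)      # stays numeric (e.g. int64)
```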