Improvements to check_array to handle heterogenous / object data #12148

Open
@jorisvandenbossche

With the recent improvements to OneHotEncoder and ColumnTransformer, we now have estimators in scikit-learn that explicitly accept object-dtyped data or heterogeneous data / DataFrames. In several cases, we have noted that we need to update utility functions like check_array to better handle those new use cases.

Some cases encountered:

Related issue: #11447, about avoiding a np.asarray call in check_array for duck-typed arrays (e.g. to preserve dask arrays instead of converting them to numpy arrays). There is certainly overlap with this issue, but since heterogeneous data also occurs in some other use cases, I am opening a separate issue.
In that issue, @jnothman mentioned the idea of having something like accept={'array','array-like','frame-like'} in check_array.
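
To make the accept= idea a bit more concrete, here is a rough sketch of how such a keyword might dispatch. Nothing here is existing scikit-learn API: `check_input`, the `accept` values and the duck-typed "frame-like" test are all assumptions made up for illustration.

```python
import numpy as np


def check_input(X, accept=("array",)):
    """Hypothetical stand-in for check_array with an ``accept`` option."""
    # Assumption: an object exposing both ``dtypes`` and ``iloc`` is treated
    # as "frame-like" (this matches pandas DataFrames, among others).
    is_frame_like = hasattr(X, "dtypes") and hasattr(X, "iloc")
    if is_frame_like and "frame" in accept:
        # Pass the frame through untouched so per-column dtypes survive.
        return X
    # Default path: coerce everything to a single numpy array.
    return np.asarray(X)
```

With accept=("array", "frame") a DataFrame would come back as the same object, while the default accept=("array",) keeps the np.asarray conversion.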


So from the above use cases, some possible improvements for check_array to better handle object / heterogeneous data:

  • Ability to pass through DataFrames as-is (as DataFrames, but potentially still checking dtypes, missing values, etc.)
    • Have an accept_frame=True/False keyword, like the existing accept_sparse? Or have a more general accept= keyword that can be extended later (e.g. accept=['array', 'frame'])?
  • A way to indicate that the dtype should be "numeric-or-object". Meaning: if the input is numeric, keep it numeric; if it is object, keep it object (up to here this is the dtype=None behaviour); but if the input has no dtype yet, the coerced array should be either numeric or object. Basically, ensure that a list gets converted to an object array unless it is fully numerical.
    • A new value that can be passed to dtype? check_array(X, dtype="numeric-or-object")? Or making sure that something like dtype=['numeric', 'object'] would work (although this would not preserve a string dtype).
  • Follow the force_all_finite keyword also for object dtype data (until now check_array did not need to handle object dtype data, so it could assume that only float data could ever contain NaNs).
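
As an illustration of the last two points, a minimal numpy-only sketch of the "numeric-or-object" coercion and of a missing-value check that also covers object arrays could look like the following. The helper names `as_numeric_or_object` and `has_missing` are hypothetical, and treating None as missing is an assumption of this sketch, not current check_array behaviour.

```python
import numpy as np

NUMERIC_KINDS = "biufc"  # bool, int, unsigned int, float, complex


def as_numeric_or_object(X):
    """Hypothetical helper: keep numeric data numeric, make the rest object."""
    arr = np.asarray(X)
    if arr.dtype.kind in NUMERIC_KINDS or arr.dtype == object:
        return arr
    # Strings or mixed lists: rebuild from the original input so values keep
    # their Python types instead of being cast to fixed-width strings.
    # Note that this does not preserve a string dtype.
    return np.asarray(X, dtype=object)


def _is_missing(value):
    # Assumption: None and float NaN both count as missing (NaN != NaN).
    return value is None or (isinstance(value, float) and value != value)


def has_missing(arr):
    """Hypothetical check for missing values, including in object arrays."""
    if arr.dtype.kind in "fc":
        return bool(np.isnan(arr).any())
    if arr.dtype == object:
        # np.isnan does not work on object arrays, so test element-wise.
        return any(_is_missing(v) for v in arr.ravel())
    return False
```

Here a list such as [["a", 1], ["b", 2]] ends up as an object array with the integers intact, whereas plain np.asarray would turn every value into a string.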
