Improvements to check_array to handle heterogeneous / object data #12148

Open
jorisvandenbossche opened this issue Sep 24, 2018 · 0 comments
@jorisvandenbossche
Member

With the recent improvements to OneHotEncoder and ColumnTransformer, we now have estimators in scikit-learn that explicitly accept object-dtyped data or heterogeneous data / DataFrames. In several cases, we have noted that we need to update utility functions like check_array to better handle those new use cases.

Some cases encountered:

Related issue: #11447 about avoiding a np.asarray call in check_array for duck-typed arrays (e.g. to preserve dask arrays instead of converting them to numpy arrays). There is certainly overlap with this issue, but since heterogeneous data also occurs in other use cases, I am opening a separate issue.
In that issue, however, @jnothman mentioned the idea of having something like accept={'array','array-like','frame-like'} in check_array.
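
To make the accept= idea concrete, here is a minimal sketch of how such a keyword might behave. Everything in it is an assumption for illustration: check_input is a hypothetical helper (not scikit-learn API), the accepted values are limited to "array" and "frame", and frame detection is done by duck typing on the iloc and dtypes attributes.

```python
import numpy as np


def check_input(X, accept=("array",)):
    """Hypothetical helper, not part of scikit-learn.

    Returns frame-like input unchanged when "frame" is accepted;
    otherwise converts the input to a NumPy array.
    """
    allowed = {"array", "frame"}
    unknown = set(accept) - allowed
    if unknown:
        raise ValueError(f"Unknown accept values: {sorted(unknown)}")

    # Duck-typed DataFrame detection: an assumption of this sketch.
    is_frame_like = hasattr(X, "iloc") and hasattr(X, "dtypes")
    if is_frame_like and "frame" in accept:
        # Pass through as-is; dtype or missing-value checks could still run here.
        return X

    # Default path: coerce to an ndarray, as check_array does today.
    return np.asarray(X)
```

With a pandas DataFrame df, check_input(df, accept=["array", "frame"]) would hand back df itself, while check_input(df) would still convert it to an array.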


So from the above use cases, some possible improvements for check_array to better handle object / heterogeneous data:

  • Ability to pass through DataFrames as-is (as DataFrames, but potentially still check dtypes, missing values, etc)
    • Have an accept_frame=True/False keyword, like the existing accept_sparse? Or have a more general accept= keyword that can be extended later (e.g. accept=['array', 'frame'])?
  • A way to indicate that the dtype should be "numeric-or-object". Meaning: if it is numeric, keep numeric; if it is object, keep object (up to here this is the dtype=None behaviour); but if the input has no dtype yet, the coerced array should be either numeric or object. Basically, this ensures that a list gets converted to an object array unless it is fully numerical.
    • A new value that can be passed to dtype, e.g. check_array(X, dtype="numeric-or-object")? Or making sure that something like dtype=['numeric', 'object'] would work (although this would not preserve a string dtype).
  • Follow the force_all_finite keyword also for object dtype data (until now check_array did not need to handle object dtype data, so it could assume that only float data could ever contain NaNs).
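
The last two points can be sketched with plain NumPy. This is only an illustration under stated assumptions: as_numeric_or_object and object_has_missing are hypothetical helper names, booleans are treated as numeric, and "missing" is taken to mean None or a float NaN. Neither function exists in scikit-learn.

```python
import numpy as np


def as_numeric_or_object(X):
    """Hypothetical helper: numeric array if possible, otherwise object."""
    if hasattr(X, "dtype"):
        # Input already carries a dtype: keep it (dtype=None behaviour).
        return np.asarray(X)
    arr = np.asarray(X)
    # Kinds b/i/u/f/c: bool, signed int, unsigned int, float, complex.
    if arr.dtype.kind in "biufc":
        return arr
    # Strings or mixed content: use object instead of a NumPy string dtype,
    # converting from the original input so the elements keep their types.
    return np.asarray(X, dtype=object)


def object_has_missing(arr):
    """Hypothetical helper: True if an object array holds None or a float NaN."""
    for value in arr.ravel():
        if value is None:
            return True
        # NaN is the only float that is not equal to itself.
        if isinstance(value, float) and value != value:
            return True
    return False
```

A check along the lines of object_has_missing is what honouring force_all_finite for object data would need, since an elementwise np.isnan call raises a TypeError on object arrays that contain strings.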