Description
With the recent improvements to `OneHotEncoder` and `ColumnTransformer`, we now have estimators in scikit-learn that explicitly accept object-dtyped data or heterogeneous data / DataFrames. In several cases, we have noted that we need to update utility functions like `check_array` to better handle those new use cases.
Some cases encountered:
- In the `OneHotEncoder`, we have some special code to deal with lists of strings / mixed numeric and string values (`sklearn/preprocessing/_encoders.py`, lines 49 to 53 at 2d66034). This part specifically targets lists, where numpy's `np.array([mixed list])` converts everything to string once there is at least a single string in the list (a short demonstration follows this list).
- Also in the `OneHotEncoder`, there is a second part that checks for NaN values in the case of object data (`sklearn/preprocessing/_encoders.py`, lines 55 to 58 at 2d66034), which could also be moved to `check_array` (see BUG: always raise on NaN in OneHotEncoder for object dtype data #12033 (comment)); the second snippet after this list illustrates why object data needs its own NaN check.
- In the `ColumnTransformer`, we want to convert lists to arrays but preserve DataFrames / sparse matrices (active PR: [MRG] Convert ColumnTransformer input list to numpy array #12104), for which we are currently adding a workaround (`sklearn/compose/_column_transformer.py`, lines 515 to 519 at 864c2cc); a sketch of that kind of check is included after this list.
- In the `OneHotEncoder`, we encode column by column, so it would not be necessary to first convert a DataFrame to an array before accessing columns; we could check the data only after accessing a single column, to avoid dtype conversion (opened a separate issue about this: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147). The last snippet after this list sketches this idea.
- In the `FunctionTransformer` PR to pass through DataFrames, there was discussion about this as well: [MRG+2] ENH Passthrough DataFrame in FunctionTransformer #11043 (currently, the `FunctionTransformer` cannot do both validation and pass-through as a DataFrame).
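
To make the first case concrete, this is the numpy list-coercion behaviour the `OneHotEncoder` code works around:

```python
import numpy as np

# A single string in the list makes numpy stringify every element:
print(np.array([1, 2, 'a']).dtype)  # <U21 -- the integers became strings
# Asking for object dtype keeps the original values intact:
print(np.array([1, 2, 'a'], dtype=object))  # [1 2 'a'], dtype=object
```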
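For the second case: `np.isnan` does not work on object arrays, so the NaN check for object data needs separate handling. A minimal illustration (the element-wise scan is just one possible approach, not necessarily the code at the referenced lines):

```python
import numpy as np

X = np.array(['a', np.nan, 'b'], dtype=object)
try:
    np.isnan(X)  # raises: isnan is not supported for object arrays
except TypeError as exc:
    print(exc)
# A Python-level scan works instead; NaN is the only value unequal to itself:
print(any(x != x for x in X))  # True
```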
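For the `ColumnTransformer` case, the workaround amounts to converting only plain lists while leaving anything array-like or sparse untouched. A rough sketch of that idea (not the exact code at the referenced lines):

```python
import numpy as np
from scipy import sparse


def _coerce_list_input(X):
    # DataFrames and ndarrays expose __array__; plain Python lists do not,
    # so only lists (and similar) get converted to a numpy array here.
    if not (hasattr(X, '__array__') or sparse.issparse(X)):
        X = np.asarray(X)
    return X
```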
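Finally, for the column-by-column case, the idea is to validate each column separately so that each keeps its own dtype, instead of first collapsing the whole frame into a single object array. A minimal sketch, assuming pandas input:

```python
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'num': [1.0, 2.0], 'cat': ['a', 'b']})
for name in df.columns:
    # Each column is validated on its own: 'num' stays float64,
    # 'cat' stays object, and no full-frame conversion happens.
    col = check_array(df[name].values.reshape(-1, 1), dtype=None)
    print(name, col.dtype)
```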
Related issue: #11447 about avoiding a `np.asarray` call in `check_array` for duck-typed arrays (e.g. to preserve dask arrays instead of converting them to numpy arrays). There is certainly overlap with this issue, but since heterogeneous data also occurs in some other use cases, I am opening a separate issue. @jnothman mentioned in that issue the idea of having something like `accept={'array', 'array-like', 'frame-like'}` in `check_array`.
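
For context, `check_array` currently converts a DataFrame to a numpy array, losing the per-column dtypes along the way:

```python
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
checked = check_array(df, dtype=None)
print(type(checked))  # <class 'numpy.ndarray'>
print(checked.dtype)  # object: heterogeneous columns collapse to one dtype
```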
So from the above use cases, some possible improvements for `check_array` to better handle object / heterogeneous data (a rough sketch combining these ideas follows this list):
- Ability to pass through DataFrames as-is (as DataFrames, but potentially still check dtypes, missing values, etc.)
  - Have an `accept_frame=True/False` keyword like there is an `accept_sparse`? Or have a more general `accept=` keyword that can be extended later (e.g. `accept=['array', 'frame']`)?
- A way to indicate that the dtype should be "numeric-or-object". Meaning: if it is numeric, keep it numeric; if it is object, keep it object (up to here this would be the `dtype=None` behaviour); but if there is no dtype yet, the coerced array should be either numeric or object. Basically, ensuring a list gets converted to an object array unless it is fully numerical.
  - A new value that can be passed to `dtype`? `check_array(X, dtype="numeric-or-object")`? Or making sure that something like `dtype=['numeric', 'object']` would work (although this would not preserve a string dtype).
- Follow the `force_all_finite` keyword also for object dtype data (until now `check_array` did not need to handle object dtype data, so it could assume that only float data could ever contain NaNs).
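
To make the direction more tangible, here is a very rough sketch of how such an extended `check_array` could behave. The `accept_frame` keyword, the `"numeric-or-object"` dtype value, and the `hasattr(X, "iloc")` duck-check are all hypothetical, not an existing API:

```python
import numpy as np
from sklearn.utils import check_array


def check_array_extended(X, accept_frame=False, dtype="numeric-or-object",
                         **kwargs):
    if accept_frame and hasattr(X, "iloc"):
        # Hypothetical: pass the DataFrame through as-is (dtype and
        # missing-value checks could still happen here).
        return X
    if dtype == "numeric-or-object":
        arr = np.asarray(X)
        if arr.dtype.kind in "OUS":
            # Coerce string / mixed input to an object array ...
            return check_array(arr, dtype=object, **kwargs)
        # ... and keep fully numeric input numeric.
        return check_array(arr, dtype="numeric", **kwargs)
    return check_array(X, dtype=dtype, **kwargs)


print(check_array_extended([[1, 'a'], [2, 'b']]).dtype)  # object
print(check_array_extended([[1, 2], [3, 4]]).dtype)      # stays numeric (e.g. int64)
```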