check_array does not gracefully fail with pd.NA #25078

@glemaitre

When a NumPy array or a pandas Series contains pd.NA, check_array does not fail gracefully: instead of a clear validation error, it raises an unrelated AttributeError.

In [1]: import pandas as pd

In [2]: s = pd.Series([1, 2, None], dtype="Int64")

In [3]: from sklearn.utils.validation import check_array

In [4]: check_array(s, ensure_2d=False)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [4], line 1
----> 1 check_array(s, ensure_2d=False)

File ~/Documents/packages/scikit-learn/sklearn/utils/validation.py:912, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    906         raise ValueError(
    907             "Found array with dim %d. %s expected <= 2."
    908             % (array.ndim, estimator_name)
    909         )
    911     if force_all_finite:
--> 912         _assert_all_finite(
    913             array,
    914             input_name=input_name,
    915             estimator_name=estimator_name,
    916             allow_nan=force_all_finite == "allow-nan",
    917         )
    919 if ensure_min_samples > 0:
    920     n_samples = _num_samples(array)

File ~/Documents/packages/scikit-learn/sklearn/utils/validation.py:110, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    108 # for object dtype data, we only check for NaNs (GH-13254)
    109 if X.dtype == np.dtype("object") and not allow_nan:
--> 110     if _object_dtype_isnan(X).any():
    111         raise ValueError("Input contains NaN")
    113 # We need only consider float arrays, hence can early return for all else.

AttributeError: 'bool' object has no attribute 'any'

The reason is that _object_dtype_isnan relies on the following comparison:

In [7]: s.to_numpy() != s.to_numpy()
<ipython-input-7-e1263eaa78fa>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  s.to_numpy() != s.to_numpy()
Out[7]: True

This returns a single boolean because the elementwise comparison fails when it involves pd.NA:

In [5]: bool(pd.NA == pd.NA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 bool(pd.NA == pd.NA)

File ~/mambaforge/envs/dev/lib/python3.10/site-packages/pandas/_libs/missing.pyx:382, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

We should probably change the code to handle this case.
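
One possible direction, as a rough sketch only and not the actual scikit-learn fix: evaluate the self-inequality per element and treat a scalar whose truth value is ambiguous as missing. To keep the example dependency-free, `_FakeNA` is a hypothetical stand-in that mimics the pd.NA behaviour shown above (comparisons return NA, `bool()` raises `TypeError`); `object_isnan` and `assert_no_missing` are illustrative names, not existing scikit-learn helpers.

```python
class _FakeNA:
    """Hypothetical stand-in for pd.NA: comparisons propagate NA, bool() raises."""

    def __eq__(self, other):
        return self

    def __ne__(self, other):
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _FakeNA()


def object_isnan(values):
    """Return one boolean per element, True where the element is missing."""
    mask = []
    for v in values:
        try:
            # NaN is the only ordinary value that is not equal to itself.
            mask.append(bool(v != v))
        except TypeError:
            # Assumption: a scalar with no truth value is an NA-like marker.
            mask.append(True)
    return mask


def assert_no_missing(values):
    """Raise the same message as the traceback above if anything is missing."""
    if any(object_isnan(values)):
        raise ValueError("Input contains NaN")


print(object_isnan([1, 2, NA]))
print(object_isnan([1.0, float("nan"), "a"]))
```

With this approach the mask always has one entry per element, so the later check never sees a bare bool. Catching `TypeError` broadly is a simplification; a real fix would more likely rely on a dedicated missing-value check.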
