Closed
Description
When a NumPy array or a pandas Series contains pd.NA, check_array does not fail gracefully: it raises an unrelated AttributeError instead of a clear validation error.
In [1]: import pandas as pd
In [2]: s = pd.Series([1, 2, None], dtype="Int64")
In [3]: from sklearn.utils.validation import check_array
In [4]: check_array(s, ensure_2d=False)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [4], line 1
----> 1 check_array(s, ensure_2d=False)
File ~/Documents/packages/scikit-learn/sklearn/utils/validation.py:912, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
906 raise ValueError(
907 "Found array with dim %d. %s expected <= 2."
908 % (array.ndim, estimator_name)
909 )
911 if force_all_finite:
--> 912 _assert_all_finite(
913 array,
914 input_name=input_name,
915 estimator_name=estimator_name,
916 allow_nan=force_all_finite == "allow-nan",
917 )
919 if ensure_min_samples > 0:
920 n_samples = _num_samples(array)
File ~/Documents/packages/scikit-learn/sklearn/utils/validation.py:110, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
108 # for object dtype data, we only check for NaNs (GH-13254)
109 if X.dtype == np.dtype("object") and not allow_nan:
--> 110 if _object_dtype_isnan(X).any():
111 raise ValueError("Input contains NaN")
113 # We need only consider float arrays, hence can early return for all else.
AttributeError: 'bool' object has no attribute 'any'
The reason is that _object_dtype_isnan is intended to perform the following elementwise comparison:
In [7]: s.to_numpy() != s.to_numpy()
<ipython-input-7-e1263eaa78fa>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
s.to_numpy() != s.to_numpy()
Out[7]: True
This returns a single boolean instead of a boolean array, because the elementwise comparison fails when it encounters pd.NA:
In [5]: bool(pd.NA == pd.NA)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [5], line 1
----> 1 bool(pd.NA == pd.NA)
File ~/mambaforge/envs/dev/lib/python3.10/site-packages/pandas/_libs/missing.pyx:382, in pandas._libs.missing.NAType.__bool__()
TypeError: boolean value of NA is ambiguous
We should probably change the code to handle this case.
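One possible direction, sketched here only as an illustration and not as the fix scikit-learn adopted: detect missing values in an object-dtype array with pd.isna, which works elementwise and recognizes pd.NA. The helper name object_array_has_missing is hypothetical, and the sketch assumes pandas is installed, which scikit-learn itself does not require.

```python
import numpy as np
import pandas as pd


def object_array_has_missing(X):
    """Hypothetical helper: report whether an object array holds a missing value.

    Unlike the `X != X` trick, pd.isna never calls bool() on pd.NA, so it
    always returns an elementwise boolean array.
    Note: pd.isna also treats None as missing, which `X != X` does not.
    """
    X = np.asarray(X, dtype=object)
    return bool(pd.isna(X).any())


s = pd.Series([1, 2, None], dtype="Int64")
if object_array_has_missing(s.to_numpy()):
    # A validator could raise ValueError("Input contains NaN") at this point.
    print("missing value detected")
```

With a check of this kind, the input from the report would reach the existing "Input contains NaN" error path instead of failing with an AttributeError.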