Description
Describe the bug
pandas has extension data types such as pd.Int64Dtype and pd.Float64Dtype that use pd.NA to represent missing values.
These datatypes in DataFrames get converted to np.float64 by sklearn.utils.validation.check_array.
If they have missing values, check_array converts them to np.nan and therefore they work fine with scikit-learn estimators that can handle missing values.
However, when transformed by a sklearn.compose.ColumnTransformer, pandas DataFrames with extension dtypes become NumPy arrays with the object dtype.
When check_array is called on these NumPy arrays, the pd.NA conversion (done by calling pd.DataFrame.astype) is not applied, so the conversion to float fails if they contain missing values.
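The difference between the two paths can be seen without scikit-learn. The following is a minimal sketch, not the actual check_array code: the object array is built by hand as a stand-in for the ColumnTransformer output, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# A nullable Float64 column with one missing entry (stored as pd.NA).
X = pd.DataFrame({"A": pd.array([0.5, None], dtype="Float64")})

# DataFrame-level cast: pandas maps pd.NA to np.nan.
as_float = X.astype("float64").to_numpy()

# Hand-built object array holding pd.NA, standing in for the
# ColumnTransformer output (assumption made for illustration).
as_object = np.array([[0.5], [pd.NA]], dtype=object)

try:
    np.asarray(as_object, dtype=np.float64)
    conversion_failed = False
except TypeError:
    # float(pd.NA) is not defined, so NumPy cannot build the float array.
    conversion_failed = True

print(as_float.dtype, conversion_failed)
```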
Steps/Code to Reproduce
check_array produces a float64 array, but ColumnTransformer produces an object array:
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.utils.validation import check_array
X = pd.DataFrame({"A": [0.5]}).convert_dtypes()
print(X["A"].dtype) # Float64
X1 = check_array(X, force_all_finite=False)
print(X1.dtype) # float64
transformer = make_column_transformer(("passthrough", ["A"]))
X2 = transformer.fit_transform(X)
print(X2.dtype) # object
This causes a TypeError if the array has missing values and is later passed to an estimator:
from sklearn.ensemble import HistGradientBoostingRegressor
X = pd.DataFrame({"A": [0.5, None]}).convert_dtypes()
HistGradientBoostingRegressor().fit(X, [0.0, 0.0]) # ok
HistGradientBoostingRegressor().fit(transformer.fit_transform(X), [0.0, 0.0]) # TypeError
Expected Results
The output of the ColumnTransformer for Float64Dtype inputs would ideally be float64, with missing values represented by np.nan, and fitting the HistGradientBoostingRegressor on it would not raise an error.
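In the meantime, a user-side workaround might look like the sketch below. The helper na_to_nan is hypothetical (it is not part of scikit-learn or pandas) and it assumes every non-missing entry of the object array is numeric.

```python
import numpy as np
import pandas as pd


def na_to_nan(array):
    """Return a float64 copy of `array` with missing entries as np.nan.

    Hypothetical helper, not a scikit-learn or pandas function.
    """
    array = np.asarray(array, dtype=object)
    # pd.isna flags pd.NA, None and np.nan element-wise in an object array.
    missing = pd.isna(array)
    # Swap missing entries for np.nan first, then cast to float64.
    return np.where(missing, np.nan, array).astype(np.float64)


converted = na_to_nan(np.array([[0.5], [pd.NA]], dtype=object))
print(converted.dtype)
```

Passing na_to_nan(transformer.fit_transform(X)) to the estimator instead of the raw object array would be expected to avoid the TypeError, at the cost of an extra copy.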
Actual Results
Float64
float64
object
Traceback (most recent call last):
File "/tmp/example.py", line 21, in <module>
HistGradientBoostingRegressor().fit(transformer.fit_transform(X), [0.0, 0.0]) # TypeError
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/base.py", line 1151, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 371, in fit
X, y = self._validate_data(X, y, dtype=[X_DTYPE], force_all_finite=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/base.py", line 621, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1147, in check_X_y
X = check_array(
^^^^^^^^^^^^
File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/utils/validation.py", line 917, in check_array
array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/utils/_array_api.py", line 380, in _asarray_with_order
array = numpy.asarray(array, order=order, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: float() argument must be a string or a real number, not 'NAType'
Versions
System:
python: 3.11.5 (main, Aug 25 2023, 13:19:50) [GCC 11.4.0]
executable: /home/jerome/.virtualenvs/df/bin/python
machine: Linux-6.2.0-33-generic-x86_64-with-glibc2.35
Python dependencies:
sklearn: 1.3.0
pip: 23.2.1
setuptools: 65.5.0
numpy: 1.25.2
scipy: 1.11.2
Cython: None
pandas: 2.1.0
matplotlib: 3.7.3
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 4
prefix: libgomp
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
user_api: blas
internal_api: openblas
num_threads: 4
prefix: libopenblas
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
version: 0.3.23.dev
threading_layer: pthreads
architecture: Prescott
user_api: blas
internal_api: openblas
num_threads: 4
prefix: libopenblas
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
version: 0.3.21.dev
threading_layer: pthreads
architecture: Prescott