ColumnTransformer converts pandas extension datatypes to object #27482

@jeromedockes

Describe the bug

pandas has extension data types, such as pd.Int64Dtype and pd.Float64Dtype, that use pd.NA to represent null values.
DataFrames with these dtypes are converted to np.float64 by sklearn.utils.validation.check_array.
If they contain missing values, check_array converts them to np.nan, so they work fine with scikit-learn estimators that can handle missing values.

However, when transformed by a sklearn.compose.ColumnTransformer, pandas DataFrames with extension dtypes become NumPy arrays with the object dtype.
When check_array is called on these NumPy arrays, the pd.NA conversion (done by calling pd.DataFrame.astype) is not applied, and if the arrays contain missing values the conversion to float fails.
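
To illustrate the two conversion paths outside scikit-learn, here is a minimal sketch using only pandas and NumPy. The object array is built by hand as a stand-in for a ColumnTransformer output (an assumption, not the transformer's actual code path), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A nullable Float64 column containing one missing value (pd.NA).
X = pd.DataFrame({"A": [0.5, None]}).convert_dtypes()

# Path 1: cast while the data is still a DataFrame.
# DataFrame.astype maps pd.NA to np.nan for a float64 target.
via_astype = X.astype("float64").to_numpy()

# Path 2: an object array that still holds pd.NA. It is built by hand
# here as a stand-in for the array described in this report.
as_object = np.array([[0.5], [pd.NA]], dtype=object)

# Casting the object array with NumPy calls float() on every element,
# and float(pd.NA) is not defined, so the cast raises TypeError.
try:
    np.asarray(as_object, dtype=np.float64)
    object_cast_error = None
except TypeError as exc:
    object_cast_error = exc

print(via_astype.dtype)
print(type(object_cast_error).__name__)
```

The first path succeeds because pandas knows how to translate its own missing-value marker. The second fails because NumPy only sees an opaque Python object.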

Steps/Code to Reproduce

check_array produces a float64 array, but ColumnTransformer produces an object array:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.utils.validation import check_array

X = pd.DataFrame({"A": [0.5]}).convert_dtypes()
print(X["A"].dtype) # Float64

X1 = check_array(X, force_all_finite=False)
print(X1.dtype) # float64

transformer = make_column_transformer(("passthrough", ["A"]))
X2 = transformer.fit_transform(X)
print(X2.dtype) # object

This causes a TypeError if the array has missing values and is later passed to an estimator:

from sklearn.ensemble import HistGradientBoostingRegressor

X = pd.DataFrame({"A": [0.5, None]}).convert_dtypes()
HistGradientBoostingRegressor().fit(X, [0.0, 0.0]) # ok
HistGradientBoostingRegressor().fit(transformer.fit_transform(X), [0.0, 0.0]) # TypeError

Expected Results

The output of the ColumnTransformer for Float64Dtype inputs would ideally be float64, with missing values represented by np.nan, and fitting the HistGradientBoostingRegressor on that output would not raise an error.
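
Until the transformer behaves this way, one possible user-side workaround is to cast nullable numeric columns to float64 before the DataFrame reaches the ColumnTransformer. The sketch below assumes that approach; `cast_nullable_to_float64` is a hypothetical helper written for this example, not part of scikit-learn or pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_numeric_dtype


def cast_nullable_to_float64(df):
    """Hypothetical helper: cast numeric extension-dtype columns to float64.

    pd.NA values become np.nan; all other columns are left untouched.
    """
    nullable = [
        col
        for col in df.columns
        if is_extension_array_dtype(df[col].dtype)
        and is_numeric_dtype(df[col].dtype)
    ]
    return df.astype({col: "float64" for col in nullable})


# Columns A and B become Float64 and Int64 after convert_dtypes().
X = pd.DataFrame({"A": [0.5, None], "B": [1, None]}).convert_dtypes()
X_float = cast_nullable_to_float64(X)

print(X_float.dtypes)
```

Passing `X_float` rather than `X` to `transformer.fit_transform` should then yield a float64 array. This has not been verified across scikit-learn versions, and it only sidesteps the problem rather than changing what ColumnTransformer does with extension dtypes.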

Actual Results

Float64
float64
object
Traceback (most recent call last):
  File "/tmp/example.py", line 21, in <module>
    HistGradientBoostingRegressor().fit(transformer.fit_transform(X), [0.0, 0.0]) # TypeError
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 371, in fit
    X, y = self._validate_data(X, y, dtype=[X_DTYPE], force_all_finite=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/base.py", line 621, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1147, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/utils/validation.py", line 917, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jerome/.virtualenvs/df/lib/python3.11/site-packages/sklearn/utils/_array_api.py", line 380, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: float() argument must be a string or a real number, not 'NAType'

Versions

System:
    python: 3.11.5 (main, Aug 25 2023, 13:19:50) [GCC 11.4.0]
executable: /home/jerome/.virtualenvs/df/bin/python
   machine: Linux-6.2.0-33-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 65.5.0
        numpy: 1.25.2
        scipy: 1.11.2
       Cython: None
       pandas: 2.1.0
   matplotlib: 3.7.3
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libgomp
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Prescott
