PandasAdapter causes crash or misattributed features #31051

@nicolas-bolle

Describe the bug

If all of the following hold:

  • ColumnTransformer is used with the output container set to pandas
  • At least one transformer maps 1D inputs to 2D outputs (like DictVectorizer)
  • At least one transformer maps 2D inputs to 2D outputs (like FunctionTransformer)
  • The input is a pandas DataFrame with a non-default index

then fit/transform with the ColumnTransformer either crashes because of index misalignment or (in pathological situations) silently permutes the outputs of some transformers, so that the row for the first data point mixes features from the first data point with features from the second.
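
Both failure modes are consistent with pandas aligning rows by index label when DataFrames are concatenated column-wise. The following is a minimal pandas-only sketch, not scikit-learn's actual code path. It assumes that one transformer's output ends up with a default RangeIndex while the other keeps the input index. The names `left`, `right`, and `right_swapped` are hypothetical stand-ins for those outputs.

```python
import pandas as pd

# Hypothetical stand-in for an output that lost the input index (default 0, 1).
left = pd.DataFrame({"foo": [1, 3]})

# Hypothetical stand-in for an output that kept the input index [1, 2].
right = pd.DataFrame({"dummy_col": [1, 2]}, index=[1, 2])

# Column-wise concat aligns on index labels, so the result covers labels 0, 1, 2:
# three rows for two samples, i.e. an inconsistent number of samples.
mismatched = pd.concat([left, right], axis=1)
print(mismatched.shape)

# With index [1, 0] the labels match but in a different order, so alignment
# silently pairs foo=1 (label 0) with dummy_col=2 (also label 0).
right_swapped = pd.DataFrame({"dummy_col": [1, 2]}, index=[1, 0])
permuted = pd.concat([left, right_swapped], axis=1)
print(permuted)
```

The first case corresponds to the crash and the second to the silently permuted column described above.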

Steps/Code to Reproduce

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    'dict_col': [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}],
    'dummy_col': [1, 2]
}, index=[1, 2])  # replace with [1, 0] for pathological example
                 
t = make_column_transformer(
    (DictVectorizer(sparse=False), 'dict_col'),
    (FunctionTransformer(), ['dummy_col']),
)
t.set_output(transform='pandas')

t.fit_transform(df)

Expected Results

The following features dataframe:

   dictvectorizer__bar  dictvectorizer__baz  dictvectorizer__foo  functiontransformer__dummy_col
0                    2                    0                    1                               1
1                    0                    1                    3                               2

Actual Results

A crash:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 17
     11 t = make_column_transformer(
     12     (DictVectorizer(sparse=False), 'dict_col'),
     13     (FunctionTransformer(), ['dummy_col']),
     14 )
     15 t.set_output(transform='pandas')
---> 17 t.fit_transform(df)

File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\utils\_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    317 @wraps(f)
    318 def wrapped(self, X, *args, **kwargs):
--> 319     data_to_wrap = f(self, X, *args, **kwargs)
    320     if isinstance(data_to_wrap, tuple):
    321         # only wrap the first output for cross decomposition
    322         return_tuple = (
    323             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    324             *data_to_wrap[1:],
    325         )

File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1382     estimator._validate_params()
   1384 with config_context(
   1385     skip_parameter_validation=(
   1386         prefer_skip_nested_validation or global_skip_validation
   1387     )
   1388 ):
-> 1389     return fit_method(estimator, *args, **kwargs)

File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\compose\_column_transformer.py:1031, in ColumnTransformer.fit_transform(self, X, y, **params)
   1028 self._validate_output(Xs)
   1029 self._record_output_indices(Xs)
-> 1031 return self._hstack(list(Xs), n_samples=n_samples)

File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\compose\_column_transformer.py:1215, in ColumnTransformer._hstack(self, Xs, n_samples)
   1213     output_samples = output.shape[0]
   1214     if output_samples != n_samples:
-> 1215         raise ValueError(
   1216             "Concatenating DataFrames from the transformer's output lead to"
   1217             " an inconsistent number of samples. The output may have Pandas"
   1218             " Indexes that do not match, or that transformers are returning"
   1219             " number of samples which are not the same as the number input"
   1220             " samples."
   1221         )
   1223     return output
   1225 return np.hstack(Xs)

ValueError: Concatenating DataFrames from the transformer's output lead to an inconsistent number of samples. The output may have Pandas Indexes that do not match, or that transformers are returning number of samples which are not the same as the number input samples.

Or the following for the pathological example (note the two entries in functiontransformer__dummy_col are in the wrong order):

   dictvectorizer__bar  dictvectorizer__baz  dictvectorizer__foo  functiontransformer__dummy_col
0                    2                    0                    1                               2
1                    0                    1                    3                               1

Versions

System:
    python: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
executable: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Scripts\python.exe
   machine: Windows-11-10.0.26100-SP0

Python dependencies:
      sklearn: 1.6.1
          pip: 24.2
   setuptools: 77.0.3
        numpy: 2.2.4
        scipy: 1.15.2
       Cython: None
       pandas: 2.2.3
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\numpy.libs\libscipy_openblas64_-43e11ff0749b8cbe0a615c9cf6737e0e.dll
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: vcomp
       filepath: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\scipy.libs\libscipy_openblas-f07f5a5d207a3a47104dca54d6d0c86a.dll
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell
