Closed
Description
Describe the bug
If all the following hold
- Using ColumnTransformer with the output container set to pandas
- At least one transformer transforms 1D inputs to 2D outputs (like DictVectorizer)
- At least one transformer transforms 2D inputs to 2D outputs (like FunctionTransformer)
- The input is a pandas DataFrame with non-default index
then fit/transform with the ColumnTransformer either crashes because of index misalignment or, in pathological situations, silently permutes the outputs of some transformers, so that a single output row mixes features from different inputs (for example, the first row gets some features from the first data point and some from the second).
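The two failure modes can be sketched with pandas alone. This is only an illustration under an assumption: that one transformer's output ends up with a default `RangeIndex` while the other keeps the input's index. The frames `left`, `right_disjoint` and `right_permuted` are hypothetical stand-ins, not scikit-learn internals.

```python
import pandas as pd

# Hypothetical stand-ins for two transformer outputs (not sklearn internals).
left = pd.DataFrame({"a": [10, 20]})  # default RangeIndex: 0, 1
right_disjoint = pd.DataFrame({"b": [1, 2]}, index=[1, 2])
right_permuted = pd.DataFrame({"b": [1, 2]}, index=[1, 0])

# Column-wise concat aligns on index labels, not on row position.
# Labels {0, 1} and {1, 2} only partly overlap, so the result has 3 rows.
misaligned = pd.concat([left, right_disjoint], axis=1)
print(misaligned.shape)

# Labels {0, 1} and {1, 0} match as sets, so the row count stays at 2,
# but column "b" is reordered to follow the labels of `left`.
permuted = pd.concat([left, right_permuted], axis=1)
print(permuted["b"].tolist())
```

In the first case the extra row is what a row-count check would flag; in the second case nothing looks wrong from the shape alone.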
Steps/Code to Reproduce
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
df = pd.DataFrame({
    'dict_col': [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}],
    'dummy_col': [1, 2]
}, index=[1, 2])  # replace with [1, 0] for pathological example
t = make_column_transformer(
    (DictVectorizer(sparse=False), 'dict_col'),
    (FunctionTransformer(), ['dummy_col']),
)
t.set_output(transform='pandas')
t.fit_transform(df)
Expected Results
The following features dataframe:
| | dictvectorizer__bar | dictvectorizer__baz | dictvectorizer__foo | functiontransformer__dummy_col |
|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 |
| 1 | 0 | 1 | 3 | 2 |
Actual Results
A crash:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 17
11 t = make_column_transformer(
12 (DictVectorizer(sparse=False), 'dict_col'),
13 (FunctionTransformer(), ['dummy_col']),
14 )
15 t.set_output(transform='pandas')
---> 17 t.fit_transform(df)
File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\utils\_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
317 @wraps(f)
318 def wrapped(self, X, *args, **kwargs):
--> 319 data_to_wrap = f(self, X, *args, **kwargs)
320 if isinstance(data_to_wrap, tuple):
321 # only wrap the first output for cross decomposition
322 return_tuple = (
323 _wrap_data_with_container(method, data_to_wrap[0], X, self),
324 *data_to_wrap[1:],
325 )
File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1382 estimator._validate_params()
1384 with config_context(
1385 skip_parameter_validation=(
1386 prefer_skip_nested_validation or global_skip_validation
1387 )
1388 ):
-> 1389 return fit_method(estimator, *args, **kwargs)
File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\compose\_column_transformer.py:1031, in ColumnTransformer.fit_transform(self, X, y, **params)
1028 self._validate_output(Xs)
1029 self._record_output_indices(Xs)
-> 1031 return self._hstack(list(Xs), n_samples=n_samples)
File ~\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\compose\_column_transformer.py:1215, in ColumnTransformer._hstack(self, Xs, n_samples)
1213 output_samples = output.shape[0]
1214 if output_samples != n_samples:
-> 1215 raise ValueError(
1216 "Concatenating DataFrames from the transformer's output lead to"
1217 " an inconsistent number of samples. The output may have Pandas"
1218 " Indexes that do not match, or that transformers are returning"
1219 " number of samples which are not the same as the number input"
1220 " samples."
1221 )
1223 return output
1225 return np.hstack(Xs)
ValueError: Concatenating DataFrames from the transformer's output lead to an inconsistent number of samples. The output may have Pandas Indexes that do not match, or that transformers are returning number of samples which are not the same as the number input samples.
Or the following for the pathological example (note the two entries in functiontransformer__dummy_col are in the wrong order):
| | dictvectorizer__bar | dictvectorizer__baz | dictvectorizer__foo | functiontransformer__dummy_col |
|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 |
| 1 | 0 | 1 | 3 | 1 |
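The traceback shows that `_hstack` compares the number of rows after concatenation with `n_samples`. A rough pandas-only sketch of why that kind of check catches the crash case but would let the permuted case through is given below. `row_count_ok` imitates the row-count comparison, and `indexes_match` is a hypothetical stricter check; neither is scikit-learn code, and the frames are made-up stand-ins for transformer outputs.

```python
import pandas as pd


def row_count_ok(frames, n_samples):
    """Imitate the row-count comparison: concat column-wise, then compare lengths."""
    return pd.concat(frames, axis=1).shape[0] == n_samples


def indexes_match(frames):
    """Hypothetical stricter check: every frame shares the first frame's index."""
    first = frames[0].index
    return all(frame.index.equals(first) for frame in frames[1:])


a = pd.DataFrame({"x": [10, 20]})  # default index 0, 1
b_shifted = pd.DataFrame({"y": [1, 2]}, index=[1, 2])
b_swapped = pd.DataFrame({"y": [1, 2]}, index=[1, 0])

# Shifted labels: the concat grows to 3 rows, so the row-count check fails.
shifted_result = (row_count_ok([a, b_shifted], 2), indexes_match([a, b_shifted]))
# Swapped labels: still 2 rows, so only the index comparison notices.
swapped_result = (row_count_ok([a, b_swapped], 2), indexes_match([a, b_swapped]))
print(shifted_result, swapped_result)
```

Comparing indexes directly is order-sensitive, which is why it flags the swapped case even though the set of labels is identical.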
Versions
System:
python: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep 6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
executable: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Scripts\python.exe
machine: Windows-11-10.0.26100-SP0
Python dependencies:
sklearn: 1.6.1
pip: 24.2
setuptools: 77.0.3
numpy: 2.2.4
scipy: 1.15.2
Cython: None
pandas: 2.2.3
matplotlib: None
joblib: 1.4.2
threadpoolctl: 3.6.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libscipy_openblas
filepath: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\numpy.libs\libscipy_openblas64_-43e11ff0749b8cbe0a615c9cf6737e0e.dll
version: 0.3.28
threading_layer: pthreads
architecture: Haswell
user_api: openmp
internal_api: openmp
num_threads: 12
prefix: vcomp
filepath: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\sklearn\.libs\vcomp140.dll
version: None
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libscipy_openblas
filepath: C:\Users\user\Documents\Code\Python\Scratchwork\pandas_adapter\Lib\site-packages\scipy.libs\libscipy_openblas-f07f5a5d207a3a47104dca54d6d0c86a.dll
version: 0.3.28
threading_layer: pthreads
architecture: Haswell