FunctionTransformer needs feature_names_out even if func returns a DataFrame #28780

Open
@fedorkobak

Describe the bug

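Calling transform on a FunctionTransformer configured with feature_names_out raises an error when func returns a DataFrame whose column names differ from the names produced by feature_names_out. The error message advises using set_output(transform='pandas'), but doing so does not change anything: the same error is raised.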

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

my_transformer = FunctionTransformer(
    lambda X : pd.concat(
        [
            X[col].rename(f"{col} {str(power)}")**power
            for col in X
            for power in range(2,4)
        ],
        axis=1
    ),
    feature_names_out = (
        lambda transformer, input_features: [
            f"{feature} {power_str}"
            for feature in input_features
            for power_str in ["square", "cubic"]
        ]
    )
)
# Explicitly request pandas output, as the error message advises
my_transformer.set_output(transform='pandas')
sample_size = 10
X = pd.DataFrame({
    "feature 1" : [1,2,3,4,5],
    "feature 2" : [3,4,5,6,7]
})
my_transformer.fit(X)
my_transformer.transform(X)

Expected Results

A pandas.DataFrame like the following:

   feature 1 square  feature 1 cubic  feature 2 square  feature 2 cubic
0                 1                1                 9               27
1                 4                8                16               64
2                 9               27                25              125
3                16               64                36              216
4                25              125                49              343

Actual Results

ValueError: The output generated by `func` have different column names than the ones provided by `get_feature_names_out`. Got output with columns names: ['feature 1 2', 'feature 1 3', 'feature 2 2', 'feature 2 3'] and `get_feature_names_out` returned: ['feature 1 square', 'feature 1 cubic', 'feature 2 square', 'feature 2 cubic']. The column names can be overridden by setting `set_output(transform='pandas')` or `set_output(transform='polars')` such that the column names are set to the names provided by `get_feature_names_out`.

Versions

System:
    python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.5.0-14-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 68.2.2
        numpy: 1.24.2
        scipy: 1.11.1
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.7.1
       joblib: 1.3.1
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/fedor/.local/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/fedor/.local/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/fedor/.local/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12
