ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices) #22556

@avm19

Describe the bug

Unexpected behaviour when a DataFrame's columns are labelled with integers other than the natural ordering [0, 1, ...].

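import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import Normalizer

# Select every integer-typed column and L1-normalize it.
select_int = make_column_selector(dtype_include=np.int_)
ct = ColumnTransformer([('t2', Normalizer(norm="l1"), select_int)])
df1 = pd.DataFrame({'1': [1, 2, 3], '2': [9, 8, 7]})  # string column labels
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})  # integer column labels
ct.fit_transform(df1) # OK
ct.fit_transform(df2) # IndexError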

The only difference between df1 and df2 is the type of the column labels (strings vs. integers). In my opinion, the results for these two dataframes should be the same, but an error is raised for the latter.
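To make that claim concrete, here is a small pandas-only check (a sketch; the variable names `same_values`, `label_types_df1` and `label_types_df2` are mine, not part of scikit-learn) showing that the two frames hold the same values and differ only in the Python type of their column labels:

```python
import pandas as pd

df1 = pd.DataFrame({'1': [1, 2, 3], '2': [9, 8, 7]})
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})

# The cell values of both frames are identical.
same_values = bool((df1.to_numpy() == df2.to_numpy()).all())

# Only the Python type of the column labels differs.
label_types_df1 = [type(c).__name__ for c in df1.columns.tolist()]
label_types_df2 = [type(c).__name__ for c in df2.columns.tolist()]
```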

As far as I can see, the problem stems from a semantic ambiguity as to when to use iloc-based (positional) indexing vs loc-based (label) indexing. In _get_column_indices L382 this decision is based on the type of the index and not on the type of the array. Whichever criterion is chosen, if it were followed consistently in the column selectors, the error would probably be avoided.
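The ambiguity can be illustrated with pandas alone. The sketch below does not call any scikit-learn internals; it assumes that a dtype-based selector returns column labels, and it uses `select_dtypes` as a stand-in for that step. The same list of integers succeeds as labels and fails as positions. The last step, translating labels to positions with `Index.get_indexer`, is only one possible way to make the two views agree, not necessarily how the library should fix it.

```python
import pandas as pd

df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})

# A dtype-based selection yields column *labels*, here the integers 1 and 2.
labels = df2.select_dtypes(include="number").columns.tolist()

# Label-based lookup finds both columns.
by_label = df2.loc[:, labels]

# Positional lookup reads the same integers as positions 1 and 2,
# but a two-column frame only has positions 0 and 1.
try:
    df2.iloc[:, labels]
    positional_error = None
except IndexError as exc:
    positional_error = exc

# Converting labels to positions first removes the ambiguity.
positions = df2.columns.get_indexer(labels).tolist()
by_position = df2.iloc[:, positions]
```

With string labels such as '1' and '2' the question never arises, because a list of strings can only be interpreted as labels.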

Steps/Code to Reproduce

(See above)

Expected Results

(See above)

Actual Results

(See above)

Versions

Python dependencies:
          pip: 22.0.3
   setuptools: 60.8.1
      sklearn: 1.1.dev0
        numpy: 1.22.2
        scipy: 1.8.0
       Cython: 0.29.27
       pandas: 1.3.5
   matplotlib: 3.5.0
       joblib: 1.1.0
threadpoolctl: 3.1.0

commit b28c5bba66529217ceedd497201a684e5d35b73c (upstream/main, origin/main, origin/HEAD, main)
Author: Thomas J. Fan 
Date:   Tue Feb 15 11:46:54 2022 -0500

    FIX DummyRegressor overriding constant (#22486)
