Description
Describe the bug
Unexpected behaviour when a DataFrame has integer column labels other than the natural ordering [0, 1, ...].
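import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import Normalizer

select_int = make_column_selector(dtype_include=np.int_)
ct = ColumnTransformer([('t2', Normalizer(norm="l1"), select_int)])
df1 = pd.DataFrame({'1': [1, 2, 3], '2': [9, 8, 7]})  # string column labels
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})  # integer column labels
ct.fit_transform(df1) # OK
ct.fit_transform(df2) # IndexError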
The only difference between df1 and df2 is the type of the column labels. In my opinion, the results for these two dataframes should be the same, but an error is raised for the latter.
As far as I could see, the problem stems from a semantic ambiguity as to when to use iloc-based indexing vs loc-based indexing. In _get_column_indices L382 this decision is based on the type of the index and not on the type of the array. Whichever criterion is chosen, if it is followed consistently in the column selectors, the error should be avoided. Probably.
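To make the loc vs iloc ambiguity concrete, here is a minimal pandas-only sketch. It assumes the selector hands back the integer column labels [1, 2], which are then read as positions; that is my reading of the behaviour described above, not a confirmed trace through the scikit-learn internals. The names selected, positions and by_position are only for illustration.

```python
import pandas as pd

# Same frame as df2 above: integer column labels that do not start at 0.
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})

# Assumption: the selector returns column labels, here the integers 1 and 2.
selected = [1, 2]

# Label-based lookup finds both columns.
by_label = df2.loc[:, selected]

# Position-based lookup reads 2 as "the third column", which does not exist.
try:
    df2.iloc[:, selected]
    positional_ok = True
except IndexError:
    positional_ok = False

# Hypothetical fix-up: translate labels to positions before positional indexing.
positions = df2.columns.get_indexer(selected)
by_position = df2.iloc[:, positions]
```

With labels translated to positions first, both lookups return the same two columns, which is the kind of consistency the paragraph above is asking for.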
Steps/Code to Reproduce
(See above)
Expected Results
(See above)
Actual Results
(See above)
Versions
Python dependencies:
pip: 22.0.3
setuptools: 60.8.1
sklearn: 1.1.dev0
numpy: 1.22.2
scipy: 1.8.0
Cython: 0.29.27
pandas: 1.3.5
matplotlib: 3.5.0
joblib: 1.1.0
threadpoolctl: 3.1.0
commit b28c5bba66529217ceedd497201a684e5d35b73c (upstream/main, origin/main, origin/HEAD, main)
Author: Thomas J. Fan
Date: Tue Feb 15 11:46:54 2022 -0500
FIX DummyRegressor overriding constant (#22486)