
Add column selector to Imputer #6967

Closed

mfeurer opened this issue Jul 8, 2016 · 10 comments

Comments

@mfeurer
Contributor

mfeurer commented Jul 8, 2016

Currently, the Imputer imputes every column with the same strategy. Let's assume I have a dataset with mixed categorical and numerical features, both containing missing values, and I want to use it in a pipeline object:

import numpy as np
import scipy.stats
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.ensemble

# Columns 0-3 are categorical, columns 4-7 are numerical
categorical_data = np.random.randint(0, 5, size=(10, 4))
numerical_data = np.random.randn(10, 4)
data = np.hstack((categorical_data, numerical_data))

# Add a missing value to one categorical and one numerical column
data[0, 1] = np.nan
data[1, 5] = np.nan

X = data
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

If I now try to use imputation, I can only choose a single strategy for both data types:

pipeline = sklearn.pipeline.Pipeline([
    ('imp', sklearn.preprocessing.Imputer(strategy='most_frequent')),
    ('ohe', sklearn.preprocessing.OneHotEncoder(
        categorical_features=[0, 1, 2, 3])),
    ('rf', sklearn.ensemble.RandomForestClassifier()),
])

which would result in a more or less random value being picked for the missing value of the continuous feature:

# _pre_transform is a private Pipeline method that fits and applies all
# transformers up to (but not including) the final estimator
Xt, _ = pipeline._pre_transform(X, y)
# The imputed value comes from scipy.stats.mode, which makes very little
# sense for a continuous feature
print(scipy.stats.mode(data)[0][0, 5])  # mode of the continuous column 5
print(Xt.toarray()[1])

To overcome this, I propose adding a new parameter to the Imputer that specifies the columns to impute, which would allow something like this:

pipeline = sklearn.pipeline.Pipeline([
    ('imp_cat', sklearn.preprocessing.Imputer(strategy='most_frequent',
                                              columns=[0, 1, 2, 3])),
    ('imp_num', sklearn.preprocessing.Imputer(strategy='median',
                                              columns=[4, 5, 6, 7])),
    ('ohe', sklearn.preprocessing.OneHotEncoder(
        categorical_features=[0, 1, 2, 3])),
    ('rf', sklearn.ensemble.RandomForestClassifier()),
])

I could not find a related issue and am willing to work on this if this is considered worth adding to scikit-learn.

@raghavrv
Member

raghavrv commented Jul 8, 2016

I think it would be a good idea to add a column selector, impute_features maybe. @jnothman @agramfort @amueller thoughts?

@mfeurer
Contributor Author

mfeurer commented Jul 8, 2016

I just found that the same issue could also be handled as in this example or by waiting for #2034. This is probably the more general way to go?

@jnothman
Member

jnothman commented Jul 9, 2016

I'm somewhat ambivalent about this for that reason. It would be lovely to see something like #3886 merged.

@mfeurer
Contributor Author

mfeurer commented Jul 11, 2016

What about adding the ItemSelector from the example to scikit-learn until #3886 is merged? The code already exists, and including it would save people from duplicating it whenever they want to achieve what the example does.
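For reference, the ItemSelector from that example is roughly the following (paraphrased from the hetero_feature_union example in the scikit-learn docs; the column-index variant mentioned in the comment is an assumption, not part of the example):

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select data at a given key from a dict-like object
    (e.g. a dict of arrays or a pandas DataFrame)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless transformer: nothing to fit
        return self

    def transform(self, X):
        # For a plain 2-D array, a column variant would instead
        # return X[:, self.key]
        return X[self.key]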

@amueller amueller added this to the 0.18 milestone Jul 28, 2016
@amueller
Member

Hm... Maybe add a ColumnSelector (which I find a more intuitive name than ItemSelector) and then add something that implements #2034?
I think @GaelVaroquaux's main concern was that he doesn't want pandas-like indexing in the preprocessing package, though this would be useful even if your data is not in a pandas DataFrame. We probably need a robust way to select columns.
Should we add sklearn.preprocessing.mixed? Or rather sklearn.feature_extraction.mixed? I feel it's more preprocessing than feature extraction, though @GaelVaroquaux suggested putting the new thing in feature_extraction.heterogeneous, IIRC.

I'd really like to solve @mfeurer's issue... [I'm doing binge reviewing now, and will then prioritize what to work on. Ping me again if you haven't heard from me in a week]

@amueller
Member

(I just wrote this and I'm not sure if I should go home for the day: https://gist.github.com/amueller/643f812a275a9e0c75048aab6988a92c)

@amueller
Member

untagging 0.18

@stroykova

stroykova commented Nov 29, 2017

Hello!
I had the same problem, and I found a way to do this with pipelines on Stack Overflow:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, Imputer

# Assumes the input is a pandas DataFrame with an 'Age' column
def get_num_cols(df):
    return df[['Age']]

# Select the numerical column(s), then impute them with the mean
vec = make_union(
    make_pipeline(FunctionTransformer(get_num_cols, validate=False),
                  Imputer(strategy='mean')),
)

Does this solve the issue? It seems this approach became possible only after the issue was created.
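Applied to the original example above, a two-branch version of this pattern might look like the following sketch (the column split assumes, as above, that columns 0-3 are categorical and 4-7 numerical; the selector function names are illustrative):

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, Imputer

def select_categorical(X):
    return X[:, [0, 1, 2, 3]]

def select_numerical(X):
    return X[:, [4, 5, 6, 7]]

# Impute categorical columns with the mode and numerical columns with the
# median, then concatenate the two blocks side by side
vec = make_union(
    make_pipeline(FunctionTransformer(select_categorical, validate=False),
                  Imputer(strategy='most_frequent')),
    make_pipeline(FunctionTransformer(select_numerical, validate=False),
                  Imputer(strategy='median')),
)
Xt = vec.fit_transform(X)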

@jnothman
Member

jnothman commented Nov 29, 2017 via email

@jnothman
Member

I think we'll close this given ColumnTransformer, and if the issue is still acute, we'll see it raised again...
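For completeness, with the ColumnTransformer that landed in scikit-learn 0.20, the original example can be written roughly as follows (a sketch only; SimpleImputer replaced the since-deprecated Imputer, and the column indices carry over the assumptions from above):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Impute categorical columns with the mode and one-hot encode them;
# impute numerical columns with the median
preprocess = ColumnTransformer([
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore'))]),
     [0, 1, 2, 3]),
    ('num', SimpleImputer(strategy='median'), [4, 5, 6, 7]),
])

clf = Pipeline([('prep', preprocess),
                ('rf', RandomForestClassifier())])
clf.fit(X, y)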
