Can't provide feature indices for OneHotEncoder in pipeline #8539

Closed
amueller opened this issue Mar 5, 2017 · 6 comments · Fixed by #9012

@amueller
Member

amueller commented Mar 5, 2017

Let's say I want to apply a transformation only to some features in a pipeline, such as imputation or one-hot-encoding (or scaling, which currently doesn't support this).
I could provide the indices of the columns I want to transform. But if there are any previous steps in the pipeline, they might re-arrange the features in some arbitrary way (like OneHotEncoder does).

Example

import numpy as np

# assume the second feature is categorical and the third is continuous
X = [[np.nan, np.nan, 5], [np.nan, 1, 3], [np.nan, 1, np.nan]]

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder

pipe = make_pipeline(Imputer(strategy='most_frequent'), OneHotEncoder(categorical_features=[1], sparse=False))

pipe.fit_transform(X)

array([[ 0., 1., 1.],
       [ 1., 0., 1.],
       [ 1., 0., 1.]])

desired outcome:

array([[ 1., 5.],
       [ 1., 3.],
       [ 1., 3.]])

Even if each output feature corresponded to exactly one input feature, and we knew which one that was, there would be no way to specify this in OneHotEncoder. This might look contrived, but it is a pretty obvious use case whenever you have per-column metadata.
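The index shift in the example above can be reproduced without scikit-learn. The following is a minimal pure-Python sketch; `drop_all_missing_columns` is a hypothetical stand-in for an imputation step that discards a column with no observed values, not a library function.

```python
import math


def drop_all_missing_columns(rows):
    """Return the rows without all-NaN columns, plus the kept original indices."""
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols)
            if not all(math.isnan(row[j]) for row in rows)]
    return [[row[j] for j in kept] for row in rows], kept


nan = float("nan")
# Column 0 is entirely missing, column 1 is categorical, column 2 is continuous.
X = [[nan, nan, 5.0], [nan, 1.0, 3.0], [nan, 1.0, nan]]

X_reduced, kept = drop_all_missing_columns(X)

categorical_index = 1  # refers to the ORIGINAL column layout

# After the drop, position 1 holds original column 2 (the continuous one),
# so reusing the original index selects the wrong feature.
original_column_at_index = kept[categorical_index]

# Translating the original index to its new position needs the `kept`
# bookkeeping, which a downstream step does not have access to.
correct_position = kept.index(categorical_index)
```

Here `kept` plays the role of the column index that would have to be carried along between steps.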

The only solution I see is to carry along a column index (or column names) and allow passing that to the transformer.
Given my experience of .iloc vs .loc in pandas, I'm not entirely happy with the prospect.
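One way to picture "carrying column names along" is a sketch in which every step takes and returns both the data and the current list of names, so a later step can select by name rather than by position. The helpers `drop_empty_columns` and `one_hot_by_name` and the column names are hypothetical, made up for illustration; none of this is scikit-learn API.

```python
def drop_empty_columns(rows, names):
    """Drop columns whose values are all None; return data and surviving names."""
    keep = [j for j in range(len(names))
            if any(row[j] is not None for row in rows)]
    return [[row[j] for j in keep] for row in rows], [names[j] for j in keep]


def one_hot_by_name(rows, names, column):
    """One-hot encode the column called `column`, wherever it currently sits."""
    j = names.index(column)  # resolve the name against the current layout
    categories = sorted({row[j] for row in rows})
    out_rows = []
    for row in rows:
        encoded = [1.0 if row[j] == c else 0.0 for c in categories]
        rest = [v for k, v in enumerate(row) if k != j]
        out_rows.append(encoded + rest)
    out_names = ([f"{column}={c}" for c in categories]
                 + [n for n in names if n != column])
    return out_rows, out_names


rows = [[None, "a", 5.0], [None, "b", 3.0], [None, "a", 4.0]]
names = ["empty", "colour", "size"]

# The first step removes a column; the second still finds "colour" by name.
rows, names = drop_empty_columns(rows, names)
rows, names = one_hot_by_name(rows, names, "colour")
```

This is label-based selection, the analogue of `.loc`; the ambiguity between labels and positions is exactly the `.iloc` vs `.loc` concern raised above.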

cc @mfeurer

Conceptually somewhat related to #8480 and scikit-learn/enhancement_proposals#5 as they deal with feature meta-data.

[and then I introduced hierarchical indices over columns into scikit-learn.... not]

@jnothman
Member

jnothman commented Mar 5, 2017 via email

@amueller
Member Author

amueller commented Mar 5, 2017

don't apologize, I haven't even read it yet :-/

If pipeline uses clone, you can not specify the callback at construction time, right?

@amueller
Member Author

amueller commented Mar 5, 2017

The best solution here is just to some names with data;

Sorry, can't parse

@jnothman
Member

jnothman commented Mar 5, 2017

The best solution here is just to some names with data

some -> pass

If pipeline uses clone, you can not specify the callback at construction time, right?

Yes, you're right. But it's not about Pipeline using clone, since the problem is already present in the context of cross-validation, where the whole pipeline is cloned, so any object references are invalidated. The best we could do without passing around feature descriptions, then, might be for the pipeline to inform its transformers of their context, e.g. by calling my_transformer.inform_context(containing_pipeline, step_name) before fit. Yuck!
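To make the `inform_context` idea concrete, here is a rough sketch. `ToyPipeline` and `ContextAwareTransformer` are hypothetical classes written for this illustration, and `copy.deepcopy` is used only as a stand-in for cloning during cross-validation; no such method exists in scikit-learn.

```python
import copy


class ContextAwareTransformer:
    """Hypothetical transformer that is told where it sits in a pipeline."""

    def __init__(self):
        self.context = None

    def inform_context(self, containing_pipeline, step_name):
        # Remember the enclosing pipeline and this step's name.
        self.context = (containing_pipeline, step_name)

    def fit(self, X):
        return self

    def transform(self, X):
        return X


class ToyPipeline:
    """Hypothetical pipeline that informs each step of its context before fit."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        for name, step in self.steps:
            step.inform_context(self, name)
            X = step.fit(X).transform(X)
        return self


pipe = ToyPipeline([("first", ContextAwareTransformer()),
                    ("second", ContextAwareTransformer())])

# Copying before fit mimics what happens in cross-validation: the copy,
# not the original object, is the one that gets fitted.
cloned = copy.deepcopy(pipe)
cloned.fit([[1, 2]])

ctx_pipeline, ctx_name = cloned.steps[1][1].context
```

Because the context is supplied at fit time rather than at construction time, the copied transformer ends up pointing at the copied pipeline, which is the property a construction-time callback cannot offer.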

@amueller
Member Author

amueller commented Mar 6, 2017

@janvanrijn pointed out that it also means that pipelines have data-dependent parameters, which is a bit ugly. But I guess we can't get around that unless we detect whether the input is categorical using dataframes...

@amueller
Member Author

amueller commented Mar 6, 2017

I guess once we can handle strings, you could cast the input columns that should be one-hot-encoded to strings, and we could add an option not to encode integers?
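The suggestion above could look roughly like the following sketch, where the data type itself marks a column as categorical. `one_hot_strings_only` is a hypothetical helper written for illustration, not an existing encoder option.

```python
def one_hot_strings_only(rows):
    """One-hot encode columns made up entirely of strings; pass others through."""
    n_cols = len(rows[0])
    out = [[] for _ in rows]
    for j in range(n_cols):
        column = [row[j] for row in rows]
        if all(isinstance(v, str) for v in column):
            # String column: expand into one indicator per category.
            categories = sorted(set(column))
            for out_row, v in zip(out, column):
                out_row.extend(1.0 if v == c else 0.0 for c in categories)
        else:
            # Numeric column (including integers): left untouched.
            for out_row, v in zip(out, column):
                out_row.append(v)
    return out


# The first column holds integer category codes, the second is continuous.
X = [[1, 5.0], [1, 3.0], [2, 3.0]]

# Casting the first column to strings marks it as categorical.
X_marked = [[str(row[0])] + row[1:] for row in X]

encoded = one_hot_strings_only(X_marked)
untouched = one_hot_strings_only(X)
```

With this design no column index needs to be passed to the encoder at all, so the selection survives any reordering done by earlier steps, as long as those steps preserve the string type.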
