Can't provide feature indices for OneHotEncoder in pipeline #8539
Comments
Yes, sorry I've not got back to that SLEP. But it doesn't really help for
this case where a step needs information about the previous steps of a
pipeline. The best solution here is just to some names with data; but
specifying categorical_features by callback would at least make this
possible, though convoluted.
On 6 Mar 2017 8:16 am, "Andreas Mueller" wrote:
Let's say I want to apply a transformation only to some features in a
pipeline, such as imputation or one-hot-encoding (or scaling, which
currently doesn't support this).
I could provide the indices of the columns I want to transform. But if
there are any previous steps in the pipeline, they might re-arrange the
features in some arbitrary way (like OneHotEncoder does).
Example
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder

# assume the second feature is categorical and the third is continuous
X = [[np.NaN, np.NaN, 5], [np.NaN, 1, 3], [np.NaN, 1, np.NaN]]
pipe = make_pipeline(Imputer(strategy='most_frequent'), OneHotEncoder(categorical_features=[1], sparse=False))
pipe.fit_transform(X)
array([[ 0.,  1.,  1.],
       [ 1.,  0.,  1.],
       [ 1.,  0.,  1.]])
desired outcome:
array([[ 1.,  5.],
       [ 1.,  3.],
       [ 1.,  3.]])
cc @mfeurer
Conceptually somewhat related to #8480 and scikit-learn/enhancement_proposals#5 as they deal with feature meta-data.
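The index drift in the example can be reproduced without scikit-learn. The sketch below uses plain lists and a hypothetical helper, `drop_all_missing_columns`, standing in for a step that discards columns with no observed values (which appears to be what happens to the all-NaN first column above). It is only an illustration of why a positional index goes stale, not how any scikit-learn transformer is implemented.

```python
# Toy stand-ins (not scikit-learn classes): rows are plain lists,
# and None marks a missing value.

def drop_all_missing_columns(rows):
    """Drop columns where every value is missing.

    Returns the reduced rows and the original indices of the kept columns.
    """
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols) if any(row[j] is not None for row in rows)]
    reduced = [[row[j] for j in kept] for row in rows]
    return reduced, kept


X = [[None, None, 5], [None, 1, 3], [None, 1, None]]
categorical = [1]  # positional index, valid only for the ORIGINAL column order

X_reduced, kept = drop_all_missing_columns(X)

# Applying the stale index to the reduced data hits a different original column.
stale_selection = [kept[j] for j in categorical]

# Remapping through `kept` recovers the position of the intended column.
remapped = [kept.index(j) for j in categorical if j in kept]
```

Here `stale_selection` names the original column that index 1 now lands on, while `remapped` gives the position the categorical column actually occupies after the drop. A later step can only compute `remapped` if the earlier step exposes something like `kept`.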
Don't apologize, I haven't even read it yet :-/ If Pipeline uses clone, you cannot specify the callback at construction time, right?
Sorry, can't parse
some -> pass
Yes, you're right. But it's not about Pipeline using clone, since the problem is already present in the context of cross-validation, where the whole pipeline is cloned, so any object references are invalidated. The best we could do without passing around feature descriptions, then, might be for Pipeline to inform its transformers of their context.
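A rough sketch of why cloning invalidates object references follows. `ToyStep` and `toy_clone` are hypothetical stand-ins, modeled only loosely on the idea that a clone is rebuilt from constructor parameters, each of which is cloned separately; the real `sklearn.base.clone` may differ in detail.

```python
import copy


class ToyStep:
    """Hypothetical estimator-like object defined only by its constructor parameters."""

    def __init__(self, name, depends_on=None):
        self.name = name
        self.depends_on = depends_on  # reference to an earlier step, or None

    def get_params(self):
        return {"name": self.name, "depends_on": self.depends_on}


def toy_clone(obj):
    """Rebuild steps from their parameters; clone containers element by element."""
    if isinstance(obj, ToyStep):
        params = {key: toy_clone(value) for key, value in obj.get_params().items()}
        return ToyStep(**params)
    if isinstance(obj, (list, tuple)):
        return type(obj)(toy_clone(item) for item in obj)
    return copy.deepcopy(obj)


imputer = ToyStep("imputer")
encoder = ToyStep("encoder", depends_on=imputer)
steps = [imputer, encoder]
linked_before = steps[1].depends_on is steps[0]

cloned = toy_clone(steps)
# The encoder's parameter was cloned on its own, so it no longer points
# at the imputer that actually sits in the cloned list of steps.
linked_after = cloned[1].depends_on is cloned[0]
```

Under these assumptions `linked_before` holds but `linked_after` does not: the cloned encoder refers to a separate copy of the imputer that is never fitted as part of the cloned pipeline.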
@janvanrijn pointed out that it also means that pipelines have data-dependent parameters, which is a bit ugly. But I guess we can't get around that unless we detect whether input is categorical using dataframes...
I guess once we can handle strings you could cast input columns that should be one-hot-encoded to strings and we could add an option not to encode integers? |
Let's say I want to apply a transformation only to some features in a pipeline, such as imputation or one-hot-encoding (or scaling, which currently doesn't support this).
I could provide the indices of the columns I want to transform. But if there are any previous steps in the pipeline, they might re-arrange the features in some arbitrary way (like OneHotEncoder does).
Example
desired outcome:
Even if each output feature corresponds to exactly one input feature, and we knew which that was, there would be no way to specify this in OneHotEncoder. This might look constructed but is a pretty obvious use-case in which you have per-column meta-data.
The only solution I see is to carry along a column index (or column names) and allow passing that.
Given my experience of .iloc vs .loc in pandas, I'm not entirely happy with the prospect.
cc @mfeurer
Conceptually somewhat related to #8480 and scikit-learn/enhancement_proposals#5 as they deal with feature meta-data.
[and then I introduced hierarchical indices over columns into scikit-learn.... not]
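The idea of carrying column names along with the data can be sketched as follows. `NamedColumns` and `drop_all_missing` are hypothetical names invented for this illustration and are not part of scikit-learn; the point is only that a later step can look up "category" by name after an earlier step has removed a column.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NamedColumns:
    """Hypothetical container: row data plus one name per column."""

    rows: List[list]
    names: List[str]

    def select(self, wanted: Sequence[str]) -> "NamedColumns":
        """Return a new container holding only the named columns, in the given order."""
        idx = [self.names.index(name) for name in wanted]
        return NamedColumns([[row[j] for j in idx] for row in self.rows], list(wanted))

    def positions(self, wanted: Sequence[str]) -> List[int]:
        """Translate column names into current positional indices."""
        return [self.names.index(name) for name in wanted]


def drop_all_missing(data: NamedColumns) -> NamedColumns:
    """Drop columns with no observed values; names travel with the surviving columns."""
    keep = [
        name
        for j, name in enumerate(data.names)
        if any(row[j] is not None for row in data.rows)
    ]
    return data.select(keep)


data = NamedColumns(
    rows=[[None, None, 5], [None, 1, 3], [None, 1, None]],
    names=["empty", "category", "amount"],
)
reduced = drop_all_missing(data)

# The downstream step asks for the column by name, not by its old position.
categorical_positions = reduced.positions(["category"])
```

This trades positional ambiguity for a name lookup at each step, which is roughly the label-based versus position-based distinction that .loc and .iloc draw in pandas.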