[MRG] Unary encoder -- continued #12893
Conversation
- …rameter. Also added a new test case test_unary_encoder_n_values_array
- 1. Fixed test failures 2. Updated docs 3. UnaryEncoder: Changed handle_unknown to handle_greater
- Updating implementation of UnaryEncoder: parameter checking done in fit, checking the handle_greater parameter value in fit, plus new test cases and doc updates
- Removing fit_transform for UnaryEncoder and relying on the one defined in TransformerMixin
- Update handle_greater='error' error message
- Making 'warn' the default mode and updating docs
- Explaining handle_greater options
- Adding test cases to test the 'warn' option of the handle_greater parameter
- Updated warning message
- Updated feature_matrix generation function and made the test deterministic
- Making test cases clearer by using n_features and n_values; cleanup
Unary encoding
--------------

For some ordinal features, it does not necessarily make sense to use
I am not sure about how to motivate this encoder properly. Please feel free to suggest anything better.
I don't like "if the difference between ordered categories is uneven".
How about
Ordinal variables can also be represented in unary, where one feature is created for each possible value, as in one-hot encoding, but several features are active at a time (1, 3 and 5 would be encoded as row vectors [1, 0, 0, 0, 0], [1, 1, 1, 0, 0] and [1, 1, 1, 1, 1] respectively, for instance).

For generalized linear models, a one-hot encoding requires learning a weight for each value of an ordinal variable; an ordinal encoding learns a single weight for the variable, assuming a smooth (e.g. linear) relationship between the variable and the target. Unary encoding is not strictly more expressive than one-hot, but allows the model to learn the non-linear effect of increasing an ordinal variable with fewer large or non-zero parameters. For similar reasons, unary may be a worthwhile encoding for linear modelling of discretized variables.
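For illustration, a minimal NumPy sketch of the encoding described in that suggestion (assuming values in 1..5 and five output features; UnaryEncoder itself is only proposed in this PR):

```python
import numpy as np

# Unary ("thermometer") encoding: value k becomes a row with k leading ones.
values = np.array([1, 3, 5])
unary = (np.arange(5) < values[:, None]).astype(int)
print(unary)
# [[1 0 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 1]]
```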
Maybe an example of an ordinal variable would then be helpful
sklearn/preprocessing/_encoders.py
Parameters
----------
n_values : 'auto', int or array of ints, optional (default='auto')
Should this be changed to categories to be consistent with the other encoders?
Note that in the other encoders categories is a list of lists, while here it would only be a list of ints (the list of per-feature maxima), since the input matrix X is supposed to contain non-negative integers.
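To make the array semantics concrete, a hedged NumPy sketch of what n_values=[3, 5] (per-column maxima) could mean; UnaryEncoder is not an existing sklearn class, so this is illustrative only:

```python
import numpy as np

# Hypothetical n_values=[3, 5]: column 0 gets 3 unary features, column 1
# gets 5. Input must be non-negative integers below each per-column maximum.
X = np.array([[1, 4],
              [2, 0]])
n_values = [3, 5]
unary = np.hstack([(np.arange(n) < X[:, [j]]).astype(int)
                   for j, n in enumerate(n_values)])
print(unary)
# [[1 0 0 1 1 1 1 0]
#  [1 1 0 0 0 0 0 0]]
```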
I think we should switch to categories. Maybe we should support strings here.
As far as I understand the variables are supposed to be ordinal. How would we represent ordinal features with strings?
I think we should support things like

```python
enc = preprocessing.UnaryEncoder(categories=[["a", "b", "c"]])
enc.fit_transform([["a"], ["b"], ["c"]])
```
Although users can combine UnaryEncoder with OrdinalEncoder.
Yes, I think OrdinalEncoder should be used for this, since its output is precisely the input that UnaryEncoder requires. I'll add an example in the user guide.
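For example, OrdinalEncoder already produces exactly the kind of non-negative integer codes that the proposed UnaryEncoder would require:

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder maps string categories to integer codes, which is the
# input format the proposed UnaryEncoder expects.
enc = OrdinalEncoder(categories=[["a", "b", "c"]])
print(enc.fit_transform([["a"], ["b"], ["c"]]))
# [[0.]
#  [1.]
#  [2.]]
```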
So you force users to use OrdinalEncoder first? I'd argue that it's not user-friendly :)
I think OrdinalEncoder should be similar to OneHotEncoder, but the categories parameter here is actually similar to the n_values parameter we previously had in OneHotEncoder.
How about we make categories exactly the same as in OrdinalEncoder, and rework the implementation of UnaryEncoder to first use an OrdinalEncoder under the hood, in order to have integer input (as currently required)?
Basically: make UnaryEncoder work exactly like make_pipeline(OrdinalEncoder(categories), CurrentUnaryEncoder()).
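A rough sketch of that equivalence; since UnaryEncoder is not merged, the unary step is emulated here with a FunctionTransformer so the pipeline actually runs (to_unary is a hypothetical helper):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

def to_unary(X, n_levels):
    # Integer code k -> k leading ones, using n_levels[j] - 1 columns per feature.
    X = np.asarray(X, dtype=int)
    return np.hstack([(np.arange(n - 1) < X[:, [j]]).astype(int)
                      for j, n in enumerate(n_levels)])

# Stand-in for make_pipeline(OrdinalEncoder(categories), UnaryEncoder()).
enc = make_pipeline(
    OrdinalEncoder(categories=[["low", "mid", "high"]]),
    FunctionTransformer(to_unary, kw_args={"n_levels": [3]}),
)
print(enc.fit_transform([["low"], ["mid"], ["high"]]))
# [[0 0]
#  [1 0]
#  [1 1]]
```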
> Basically: make UnaryEncoder work exactly like make_pipeline(OrdinalEncoder(categories), CurrentUnaryEncoder())
This is exactly what I thought, I'll vote +1.
Ok I think this is a good idea too, just want input from @jnothman before reworking this
Yes, I'm happy with making this have input the same as OrdinalEncoder. But I also don't require it for this to be merged.
sparse : boolean, optional (default=False)
    Will return a sparse matrix if set to True, else will return an array.

handle_greater : str, 'warn', 'error' or 'clip', optional (default='warn')
The default here is warn while for other encoders it's error. See https://github.com/scikit-learn/scikit-learn/pull/8652/files#r161743736.
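For instance, a small NumPy sketch of what handle_greater='clip' would do (a sketch under the PR's proposed semantics, not existing sklearn behaviour):

```python
import numpy as np

# With handle_greater='clip', values above max_value are clipped, so any
# unseen larger value maps to the all-ones row instead of raising an error.
max_value = 5
X = np.array([2, 5, 7])               # 7 exceeds max_value
unary = (np.arange(max_value) < np.minimum(X, max_value)[:, None]).astype(int)
print(unary)
# [[1 1 0 0 0]
#  [1 1 1 1 1]
#  [1 1 1 1 1]]
```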
I mean that any value unseen in training between 1 and max will have a feature generated for it, but it will be constant in training, hence obsolescent.
So you want to determine the category values with …?
This seems unavoidable if we keep the current API design (we only have the parameter max_value and the attribute max_value_)?
It's a bit of a tricky decision, huh? Using unique values would certainly make it a more well-behaved transformer in a pipeline context.
I think the problem is how to design the API. What's the solution in your mind? @jnothman
I think the constant features are more likely when dealing with count-valued data than categorical. Sorry I don't have a clear design in mind yet and in general am not being very constructive here and have pushed the conversation around in circles. We are interested in the following use cases:
I'm guessing that 3 will be the most common use case, with 1 and 2 following. In all cases, it's probably best to only output features for values either seen in training or explicitly requested by the user. This adds some code complexity but at least avoids generating useless features. For 1 and 4 we need to consider the input as numeric and output the value corresponding to the nearest bin edge below the input. 3 can be achieved directly by supporting 1. So maybe what we really want is to support arbitrary numeric input with bin_edges, and require pipelining for case 2. Current behaviour could be more-or-less accomplished with …
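As an aside, the "nearest bin edge below the input" mapping for numeric data is easy to sketch with np.digitize (bin_edges here is an assumed, illustrative parameter, not part of this PR):

```python
import numpy as np

# Map continuous values to the index of the nearest bin edge below them;
# the resulting integer codes could then feed a unary encoding step.
bin_edges = np.array([0.0, 1.5, 3.0, 4.5])
x = np.array([0.2, 2.9, 5.0])
codes = np.digitize(x, bin_edges) - 1
print(codes)  # [0 1 3]
```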
I think the encoder should focus on encoding data. It's not the duty of an encoder to preprocess data. We can have a discretizer to bin continuous data according to bin_edges specified by users (or somehow add this feature to KBinsDiscretizer), but I guess we don't need to consider it here. I think the main issue here is how to determine the categories.
Agreed with @qinhanmin2014 that we should restrict the scope of this encoder; anything bin-related should be … Supporting so many different scenarios means that the encoder must accept a restricted type of data and behave in a simple, predictable way. What happens before and after should be delegated to other tools (again, like in Unix). We can use pipelines for that; it's literally what they're designed for.

It seems that so far every single option we have proposed was dismissed. I guess that means it is impossible to design a perfect encoder here.

My proposition is to keep it exactly as it is.
EDIT: again, I'm not familiar with the encoder and less acquainted with the whole sklearn tooling, so I'll yield to anything you'd deem more relevant.
Well, a lot of estimators (including things like feature_importances_ and L2 regularisation) are designed in a way that assumes or prefers no perfectly collinear or redundant features. So when features are counts it's not great to encode them like that.
But I'm okay with that design... we just need to be sure it handles that use case reasonably well.
For redundant features, at least users can get rid of them with …

I'm OK with the API design of the current PR, though I still feel it's a little strange to clip feature > max_value and reject feature < 0. @jorisvandenbossche any opinion about the current API design? Personally I'd like to see more consensus before reviewing it.
@jorisvandenbossche, @NicolasHug: shall we set a meeting time to find consensus on this?
The main issues here IMO are: …

After the long discussion here, I still prefer to support strings and always reject unknown categories. This will introduce an encoder which is similar to …
@jorisvandenbossche, we could call it Base1Encoder to distinguish from other meanings of "unary"??
The what's new entry will need moving to 0.22... if we're sure we'll ever resolve the disagreements / find adequate need for this feature.
I added it to the meeting notes. If we have time, maybe we can briefly discuss whether we want this feature.
I think this is not useful for tree-based models, right? And is it useful for other models?
Encoders are not really useful for tree-based models.
We still need OneHotEncoder to handle strings? By saying "not useful", I mean that when training tree-based models, we can use OrdinalEncoder instead of UnaryEncoder.
> We still need OneHotEncoder to handle strings?

For now ;)
It might be a good idea to make this encoder available in the discretizer, but not available as a separate public API until we're happy with its use cases and API.
This seems acceptable, but I don't think it's a good solution (so +0 from my side).
I don't see how this encoding is appropriate for unordered categorical features.
I mean, if we think this encoder is reasonable, maybe we should support both (ordered) categorical features and continuous features.
I'm not a huge fan of the discretizer also doing encoding. It's convenient until the point where you want a more flexible encoding, e.g. dropping a column in the OneHotEncoder. Why don't we encourage users to use pipelines more?
I'm also +1 to rely more on pipelines (e.g., do not let estimators do encoding, do not let encoders do feature selection), especially when we can improve the performance of our implementation by making some assumptions.
I don't (fore)see this moving forward, so closing.
Reference Issues/PRs
Closes #8652
What does this implement/fix? Explain your changes.
This PR finishes the work in #8652 that implements a unary encoder.
Changes from #8652:
- Removed the ordinal_features parameter, and hence the calls to transform_selected
- Added inverse_transform
I haven't changed any of the existing tests, nor added new ones (except for inverse_transform), considering that the original PR was already accepted by @jnothman.
Any other comments?
I have added some review/comments where I need additional input from reviewers.