[MRG] Unary encoder -- continued #12893
Conversation
- …rameter. Also added a new test case test_unary_encoder_n_values_array
- 1. Fixed test failures 2. Updated docs 3. UnaryEncoder: Changed handle_unknown to handle_greater
- Updating implementation of UnaryEncoder: parameter checking done in fit, checking the handle_greater parameter value in fit, plus new test cases and doc updates
- Removing fit_transform for UnaryEncoder and relying on the one defined in TransformerMixin
- Update handle_greater='error' error message
- Making 'warn' the default mode and updating docs
- Explaining handle_greater options
- Adding test cases to test the 'warn' option of the handle_greater parameter
- Updated warning message
- Updated feature_matrix generation function and made the test deterministic
- Making test cases clearer by using n_features and n_values; cleanup
Unary encoding
--------------

For some ordinal features, it does not necessarily make sense to use
I am not sure about how to motivate this encoder properly. Please feel free to suggest anything better.
I don't like "if the difference between ordered categories is uneven".
How about
Ordinal variables can also be represented in unary, where one feature is created for each possible value, as in one-hot encoding, but several features are active at a time (1, 3 and 5 would be encoded as row vectors [1, 0, 0, 0, 0], [1, 1, 1, 0, 0] and [1, 1, 1, 1, 1] respectively, for instance).

For generalized linear models, a one-hot encoding requires learning a weight for each value of an ordinal variable; an ordinal encoding learns a single weight for the variable, assuming a smooth (e.g. linear) relationship between the variable and the target. Unary encoding is not strictly more expressive than one-hot, but allows the model to learn the non-linear effect of increasing an ordinal variable with fewer large or non-zero parameters. For similar reasons, unary may be a worthwhile encoding for linear modelling of discretized variables.
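For illustration, a minimal NumPy sketch of the encoding described in that suggestion (assuming values in 1..5 and five output features; UnaryEncoder itself is only proposed in this PR):

```python
import numpy as np

# Unary ("thermometer") encoding: value k becomes a row with k leading ones.
values = np.array([1, 3, 5])
unary = (np.arange(5) < values[:, None]).astype(int)
print(unary)
# [[1 0 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 1]]
```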
Maybe an example of an ordinal variable would then be helpful
sklearn/preprocessing/_encoders.py
Parameters
----------
n_values : 'auto', int or array of ints, optional (default='auto')
Should this be changed to categories to be consistent with the other encoders?
Note that in the other encoders categories is a list of lists, while here it would only be a list of ints (the list of per-feature maxima), since the input matrix X is supposed to contain non-negative integers.
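To make the array semantics concrete, a hedged NumPy sketch of what n_values=[3, 5] (per-column maxima) could mean; UnaryEncoder is not an existing sklearn class, so this is illustrative only:

```python
import numpy as np

# Hypothetical n_values=[3, 5]: column 0 gets 3 unary features, column 1
# gets 5. Input must be non-negative integers below each per-column maximum.
X = np.array([[1, 4],
              [2, 0]])
n_values = [3, 5]
unary = np.hstack([(np.arange(n) < X[:, [j]]).astype(int)
                   for j, n in enumerate(n_values)])
print(unary)
# [[1 0 0 1 1 1 1 0]
#  [1 1 0 0 0 0 0 0]]
```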
I think we should switch to categories. Maybe we should support strings here.
As far as I understand the variables are supposed to be ordinal. How would we represent ordinal features with strings?
I think we should support things like

```python
enc = preprocessing.UnaryEncoder(categories=[["a", "b", "c"]])
enc.fit_transform([["a"], ["b"], ["c"]])
```
Although users can combine UnaryEncoder with OrdinalEncoder.
Yes, I think OrdinalEncoder should be used for this, since its output is precisely the input that UnaryEncoder requires. I'll add an example in the user guide.
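For example, OrdinalEncoder already produces exactly the kind of non-negative integer codes that the proposed UnaryEncoder would require:

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder maps string categories to integer codes, which is the
# input format the proposed UnaryEncoder expects.
enc = OrdinalEncoder(categories=[["a", "b", "c"]])
print(enc.fit_transform([["a"], ["b"], ["c"]]))
# [[0.]
#  [1.]
#  [2.]]
```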
So you force users to use OrdinalEncoder first? I'd argue that it's not user-friendly :)
I think OrdinalEncoder should be similar to OneHotEncoder, but the categories parameter here is actually similar to the n_values parameter we previously had in OneHotEncoder.
How about we make categories exactly the same as in OrdinalEncoder, and rework the implementation of UnaryEncoder to first use an OrdinalEncoder under the hood, in order to have integer input (as currently required)?
Basically: make UnaryEncoder work exactly like make_pipeline(OrdinalEncoder(categories), CurrentUnaryEncoder()).
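A rough sketch of that equivalence; since UnaryEncoder is not merged, the unary step is emulated here with a FunctionTransformer so the pipeline actually runs (to_unary is a hypothetical helper):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

def to_unary(X, n_levels):
    # Integer code k -> k leading ones, using n_levels[j] - 1 columns per feature.
    X = np.asarray(X, dtype=int)
    return np.hstack([(np.arange(n - 1) < X[:, [j]]).astype(int)
                      for j, n in enumerate(n_levels)])

# Stand-in for make_pipeline(OrdinalEncoder(categories), UnaryEncoder()).
enc = make_pipeline(
    OrdinalEncoder(categories=[["low", "mid", "high"]]),
    FunctionTransformer(to_unary, kw_args={"n_levels": [3]}),
)
print(enc.fit_transform([["low"], ["mid"], ["high"]]))
# [[0 0]
#  [1 0]
#  [1 1]]
```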
> Basically: make UnaryEncoder work exactly like make_pipeline(OrdinalEncoder(categories), CurrentUnaryEncoder())
This is exactly what I thought, I'll vote +1.
Ok I think this is a good idea too, just want input from @jnothman before reworking this
Yes, I'm happy with making this have input the same as OrdinalEncoder. But I also don't require it for this to be merged.
sparse : boolean, optional (default=False)
    Will return a sparse matrix if set to True, else will return an array.

handle_greater : str, 'warn', 'error' or 'clip', optional (default='warn')
The default here is warn while for other encoders it's error. See https://github.com/scikit-learn/scikit-learn/pull/8652/files#r161743736.
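For instance, a small NumPy sketch of what handle_greater='clip' would do (a sketch under the PR's proposed semantics, not existing sklearn behaviour):

```python
import numpy as np

# With handle_greater='clip', values above max_value are clipped, so any
# unseen larger value maps to the all-ones row instead of raising an error.
max_value = 5
X = np.array([2, 5, 7])               # 7 exceeds max_value
unary = (np.arange(max_value) < np.minimum(X, max_value)[:, None]).astype(int)
print(unary)
# [[1 1 0 0 0]
#  [1 1 1 1 1]
#  [1 1 1 1 1]]
```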
I mean that any value unseen in training between 1 and max will have a feature generated for it, but it will be constant in training, hence obsolescent.
So you want to determine the category values with …?
This seems unavoidable if we keep the current API design (we only have the parameter max_value and the attribute max_value_)?
It's a bit of a tricky decision, huh? Using unique values would certainly make it a more well-behaved transformer in a pipeline context.
I think the problem is how to design the API. What's the solution in your mind? @jnothman
I think the constant features are more likely when dealing with count-valued data than categorical. Sorry I don't have a clear design in mind yet and in general am not being very constructive here and have pushed the conversation around in circles. We are interested in the following use cases:
I'm guessing that 3 will be the most common use case, with 1 and 2 following. In all cases, it's probably best to only output features for values either seen in training or explicitly requested by the user. This adds some code complexity but at least avoids generating useless features. For 1 and 4 we need to consider the input as numeric and output the value corresponding to the nearest bin edge below the input. 3 can be achieved directly by supporting 1. So maybe what we really want is to support arbitrary numeric input with bin_edges, and require pipelining for case 2. Current behaviour could be more-or-less accomplished with …
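As an aside, the "nearest bin edge below the input" mapping for numeric data is easy to sketch with np.digitize (bin_edges here is an assumed, illustrative parameter, not part of this PR):

```python
import numpy as np

# Map continuous values to the index of the nearest bin edge below them;
# the resulting integer codes could then feed a unary encoding step.
bin_edges = np.array([0.0, 1.5, 3.0, 4.5])
x = np.array([0.2, 2.9, 5.0])
codes = np.digitize(x, bin_edges) - 1
print(codes)  # [0 1 3]
```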
I think the encoder should focus on encoding data. It's not the duty of an encoder to preprocess data. We can have a discretizer to bin continuous data according to bin_edges specified by users (or somehow add this feature to KBinsDiscretizer), but I guess we don't need to consider it here. I think the main issue here is how to determine the categories.
Agreed with @qinhanmin2014 that we should restrict the scope of this encoder; anything bin-related should be … Supporting so many different scenarios means that the encoder must accept a restricted type of data and behave in a simple, predictable way. What happens before and after should be delegated to other tools (again, like in Unix). We can use pipelines for that; it's literally what they're designed for.

It seems that so far every single option we have proposed was dismissed. I guess that means it is impossible to design a perfect encoder here.

My proposition is to keep it exactly as it is.
EDIT: again, I'm not familiar with the encoder and less acquainted with the whole sklearn tooling, so I'll yield to anything you'd deem more relevant.
Well, a lot of estimators (including things like feature_importances_ and L2 regularisation) are designed in a way that assumes or prefers no perfectly collinear or redundant features. So when features are counts it's not great to encode them like that.
But I'm okay with that design... we just need to be sure it handles that use case reasonably well.
For redundant features, at least users can get rid of them with …

I'm OK with the API design of the current PR, though I still feel it's a little strange to clip feature > max_value and reject feature < 0. @jorisvandenbossche any opinion about the current API design? Personally I'd like to see more consensus before reviewing it.
@jorisvandenbossche, @NicolasHug: shall we set a meeting time to find consensus on this?
The main issues here IMO are: …

After the long discussion here, I still prefer to support strings and always reject unknown categories. This will introduce an encoder which is similar to …
@jorisvandenbossche, we could call it Base1Encoder to distinguish from other meanings of "unary"??
The what's new entry will need moving to 0.22... if we're sure we'll ever resolve the disagreements / find adequate need for this feature.
I added it to the meeting notes. If we have time, maybe we can briefly discuss whether we want this feature.
I think this is not useful for tree-based models, right? And is it useful for other models?
Encoders are not really useful for tree-based models.
We still need OneHotEncoder to handle strings? By saying "not useful", I mean that when training tree-based models, we can use OrdinalEncoder instead of UnaryEncoder.
> We still need OneHotEncoder to handle strings?

For now ;)
It might be a good idea to make this encoder available in the discretizer, but not available as a separate public API until we're happy with its use cases and API.
This seems acceptable, but I don't think it's a good solution (so +0 from my side).
I don't see how this encoding is appropriate for unordered categorical features.
I mean, if we think this encoder is reasonable, maybe we should support both (ordered) categorical features and continuous features.
I'm not a huge fan of the discretizer also doing encoding. It's convenient until the point where you want a more flexible encoding, e.g. dropping a column in the OneHotEncoder. Why don't we encourage users to use pipelines more?
I'm also +1 to rely more on pipelines (e.g., do not let estimators do encoding, do not let encoders do feature selection), especially when we can improve the performance of our implementation by making some assumptions.
I don't (fore)see this moving forward, so closing.
Reference Issues/PRs
Closes #8652
What does this implement/fix? Explain your changes.
This PR finishes the work in #8652 that implements a unary encoder.
Changes from #8652:
- Removed the ordinal_features parameter, and hence the calls to transform_selected
- Added inverse_transform
I haven't changed any of the existing tests, nor added new ones (except for inverse_transform), considering that the original PR was already accepted by @jnothman.
Any other comments?
I have added some review/comments where I need additional input from reviewers.