Global normalization and sparse matrix support for MinMaxScaler #1799
Conversation
awesome, thanks :)
You didn't add the updated .c file ;)
yep, I didn't want to clutter the commit log at this stage. I'll add some tests first and then deal with this whiny Travis guy.
cdef unsigned int ind
cdef double v

# means[j] contains the minimum of feature j
wrong comment :)
Please feel free to check in the .c file as well. The diff clutter can be controlled by: https://github.com/scikit-learn/scikit-learn/blob/master/.gitattributes
This looks like a great contrib @temporaer. Looking forward to the tests and some documentation updates in the preprocessing section.
data_range = data_max - data_min
else:
# TODO why would anyone want to force a copy here? (this was here
# before, I believe)
I don't think the copy is necessary. It's only relevant in the transform method.
@temporaer could you please squash the commits? (because that's something we do now ^^)
uh... squash /all/ of them into one? Or just clean up a little?
Preferably everything. That keeps the history slim, makes reverting easy, prevents broken states in the history, and hides our mistakes from the users ;)
Thanks. +1 for merge :)
Set to False to normalize using the minimum/maximum of all
features, not per column

assume_contains_zeros : boolean, optional, default is False
I'm sorry, I really don't understand this explanation.
How about: "If True, normalize all columns separately. Set to False to normalize using minimum and maximum of X." Maybe also rename to `per_column`?
I wasn't referring to `per_feature`, that's fine as it is. I don't understand the behavior of `assume_contains_zeros`.
Hm OK, there was only one "explanation" in the excerpt that github selected :-)
How about: `MinMaxScaler` does not work for sparse matrices when the additive component of the normalization is non-zero. When normalizing to the range [0, 1], this implies that the minimum in every column of X must be 0. Setting `assume_contains_zeros=True` enforces this constraint even if some columns of X are strictly positive.
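To make the constraint concrete, here is a minimal NumPy sketch (illustrative, not the PR's actual code) of the min-max transform and why the additive term densifies sparse input:

```python
# Minimal sketch: every implicit zero of a sparse matrix would map to
# `offset`, which is only zero when the column minimum is already 0.
import numpy as np

X = np.array([[1.0, 0.0],
              [3.0, 2.0]])            # column 0 is strictly positive

data_min = X.min(axis=0)              # [1., 0.]
data_max = X.max(axis=0)              # [3., 2.]
scale = 1.0 / (data_max - data_min)   # [0.5, 0.5]
offset = -data_min * scale            # [-0.5, 0.] -- nonzero for column 0

# X * scale + offset maps each column onto [0, 1], but any zero entry in
# column 0 would become -0.5, so a sparse result would turn dense.
print(X * scale + offset)
```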
@larsmans ok for merge?
The code looks alright, but I still don't understand the `assume_contains_zeros` semantics.
I agree. I'm also not sure what it is we actually want in the sparse case. So maybe we want an option to make the centering optional. If the input is strictly positive, scaling alone keeps it between 0 and the maximum. In all other cases, something a bit weird would happen, though.
Making centering optional sounds good to me, it is somewhat more intuitive than "injecting" artificial zeros. We'd still need to explain the reasoning why the user would want to change the default behavior, and there the complexity would be unchanged. I have no preference here and I'm happy to implement what you API guys decide :-)
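For illustration, a hedged sketch of the "scaling only" behavior under discussion; `scale_only` is a hypothetical helper (not API from this PR), and it assumes non-negative input:

```python
# Divide each column by its maximum so zeros stay zero and sparsity is
# preserved; non-negative data ends up in [0, 1].
import numpy as np
import scipy.sparse as sp

def scale_only(X):
    X = X.tocsc(copy=True)
    col_max = np.asarray(X.max(axis=0).todense()).ravel()
    col_max[col_max == 0] = 1.0                    # avoid division by zero
    # Scale the stored entries of each column by that column's maximum.
    X.data /= np.repeat(col_max, np.diff(X.indptr))
    return X

X = sp.csc_matrix([[1.0, 0.0],
                   [3.0, 2.0]])
print(scale_only(X).toarray())                     # zeros remain zero
```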
Any opinion @larsmans?
I think
I wouldn't call it
It's time to get this merged. @amueller, what name do you suggest?
I would have called it
Hope it's all right now...
Could you please rebase on top of master (or merge master into your branch) and fix any conflicts so as to make github happy?
@ogrisel rebase done, travis happy ;-)
where differences between features with a small range will influence
distances, and thus classification, less). You may want to keep this
relative importance information and still rescale your features to a
sensible range. This can be done by setting ``per_feature``::
=> "This can be done by setting `per_feature` to `False`:"
with_offset : boolean, optional, default is True for dense,
    and False for sparse X.
    If False, no additive normalization is used, only scaling.
Could you rephrase this to make it more explicit? What is "additive normalization"?
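For what it's worth, "additive normalization" here appears to mean the offset term in `X_scaled = X * scale + offset`; a small sketch of the two behaviors (the `with_offset` name comes from this PR, not released API):

```python
import numpy as np

X = np.array([[1.0], [3.0]])
scale = 1.0 / (X.max(axis=0) - X.min(axis=0))   # 0.5
offset = -X.min(axis=0) * scale                 # -0.5

print(X * scale + offset)   # with_offset=True:  [[0.], [1.]]
print(X * scale)            # with_offset=False: [[0.5], [1.5]]; zeros would stay zero
```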
Apart from the lack of documentation for the `with_offset` parameter.
@ogrisel we discussed the
Still, better documentation (with an example) is required because right now I am not sure what it actually does.
@temporaer do you plan to finish the doc work on this PR, or would you like someone else to take over the PR from there?
@ogrisel when I tried last, it turned out to be more than just doc work. I'll try to produce something discussable in the next few days.
Thanks!
Hi there! I'm working on something similar over in #2514, and I think we might end up duplicating effort (I was in fact already looking into merging your PR into mine) :) Maybe we should hash something out together before we both waste time? Where are you currently stuck/having problems?
Tonight's favoured (by me) solution to the semantic problems: if the data is sparse and the offset is non-zero, raise a ValueError. That means that on the off-chance that a sparse array doesn't contain a zero (and it is scaled between, say, 0 and 1), it will bail.
Pro:
Con:
The second-best option: make MinMaxScaler densify if it has to, and add an additional estimator that only scales data.
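A minimal sketch of the proposed check (hypothetical helper name, mirroring the first option above rather than any actual scikit-learn code):

```python
import numpy as np
import scipy.sparse as sp

def check_sparse_offset(X, offset):
    """Refuse sparse input whenever the fitted offset is nonzero,
    since applying the offset would densify the matrix."""
    if sp.issparse(X) and np.any(offset != 0):
        raise ValueError(
            "Cannot apply a nonzero offset to sparse data: the result "
            "would be dense. Use scaling without an additive term, or "
            "densify the input explicitly.")
```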
OK, even better (though deprecation-cycle-inducing) idea: remove
I personally was always wondering why the option to add something to the data existed; I figured someone must've had a use case for it. What I came up with last night was making the behavior dependent on the
When I first implemented this, I had something quite like Andy's solution. With one addition, however: I had a parameter
Pro:
Con:
Different behavior on sparse and dense has turned out to be very confusing for users. I've been trying to get rid of that everywhere I can.
The
Does anyone have a use case that needs the current functionality of
Superseded by #4828. |
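For readers landing here later: the scaler that eventually shipped for this use case is `MaxAbsScaler`, which divides each feature by its maximum absolute value and needs no additive term, so sparsity is preserved:

```python
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X = sparse.csr_matrix([[1.0, 0.0],
                       [3.0, 2.0]])
X_scaled = MaxAbsScaler().fit_transform(X)   # sparse in, sparse out
print(X_scaled.toarray())                    # [[1/3, 0.], [1., 1.]]
```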
Tests working & docs in place; needs feedback, and merge if OK.