Global normalization and sparse matrix support for MinMaxScaler #1799


Closed · wants to merge 2 commits

Conversation

temporaer (Contributor)

tests working & doc in place, needs feedback and merge if OK

@amueller (Member)

awesome, thanks :)
Could it be that you forgot to add some files / modify the setup.py? Travis is complaining.

@amueller (Member)

You didn't add the updated .c file ;)

@temporaer (Contributor, Author)

yep, I didn't want to clutter the commit log at this stage. I'll add some tests first and then take care of this whiny Travis guy.

cdef unsigned int ind
cdef double v

# means[j] contains the minimum of feature j
Review comment (Member):
wrong comment :)

@ogrisel (Member) commented Mar 22, 2013

> yep, I didn't want to clutter the commit log at this stage.

Please feel free to check in the .c file as well. The diff clutter can be controlled by:

https://github.com/scikit-learn/scikit-learn/blob/master/.gitattributes
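
The mechanism, for reference: files marked with the -diff attribute are treated as binary in diffs, so generated C files stop flooding the review. An illustrative entry (the exact path in the repo's .gitattributes may differ):

# .gitattributes
*.c -diff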

@ogrisel (Member) commented Mar 22, 2013

This looks like a great contrib, @temporaer. Looking forward to the tests and some documentation updates in the preprocessing section.

data_range = data_max - data_min
else:
# TODO why would anyone want to force a copy here? (this was here
# before, I believe)
Review comment (Member):
I don't think the copy is necessary. It's only relevant in the transform method.

@amueller (Member) commented Apr 2, 2013

@temporaer could you please squash the commits? (because that's something we do now ^^)

@temporaer (Contributor, Author)

uh... squash /all/ of them into one? Or just clean up a little?

@larsmans (Member) commented Apr 2, 2013

Preferably everything. That keeps the history slim, makes reverting easy, prevents broken states in the history, and hides our mistakes from the users ;)
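
For reference, squashing everything down to one commit typically looks like this (branch names are placeholders):

git rebase -i upstream/master     # mark all but the first commit as "squash"
git push --force origin my-feature-branch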

@amueller (Member) commented Apr 2, 2013

Thanks. +1 for merge :)

Set to False to normalize using the minimum/maximum of all
features, not per column

assume_contains_zeros : boolean, optional, default is False
Review comment (Member):
I'm sorry, I really don't understand this explanation.

@temporaer (Contributor, Author):
How about: "If True, normalize all columns separately. Set to False to normalize using the minimum and maximum of X." Maybe also rename it to per_column?
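
For illustration, a minimal NumPy sketch of the two behaviors (the per_feature name follows the proposal above; it is not existing API):

import numpy as np

X = np.array([[1., 10.],
              [2., 20.],
              [3., 30.]])

# per_feature=True: every column is scaled with its own min/max
per_col = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# -> both columns span [0, 1]

# per_feature=False: one global min/max for the whole matrix
global_ = (X - X.min()) / (X.max() - X.min())
# -> column 0 spans [0, ~0.07], column 1 spans [~0.31, 1];
#    the relative magnitudes of the features are preserved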

Review comment (Member):
I wasn't referring to per_feature, that's fine as it is. I don't understand the behavior of assume_contains_zeros.

@temporaer (Contributor, Author):

Hm OK, there was only one "explanation" in the excerpt that GitHub selected :-)

How about:
MinMaxScaler does not work for sparse matrices when the additive component of the normalization is non-zero. When normalizing to the range of [0,1], this implies that the minimum in every column of X must be 0. Setting assume_contains_zeros=True enforces this constraint even if some columns of X are strictly positive.
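
A rough sketch of the problem, assuming the usual min-max formula X_scaled = (X - data_min) * scale (illustrative only, not the PR's actual code):

import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([[3., 0.],
                   [5., 2.],
                   [4., 0.]])

data_min = X.toarray().min(axis=0)   # [3., 0.]
data_max = X.toarray().max(axis=0)   # [5., 2.]
scale = 1.0 / (data_max - data_min)

# The multiplicative part alone preserves sparsity:
X_mul = X.multiply(scale)            # implicit zeros stay zero

# The additive part, -data_min * scale, is non-zero for column 0
# (its minimum is 3, not 0), so applying it would turn every implicit
# zero into a stored value, i.e. densify the matrix.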

@amueller (Member) commented May 5, 2013

@larsmans ok for merge?

@larsmans (Member) commented May 5, 2013

The code looks alright, but I still don't understand the assume_contains_zeros option, and I want the API to be clear before merging this.

@amueller (Member) commented May 5, 2013

I agree. I'm also not sure what it is we actually want in the sparse case.
I think the default option should keep sparsity structure, as this is how we did it in other cases afaik (RandomizedPCA for example), and the user needs to be explicit to destroy sparsity structure.

So maybe we want an with_offset option which is True for dense and False for sparse input.

If the input is strictly positive, between 0 and x, and the given range is [0, a], then the right thing will happen.
If the input is between -x and x and the range is [-a, a], the right thing will also happen.

In all other cases something a bit weird would happen, though: if the input is between -x and x and you ask for [0, a], you would get [0, a/2], I think... well...
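
A sketch of those cases with a scale-only transform (hypothetical with_offset=False behavior, plain NumPy):

import numpy as np

def scale_only(X, feature_range):
    # rescale by the range ratio, never shift
    lo, hi = feature_range
    scale = (hi - lo) / (X.max(axis=0) - X.min(axis=0))
    return X * scale

print(scale_only(np.array([[0.], [2.]]), (0., 1.)))    # [0, x] -> [0, 1]: fine
print(scale_only(np.array([[-2.], [2.]]), (-1., 1.)))  # [-x, x] -> [-1, 1]: fine
print(scale_only(np.array([[-2.], [2.]]), (0., 1.)))   # -> [-0.5, 0.5], not [0, 1]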

@temporaer (Contributor, Author)

Making centering optional sounds good to me; it is somewhat more intuitive than "injecting" artificial zeros. We'd still need to explain why the user would want to change the default behavior, and there the complexity would be unchanged. I have no preference here and I'm happy to implement what you API guys decide :-)


@amueller (Member) commented May 6, 2013

Any opinion @larsmans ?

@larsmans (Member) commented May 6, 2013

I think center="auto" with a default value of issparse(X) and an exception for the combination of True and dense arrays would be fine.

@amueller (Member) commented May 6, 2013

I wouldn't call it 'center'. And you probably mean "sparse" + True is an exception, right?

@larsmans (Member)

It's time to get this merged. @amueller, what name do you suggest?

@amueller (Member)

I would have called it with_offset (similar to with_mean).

@temporaer (Contributor, Author)

Hope it's all right now...

@ogrisel (Member) commented Jul 9, 2013

Could you please rebase on top of master (or merge master into your branch) and fix any conflicts so as to make GitHub happy?

@temporaer (Contributor, Author)

@ogrisel rebase done, travis happy ;-)

where differences between features with a small range will influence
distances, and thus classification, less). You may want to keep this
relative importance information and still rescale your features to a
sensible range. This can be done by setting ``per_feature``::
Review comment (Member):
=> "This can be done by setting per_feature to False:"


with_offset : boolean, optional, default is True for dense,
and False for sparse X.
If False, no additive normalization is used, only scaling.
Review comment (Member):
Could you rephrase this to make it more explicit? What is "additive transformation"?

@ogrisel (Member) commented Jul 10, 2013

Apart from the lack of documentation for the with_offset option, it looks good. I am not sure about the behavior being different for sparse matrices and dense arrays, though. A naive user would likely expect the default behavior to be the same (as long as min_ == 0).

@temporaer (Contributor, Author)

@ogrisel we discussed the with_offset behavior above. Summary: my first implementation kept the behavior the same for dense/sparse. The complicating issue is that sparse matrices include implicit zeros, which need to be taken into account. You could then argue that sparse matrices are /meant/ to have zeros, so zero should always be part of the input range. A sparse matrix which does not have zeros would then behave differently from its dense counterpart. The original implementation had a parameter assume_contains_zeros, with default True for sparse matrices. That was too complicated to document and non-intuitive for naive users. I think the solution as it is now (by default, only use scaling for sparse) is the lesser of the two evils.
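
For concreteness, the implicit-zeros issue (illustrative only):

import scipy.sparse as sp

# Two stored values in a 3x1 column; the third entry is an implicit zero.
X = sp.csr_matrix([[3.], [5.], [0.]])

print(X.data.min())              # 3.0 -- minimum over stored values only
print(X.toarray().min(axis=0))   # [0.] -- minimum once implicit zeros count

# Whether the feature's range is taken as [3, 5] or [0, 5] changes the
# scaling, which is what assume_contains_zeros was meant to pin down.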

@ogrisel (Member) commented Jul 10, 2013

Still, better documentation (with an example) is required, because right now I am not sure what it actually does.

@ogrisel (Member) commented Oct 21, 2013

@temporaer do you plan to finish the doc work on this PR or would you like someone else to take over the PR from there?

@temporaer (Contributor, Author)

@ogrisel when I tried last, it turned out to be more than just doc work. I'll try to produce something discussable in the next few days.

@ogrisel (Member) commented Oct 21, 2013

Thanks!

@untom (Contributor) commented Oct 21, 2013

Hi there!

I'm working on something similar over in #2514, and I think we might end up duplicating effort (I was in fact already looking into merging your PR into mine) :) Maybe we should hash something out together before we both waste time?

Where are you currently stuck/having problems?

@amueller (Member)

Tonight's favoured (by me) solution to the semantic problems: if the data is sparse and the offset is non-zero, raise a ValueError. That means that on the off chance that a sparse array doesn't contain a zero (and it is scaled between, say, 0 and 1), it will bail.

Pro:

  • when it computes something, it is consistent with the dense case
  • It never makes a matrix dense
  • If it computes something, the result is very natural

Con:

  • it might fail during a GridSearch.

@larsmans @ogrisel wdyt?

The second-best option: make MinMaxScaler densify if it has to, and make an additional estimator that only scales data.
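
A sketch of the first proposal (hypothetical names; a real implementation would compute the sparse minimum without densifying):

import numpy as np
import scipy.sparse as sp

def fit_minmax(X, feature_range=(0.0, 1.0)):
    lo, hi = feature_range
    dense = X.toarray() if sp.issparse(X) else X
    data_min, data_max = dense.min(axis=0), dense.max(axis=0)
    scale = (hi - lo) / (data_max - data_min)
    offset = lo - data_min * scale
    if sp.issparse(X) and np.any(offset != 0):
        # a non-zero offset would densify the matrix: bail
        raise ValueError("cannot min-max scale this sparse input "
                         "without destroying sparsity")
    return scale, offset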

@amueller (Member)

OK, an even better (though deprecation-cycle-inducing) idea: remove MinMaxScaler, as no one needs that functionality anyway. Instead, introduce MaxAbsScaler, which scales the data to have a given maximum absolute value and never adds anything to the data.
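
A minimal sketch of that idea (not the eventual scikit-learn implementation; assumes no all-zero column):

import numpy as np
import scipy.sparse as sp

def max_abs_scale(X, max_abs=1.0):
    # purely multiplicative per feature, so implicit zeros stay zero
    dense = X.toarray() if sp.issparse(X) else X
    scale = max_abs / np.abs(dense).max(axis=0)
    return X.multiply(scale).tocsr() if sp.issparse(X) else X * scale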

@untom (Contributor) commented Oct 22, 2013

I personally was always wondering why the option to add something to the data existed; I figured someone must've had a use case for it. What I came up with last night was making the behavior dependent on the feature_range argument. On dense input, keep the current behavior, while on sparse input (sketched after the list):

  • throw an exception if the range is [+a, +b] or [-a, -b] (i.e., it does not cross zero, hence would destroy sparsity)
  • throw an exception if the range is [0, +a] but the data contains values < 0 (i.e., it would destroy zeros in at least one feature); same if the range is [-a, 0] and the data contains values > 0
  • if the range is [-a, +b], scale all nonzero values (i.e., zeros will be preserved); this behavior is different from what the dense part does, but it should be obvious why, and it is easy to document/grasp for users
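
A rough sketch of those rules (hypothetical validation logic, names invented):

import scipy.sparse as sp

def check_feature_range(X, feature_range):
    lo, hi = feature_range
    if not sp.issparse(X):
        return  # dense input: keep the current behavior
    if lo > 0 or hi < 0:
        # the range does not cross zero: would destroy sparsity
        raise ValueError("feature_range must contain 0 for sparse input")
    dense = X.toarray()
    if lo == 0 and (dense < 0).any():
        raise ValueError("negative values cannot map into [0, hi]")
    if hi == 0 and (dense > 0).any():
        raise ValueError("positive values cannot map into [lo, 0]")
    # range crosses zero: scale only the stored values; zeros are preserved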

@temporaer (Contributor, Author)

When I first implemented this, I had something quite like Andy's solution, with one addition, however: I had a parameter assume_data_contains_zeros, which was True by default for sparse X. If the empirical data range did not include zero, it would be enlarged to contain it. I thought this was intuitive, since sparse data should be allowed to contain zeros anyway.

Pro:

  • more stable than Andy's solution (less likely to fail during grid search)

Con:

  • behavior on sparse and dense, but otherwise identical, X may differ if assume_data_contains_zeros is left at its default (False for dense, True for sparse)
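
The range enlargement itself would be tiny (sketch of that hypothetical default for sparse X):

import numpy as np

X = np.array([[3., -1.],
              [5.,  2.]])
data_min, data_max = X.min(axis=0), X.max(axis=0)  # [3., -1.], [5., 2.]

# assume_data_contains_zeros=True: widen each feature's empirical
# range so that it always includes zero
data_min = np.minimum(data_min, 0.0)  # [0., -1.]
data_max = np.maximum(data_max, 0.0)  # [5.,  2.]
# column 0's range [3, 5] becomes [0, 5]; for feature_range [0, 1]
# the additive offset then vanishes and sparsity is preserved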

@larsmans (Member)

Different behavior on sparse and dense has turned out to be very confusing for users. I've been trying to get rid of that everywhere I can.

@ogrisel (Member) commented Oct 22, 2013

The MaxAbsScaler idea is interesting. We can keep MinMaxScaler as it is so as not to break backward compatibility, but recommend MaxAbsScaler in the docs, as it can support dense and sparse data in a consistent way by default.

@amueller (Member)

Does anyone have a use case that needs the current functionality of MinMaxScaler? I think I coded it this way, but I have no idea why.

@amueller (Member)

Superseded by #4828.

@amueller closed this Jun 11, 2015