Global normalization and sparse matrix support for MinMaxScaler #1799


Closed · wants to merge 2 commits

Conversation

temporaer (Contributor)

tests working & doc in place, needs feedback and merge if OK

@amueller (Member)

awesome, thanks :)
Could it be that you forgot to add some files / modify the setup.py? Travis is complaining.

@amueller (Member)

You didn't add the updated .c file ;)

@temporaer (Contributor, Author)

yep, I didn't want to clutter the commit log at this stage. I'll add some tests first and then take care of this whiny Travis guy.

cdef unsigned int ind
cdef double v

# means[j] contains the minimum of feature j
Review comment (Member):
wrong comment :)

@ogrisel (Member) commented Mar 22, 2013

> yep, I didn't want to clutter the commit log at this stage.

Please feel free to check in the .c file as well. The diff clutter can be controlled by:

https://github.com/scikit-learn/scikit-learn/blob/master/.gitattributes
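
The mechanism, for reference: files marked with the -diff attribute are treated as binary in diffs, so generated C files stop flooding the review. An illustrative entry (the exact path in the repo's .gitattributes may differ):

# .gitattributes
*.c -diff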

@ogrisel (Member) commented Mar 22, 2013

This looks like a great contrib, @temporaer. Looking forward to the tests and some documentation updates in the preprocessing section.

data_range = data_max - data_min
else:
# TODO why would anyone want to force a copy here? (this was here
# before, I believe)
Review comment (Member):
I don't think the copy is necessary. It's only relevant in the transform method.

@amueller (Member) commented Apr 2, 2013

@temporaer could you please squash the commits? (because that's something we do now ^^)

@temporaer (Contributor, Author)

uh... squash /all/ of them into one? Or just clean up a little?

@larsmans (Member) commented Apr 2, 2013

Preferably everything. That keeps the history slim, makes reverting easy, prevents broken states in the history, and hides our mistakes from the users ;)
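
For reference, squashing everything down to one commit typically looks like this (branch names are placeholders):

git rebase -i upstream/master     # mark all but the first commit as "squash"
git push --force origin my-feature-branch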

@amueller (Member) commented Apr 2, 2013

Thanks. +1 for merge :)

Set to False to normalize using the minimum/maximum of all
features, not per column

assume_contains_zeros : boolean, optional, default is False
Review comment (Member):
I'm sorry, I really don't understand this explanation.

@temporaer (Contributor, Author):
How about: "If True, normalize all columns separately. Set to False to normalize using the minimum and maximum of X." Maybe also rename it to per_column?
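
For illustration, a minimal NumPy sketch of the two behaviors (the per_feature name follows the proposal above; it is not existing API):

import numpy as np

X = np.array([[1., 10.],
              [2., 20.],
              [3., 30.]])

# per_feature=True: every column is scaled with its own min/max
per_col = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# -> both columns span [0, 1]

# per_feature=False: one global min/max for the whole matrix
global_ = (X - X.min()) / (X.max() - X.min())
# -> column 0 spans [0, ~0.07], column 1 spans [~0.31, 1];
#    the relative magnitudes of the features are preserved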

Review comment (Member):
I wasn't referring to per_feature, that's fine as it is. I don't understand the behavior of assume_contains_zeros.

@temporaer (Contributor, Author):

Hm OK, there was only one "explanation" in the excerpt that GitHub selected :-)

How about:
MinMaxScaler does not work for sparse matrices when the additive component of the normalization is non-zero. When normalizing to the range of [0,1], this implies that the minimum in every column of X must be 0. Setting assume_contains_zeros=True enforces this constraint even if some columns of X are strictly positive.
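
A rough sketch of the problem, assuming the usual min-max formula X_scaled = (X - data_min) * scale (illustrative only, not the PR's actual code):

import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([[3., 0.],
                   [5., 2.],
                   [4., 0.]])

data_min = X.toarray().min(axis=0)   # [3., 0.]
data_max = X.toarray().max(axis=0)   # [5., 2.]
scale = 1.0 / (data_max - data_min)

# The multiplicative part alone preserves sparsity:
X_mul = X.multiply(scale)            # implicit zeros stay zero

# The additive part, -data_min * scale, is non-zero for column 0
# (its minimum is 3, not 0), so applying it would turn every implicit
# zero into a stored value, i.e. densify the matrix.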

@amueller (Member) commented May 5, 2013

@larsmans ok for merge?

@larsmans (Member) commented May 5, 2013

The code looks alright, but I still don't understand the assume_contains_zeros option, and I want the API to be clear before merging this.

@amueller (Member) commented May 5, 2013

I agree. I'm also not sure what it is we actually want in the sparse case.
I think the default option should keep sparsity structure, as this is how we did it in other cases afaik (RandomizedPCA for example), and the user needs to be explicit to destroy sparsity structure.

So maybe we want an with_offset option which is True for dense and False for sparse input.

If the input is strictly positive, between 0 and x, and the given range is [0, a], then the right thing will happen.
If the input is between -x and x and the range is [-a, a], the right thing will also happen.

In all other cases something a bit weird would happen, though: if the input is between -x and x and you ask for [0, a], you would get [0, a/2], I think... well...
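
A sketch of those cases with a scale-only transform (hypothetical with_offset=False behavior, plain NumPy):

import numpy as np

def scale_only(X, feature_range):
    # rescale by the range ratio, never shift
    lo, hi = feature_range
    scale = (hi - lo) / (X.max(axis=0) - X.min(axis=0))
    return X * scale

print(scale_only(np.array([[0.], [2.]]), (0., 1.)))    # [0, x] -> [0, 1]: fine
print(scale_only(np.array([[-2.], [2.]]), (-1., 1.)))  # [-x, x] -> [-1, 1]: fine
print(scale_only(np.array([[-2.], [2.]]), (0., 1.)))   # -> [-0.5, 0.5], not [0, 1]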

@temporaer (Contributor, Author)

Making centering optional sounds good to me; it is somewhat more intuitive than "injecting" artificial zeros. We'd still need to explain why the user would want to change the default behavior, and there the complexity would be unchanged. I have no preference here and I'm happy to implement what you API guys decide :-)


@amueller (Member) commented May 6, 2013

Any opinion @larsmans ?

@larsmans (Member) commented May 6, 2013

I think center="auto" with a default value of issparse(X) and an exception for the combination of True and dense arrays would be fine.

@amueller (Member) commented May 6, 2013

I wouldn't call it 'center'. And you probably mean "sparse" + True is an exception, right?

@larsmans (Member)

It's time to get this merged. @amueller, what name do you suggest?

@amueller (Member)

I would have called it with_offset (similar to with_mean).

@temporaer (Contributor, Author)

Hope it's all right now...

@ogrisel (Member) commented Jul 9, 2013

Could you please rebase on top of master (or merge master into your branch) and fix any conflicts so as to make GitHub happy?

@temporaer (Contributor, Author)

@ogrisel rebase done, travis happy ;-)

where differences between features with a small range will influence
distances, and thus classification, less). You may want to keep this
relative importance information and still rescale your features to a
sensible range. This can be done by setting ``per_feature``::
Review comment (Member):
=> "This can be done by setting per_feature to False:"


with_offset : boolean, optional, default is True for dense,
and False for sparse X.
If False, no additive normalization is used, only scaling.
Review comment (Member):
Could you rephrase this to make it more explicit? What is "additive transformation"?

@ogrisel (Member) commented Jul 10, 2013

Apart from the lack of documentation for the with_offset option, it looks good. I am not sure about the behavior being different for sparse matrices and dense arrays, though. A naive user would likely expect the default behavior to be the same (as long as min_ == 0).

@temporaer (Contributor, Author)

@ogrisel we discussed the with_offset behavior above. Summary: my first implementation kept the behavior the same for dense/sparse. The complicating issue is that sparse matrices include implicit zeros, which need to be taken into account. You could then argue that sparse matrices are /meant/ to have zeros, so zero should always be part of the input range. A sparse matrix which does not have zeros would then behave differently from its dense counterpart. The original implementation had a parameter assume_contains_zeros, with default True for sparse matrices. That was too complicated to document and non-intuitive for naive users. I think the solution as it is now (by default, only use scaling for sparse) is the lesser of the two evils.
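
For concreteness, the implicit-zeros issue (illustrative only):

import scipy.sparse as sp

# Two stored values in a 3x1 column; the third entry is an implicit zero.
X = sp.csr_matrix([[3.], [5.], [0.]])

print(X.data.min())              # 3.0 -- minimum over stored values only
print(X.toarray().min(axis=0))   # [0.] -- minimum once implicit zeros count

# Whether the feature's range is taken as [3, 5] or [0, 5] changes the
# scaling, which is what assume_contains_zeros was meant to pin down.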

@ogrisel (Member) commented Jul 10, 2013

Still, better documentation (with an example) is required, because right now I am not sure what it actually does.

@ogrisel (Member) commented Oct 21, 2013

@temporaer do you plan to finish the doc work on this PR or would you like someone else to take over the PR from there?

@temporaer (Contributor, Author)

@ogrisel when I tried last, it turned out to be more than just doc work. I'll try to produce something discussable in the next few days.

@ogrisel (Member) commented Oct 21, 2013

Thanks!

@untom (Contributor) commented Oct 21, 2013

Hi there!

I'm working on something similar over in #2514, and I think we might end up duplicating effort (I was in fact already looking into merging your PR into mine) :) Maybe we should hash something out together before we both waste time?

Where are you currently stuck/having problems?

@amueller (Member)

Tonight's favoured (by me) solution to the semantic problems: if the data is sparse and the offset is non-zero, raise a ValueError. That means that on the off chance that a sparse array doesn't contain a zero (and it is scaled between, say, 0 and 1), it will bail.

Pro:

  • when it computes something, it is consistent with the dense case
  • It never makes a matrix dense
  • If it computes something, the result is very natural

Con:

  • it might fail during a GridSearch.

@larsmans @ogrisel wdyt?

The second-best option: make MinMaxScaler densify if it has to, and make an additional estimator that only scales data.
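
A sketch of the first proposal (hypothetical names; a real implementation would compute the sparse minimum without densifying):

import numpy as np
import scipy.sparse as sp

def fit_minmax(X, feature_range=(0.0, 1.0)):
    lo, hi = feature_range
    dense = X.toarray() if sp.issparse(X) else X
    data_min, data_max = dense.min(axis=0), dense.max(axis=0)
    scale = (hi - lo) / (data_max - data_min)
    offset = lo - data_min * scale
    if sp.issparse(X) and np.any(offset != 0):
        # a non-zero offset would densify the matrix: bail
        raise ValueError("cannot min-max scale this sparse input "
                         "without destroying sparsity")
    return scale, offset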

@amueller (Member)

OK, an even better (though deprecation-cycle-inducing) idea: remove MinMaxScaler, as no one needs that functionality anyway. Instead, introduce MaxAbsScaler, which scales the data to have a given maximum absolute value and never adds anything to the data.
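
A minimal sketch of that idea (not the eventual scikit-learn implementation; assumes no all-zero column):

import numpy as np
import scipy.sparse as sp

def max_abs_scale(X, max_abs=1.0):
    # purely multiplicative per feature, so implicit zeros stay zero
    dense = X.toarray() if sp.issparse(X) else X
    scale = max_abs / np.abs(dense).max(axis=0)
    return X.multiply(scale).tocsr() if sp.issparse(X) else X * scale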

@untom (Contributor) commented Oct 22, 2013

I personally was always wondering why the option to add something to the data existed; I figured someone must've had a use case for it. What I came up with last night was making the behavior dependent on the feature_range argument. On dense input, keep the current behavior, while on sparse input (sketched after the list):

  • throw an exception if the range is [+a, +b] or [-a, -b] (i.e., it does not cross zero, hence would destroy sparsity)
  • throw an exception if the range is [0, +a] but the data contains values < 0 (i.e., it would destroy zeros in at least one feature); same if the range is [-a, 0] and the data contains values > 0
  • if the range is [-a, +b], scale all nonzero values (i.e., zeros will be preserved); this behavior is different from what the dense part does, but it should be obvious why, and it is easy to document/grasp for users
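
A rough sketch of those rules (hypothetical validation logic, names invented):

import scipy.sparse as sp

def check_feature_range(X, feature_range):
    lo, hi = feature_range
    if not sp.issparse(X):
        return  # dense input: keep the current behavior
    if lo > 0 or hi < 0:
        # the range does not cross zero: would destroy sparsity
        raise ValueError("feature_range must contain 0 for sparse input")
    dense = X.toarray()
    if lo == 0 and (dense < 0).any():
        raise ValueError("negative values cannot map into [0, hi]")
    if hi == 0 and (dense > 0).any():
        raise ValueError("positive values cannot map into [lo, 0]")
    # range crosses zero: scale only the stored values; zeros are preserved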

@temporaer (Contributor, Author)

When I first implemented this, I had something quite like Andy's solution, with one addition, however: I had a parameter assume_data_contains_zeros, which was True by default for sparse X. If the empirical data range did not include zero, it would be enlarged to contain it. I thought this was intuitive, since sparse data should be allowed to contain zeros anyway.

Pro:

  • more stable than Andy's solution (less likely to fail during grid search)

Con:

  • behavior on sparse and dense, but otherwise identical, X may differ if assume_data_contains_zeros is left at its default (False for dense, True for sparse)
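
The range enlargement itself would be tiny (sketch of that hypothetical default for sparse X):

import numpy as np

X = np.array([[3., -1.],
              [5.,  2.]])
data_min, data_max = X.min(axis=0), X.max(axis=0)  # [3., -1.], [5., 2.]

# assume_data_contains_zeros=True: widen each feature's empirical
# range so that it always includes zero
data_min = np.minimum(data_min, 0.0)  # [0., -1.]
data_max = np.maximum(data_max, 0.0)  # [5.,  2.]
# column 0's range [3, 5] becomes [0, 5]; for feature_range [0, 1]
# the additive offset then vanishes and sparsity is preserved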

@larsmans (Member)

Different behavior on sparse and dense has turned out to be very confusing for users. I've been trying to get rid of that everywhere I can.

@ogrisel (Member) commented Oct 22, 2013

The MaxAbsScaler idea is interesting. We can keep MinMaxScaler as it is so as not to break backward compatibility, but recommend MaxAbsScaler in the docs, as it can support dense and sparse data in a consistent way by default.

@amueller (Member)

Does anyone have a use case that needs the current functionality of MinMaxScaler? I think I coded it this way, but I have no idea why.

@amueller (Member)

Superseded by #4828.

@amueller closed this Jun 11, 2015