
Rank normalization of features #1062


Closed
turian opened this issue Aug 24, 2012 · 11 comments

Comments

@turian

turian commented Aug 24, 2012

I was talking about this feature with @ogrisel and he asked me to place an issue for it.

This is a technique suggested by Yoshua Bengio to handle features with unknown scale:

Convert the features to rank scale, so the lowest rank is 0 and the highest rank is 1. This is more robust than the z-transform (zero mean, unit variance): a single huge outlier can distort the mean and variance, whereas a rank transform is unaffected by it.

You can see a description here of Python code to do this:
http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
scipy.stats.rankdata does it.
They convert to ranks, but don't normalize by the number of samples.
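A minimal sketch of the idea (`rank_scale` is a hypothetical helper name, not a library function): rank the values with `scipy.stats.rankdata`, then rescale so the smallest value maps to 0 and the largest to 1.

```python
import numpy as np
from scipy.stats import rankdata

def rank_scale(x):
    """Map the values of a 1-D feature to their normalized ranks in [0, 1]."""
    ranks = rankdata(x)              # ranks in 1..n (ties get the average rank)
    return (ranks - 1) / (len(x) - 1)

x = np.array([3.0, 1.0, 1e6, 2.0])   # one huge outlier
print(rank_scale(x))                  # the outlier maps to 1, not to some huge value
```

Note how the `1e6` outlier lands at 1.0 instead of dominating the scale, which is the robustness argument above.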

One thing to be careful of:
If you have a lot of zeros, do a rank transform using scipy.stats.rankdata, and then normalize by the number of samples, the zeros will end up with rank > 0, so you lose sparsity. To preserve sparsity, I would recommend scaling the range of the ranks to [0, 1] and clipping any test-set feature value that exceeds the range.
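The sparsity-preserving variant described above could be sketched as follows. `SimpleRankScaler` is a hypothetical name, not the eventual scikit-learn class; using `np.searchsorted` with `side='left'` amounts to a minimum rank, so tied zeros (the minimum value) map exactly to 0, and out-of-range test values are clipped into [0, 1].

```python
import numpy as np

class SimpleRankScaler:
    """Sketch of a fit/transform rank scaler for one feature column."""

    def fit(self, x):
        # Store the sorted training values; ranks are implied by position.
        self.sorted_ = np.sort(np.asarray(x, dtype=float))
        return self

    def transform(self, x):
        n = len(self.sorted_)
        # side='left' counts training values strictly below each x,
        # i.e. the zero-based minimum rank; tied minima (e.g. zeros) get 0.
        idx = np.searchsorted(self.sorted_, x, side='left')
        # Clip so test values outside the training range stay in [0, 1].
        return np.clip(idx / (n - 1), 0.0, 1.0)

s = SimpleRankScaler().fit([0.0, 0.0, 0.0, 1.0, 2.0, 5.0])
print(s.transform(np.array([0.0, 1.0, 10.0])))  # zeros stay 0; 10.0 clips to 1
```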

@agramfort
Member

this could be a Ranker object next to the Scaler we have in preprocessing module.
PR welcome :)

@mblondel
Member

Ranker sounds a bit too generic (could be confused with learning to rank). How about RankScaler?

BTW, how does it work for unseen data? (transform method of the transformer)

@turian
Author

turian commented Aug 27, 2012

The difficulty is that you have to store all the rank information in the transformer.

So if you have roughly 50K distinct ranks in each feature and 100 features, that's a 50K x 100 rank matrix.

I think there should be a parameter that controls the number of ranks per feature. A default of 1000 sounds reasonable.
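A sketch of that binning idea (function names are hypothetical): instead of keeping every distinct value, store only `n_ranks` quantile boundaries per feature, so memory is capped at `n_ranks` x `n_features` regardless of how many distinct values the training data has.

```python
import numpy as np

def fit_binned_ranks(X, n_ranks=1000):
    """Keep n_ranks quantile boundaries per feature (shape: n_ranks x n_features)."""
    qs = np.linspace(0, 100, n_ranks)
    return np.percentile(X, qs, axis=0)

def transform_binned_ranks(X, boundaries):
    """Map each value to the normalized index of its quantile bin, clipped to [0, 1]."""
    n_ranks = boundaries.shape[0]
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        idx = np.searchsorted(boundaries[:, j], X[:, j], side='left')
        out[:, j] = np.clip(idx / (n_ranks - 1), 0.0, 1.0)
    return out

X = np.arange(100, dtype=float).reshape(-1, 1)
B = fit_binned_ranks(X, n_ranks=5)          # only 5 stored boundaries
print(transform_binned_ranks(X, B)[:3, 0])   # coarse, but bounded memory
```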

@turian
Author

turian commented Aug 27, 2012

RankScaler is a better name IMO

@agramfort
Member

> RankScaler is a better name IMO

sounds good. PR welcome :)

@GaelVaroquaux
Member

> Ranker sounds a bit too generic (could be confused with learning to rank). How about RankScaler?

+1.

@amueller
Member

+1 for PR ;)

@turian
Author

turian commented Jul 21, 2013

PR #2176

@mblondel mblondel mentioned this issue Nov 10, 2015
@jnothman
Member

I think this is more-or-less fixed by QuantileTransformer...

@ajay9022

Is there any resource where I can learn about rank transformation? I can't find it explained anywhere in detail.

@jnothman
Member

I think this is now QuantileTransformer
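For reference, `QuantileTransformer` is a real class in scikit-learn's preprocessing module and performs exactly this kind of rank/quantile scaling (the data below is synthetic):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 2))   # skewed, outlier-prone data

# Map each feature to its empirical quantiles, i.e. normalized ranks in [0, 1].
qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
Xt = qt.fit_transform(X)
print(Xt.min(), Xt.max())          # all values now lie in [0, 1]
```

Like the RankScaler proposed above, it clips unseen test values that fall outside the training range to the [0, 1] bounds.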
