Rank normalization of features #1062
Comments
This could be a Ranker object next to the Scaler we have in the preprocessing module.
BTW, how does it work for unseen data?
The difficulty is that you have to store all the rank information in the transformer. So if you have roughly 50K different ranks in each feature and 100 features, that's a rank matrix of 50K x 100. I think there should be a parameter that controls the number of ranks per feature. A default of 1000 sounds reasonable.
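The bounded-memory idea above can be sketched by storing a fixed grid of quantiles per feature instead of the full rank table. This is only a sketch under that assumption; the `RankScaler` name and API here are hypothetical:

```python
import numpy as np

class RankScaler:
    """Hypothetical sketch: keep only n_quantiles reference values per
    feature instead of the full 50K-rank table."""

    def __init__(self, n_quantiles=1000):
        self.n_quantiles = n_quantiles

    def fit(self, X):
        # references_ has shape (n_quantiles, n_features): bounded memory
        qs = np.linspace(0, 100, self.n_quantiles)
        self.references_ = np.percentile(X, qs, axis=0)
        return self

    def transform(self, X):
        grid = np.linspace(0.0, 1.0, self.n_quantiles)
        out = np.empty(X.shape, dtype=float)
        for j in range(X.shape[1]):
            # interpolate each value onto its approximate normalized rank;
            # np.interp maps values outside the training range to 0 or 1
            out[:, j] = np.interp(X[:, j], self.references_[:, j], grid)
        return out
```

With `n_quantiles` fixed, memory no longer grows with the number of training samples, at the cost of an approximate rank for values between stored quantiles.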
RankScaler is a better name IMO
Sounds good. PR welcome :)
+1.
+1 for a PR ;)
I think this is more-or-less fixed by QuantileTransformer...
Is there any resource from which I can learn about rank transformation? I can't find it explained anywhere in detail!
I think this is now QuantileTransformer |
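For reference, a minimal usage sketch of the scikit-learn transformer mentioned above (the data and outlier value are illustrative):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.array([[1.0], [2.0], [3.0], [1000.0]])  # one huge outlier

# uniform output = normalized ranks in [0, 1]
qt = QuantileTransformer(n_quantiles=4, output_distribution="uniform")
Xt = qt.fit_transform(X)
# the outlier maps to 1.0 regardless of its magnitude
```

Note that `n_quantiles` plays exactly the role discussed earlier in the thread: it bounds how much rank information is stored per feature.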
I was talking about this feature with @ogrisel and he asked me to place an issue for it.
This is a technique suggested by Yoshua Bengio to handle features with unknown scale:
Convert the features to rank scale, so the lowest rank is 0 and the highest rank is 1. This is superior to the z-transform (zero mean, unit variance) because a single huge outlier can distort the mean and variance, whereas a rank transform is robust to it.
You can see a description here of Python code to do this:
http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
scipy.stats.rankdata does it.
It converts values to ranks, but doesn't normalize by the number of samples.
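For example, combining rankdata with the normalization step (dividing by n - 1 so the range is exactly [0, 1]; the sample values are illustrative):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 100.0, 2.0, 3.0])  # 100.0 is a huge outlier
ranks = rankdata(x)                    # [1., 4., 2., 3.]
scaled = (ranks - 1) / (len(x) - 1)    # lowest -> 0.0, highest -> 1.0
# the outlier lands at 1.0 no matter how large it is
```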
One thing to be careful of:
If a feature has a lot of zeros and you do a rank transform using scipy.stats.rankdata and then normalize by the number of samples, the zeros will end up with rank > 0, so you lose sparsity. To preserve sparsity, I would recommend scaling the range of the ranks to [0, 1] and clipping any test-set feature value that falls outside the training range.
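A sketch of that recommendation (function names are hypothetical): store the sorted training column, map each value to the fraction of training values strictly below it so that zeros stay exactly zero, and clip unseen test values into [0, 1]:

```python
import numpy as np

def fit_rank(train_col):
    # the stored "rank table" is just the sorted training values
    return np.sort(train_col)

def transform_rank(x, sorted_train):
    n = len(sorted_train)
    # fraction of training values strictly below each x;
    # ties at the minimum (e.g. zeros) map to exactly 0, preserving sparsity
    r = np.searchsorted(sorted_train, x, side="left") / n
    # clip test values that fall outside the training range
    return np.clip(r, 0.0, 1.0)
```

Zeros map to 0.0 as long as zero is the minimum training value, and a test value larger than anything seen in training is clipped to 1.0.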