imputation by knn #2989
Adding a new strategy='knn' in the sklearn.preprocessing.Imputer class for imputing missing values using the KNN method.
Comments
How would that work?
I assume it's seeking something like http://www.mathworks.com.au/help/bioinfo/ref/knnimpute.html. Only I think we're talking about rows, not columns, in our data.
When will the 'knn' imputation be incorporated?
This is an extract from the book "Applied Predictive Modeling" by Max Kuhn and Kjell Johnson: "One popular technique for imputation is a K-nearest neighbor model."
Hello Chitcode, thanks for your reply. I'm aware of the Max Kuhn book and the reference... I'm not sure how it is relevant to the implementation of the 'knn' strategy in sklearn.preprocessing.Imputer; I just wanted to know if and when this will be incorporated. I'm a newbie to Python and sklearn and have been a bit frustrated at not finding alternatives beyond mean/median imputing. I found this link for imputing in Python via MICE, but it doesn't seem to be complete yet...
Well, it sounds like a good reference for someone interested in implementing this. As far as I know, nobody is working on an implementation at the moment. You could train alternative models such as Random Forests instead of KNN to predict the missing values.
> You could train alternative models such as Random Forests instead of KNN to predict the missing values.

Which wouldn't work well in high-dimensional settings... I think that the right API would be to be able to pass in an estimator for the prediction, and have a set of predefined choices that would be specified by a string, such as 'knn', 'random_forest', 'ridge'...
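A minimal sketch of how that string-or-estimator dispatch could look. None of this is existing scikit-learn API: the registry, the default parameters, and the helper function name are all hypothetical, chosen only to illustrate the proposal above.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# Hypothetical registry mapping the string shortcuts suggested above to
# default estimators; all names and defaults here are illustrative.
PREDEFINED_IMPUTATION_MODELS = {
    'knn': KNeighborsRegressor(n_neighbors=5),
    'random_forest': RandomForestRegressor(n_estimators=100),
    'ridge': Ridge(alpha=1.0),
}

def resolve_imputation_estimator(strategy):
    """Accept either one of the predefined strings or a user-supplied
    estimator object, as proposed in the comment above."""
    if isinstance(strategy, str):
        return PREDEFINED_IMPUTATION_MODELS[strategy]
    return strategy  # assumed to already be an estimator
```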
Thanks for the replies. I'm not yet good enough at Python programming to write the KNN/RF method myself. OGrisel: I have seen your YouTube tutorial on scikit-learn... do you always use or recommend the mean/median approach, or do you write custom functions yourself each time?
@skdatascientist please do not hijack this GitHub issue to ask for assistance or recommendations. This is not a discussion group but an issue tracker for a software project. Now, to end this side discussion: I am no data scientist myself, so I cannot recommend anything. If I had to deal with missing data in my own projects, I would probably try to implement both KNN- and RF-based imputation, measure the cross-validation score of my final model with both strategies, and compare to a median imputation baseline.
My apologies, my intention wasn't to 'hijack'. Thanks for your replies.
I think knn is a common method, and I think it is worth implementing it without relying on a general estimator. If we used a general estimator, we would have to train a separate model for each missing value, right? That seems very expensive with a RandomForestRegressor. Using KNN, I could use a once-built data structure over and over again.
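To illustrate that efficiency argument, here is a minimal sketch (not existing scikit-learn code) in which the neighbor structure is built once on the complete rows and then queried for every incomplete row. The synthetic data and the choice to fill query NaNs with column means before the distance computation are simplifying assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
X[rng.rand(100, 4) < 0.1] = np.nan  # knock out ~10% of the entries

complete = ~np.isnan(X).any(axis=1)
col_means = X[complete].mean(axis=0)
nn = NearestNeighbors(n_neighbors=5).fit(X[complete])  # built once, reused below

X_filled = X.copy()
for i in np.where(~complete)[0]:
    missing = np.isnan(X[i])
    query = X[i].copy()
    query[missing] = col_means[missing]  # crude placeholder for the distance query
    _, idx = nn.kneighbors(query.reshape(1, -1))
    # Average the neighbors' values in the missing columns.
    X_filled[i, missing] = X[complete][idx[0]][:, missing].mean(axis=0)
```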
@tw991 was interested in working on this, and I didn't want to set him up for something that we won't include.
+1000. Quite clearly very useful and standard.
I just noticed that the current imputer is ignorant of classes. Should it be?
It does seem to me that it should not be. That said, I can see situations where one might want that (rare classes, regression settings). How about an option to control that, and maybe a deprecation cycle for the default value?
I don't understand how one uses classes in e.g. mean imputation. Sure one …
Good point! That's another situation where it might be useful to break the …
Is it appropriate to use a different imputation strategy at train and test time?
I don't see why not. It does not break independence between train and test. If I were to try to solve a real world problem, I don't think that I would mind.
You must read the article about kNN imputation: "Missing value estimation methods for DNA microarrays", Olga Troyanskaya et al., Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.
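For reference, a rough sketch of the KNNimpute idea from that paper: each missing entry is filled with an inverse-distance-weighted average over the k nearest rows, where distance is computed on the mutually observed features. This is a simplified reading of the paper, assuming every pair of rows shares at least one observed feature and each incomplete column is observed in at least one other row.

```python
import numpy as np

def knn_impute(X, k=5):
    """Sketch of Troyanskaya-style KNNimpute (simplified)."""
    X = np.asarray(X, dtype=float)
    X_filled = X.copy()
    n = len(X)
    for i in range(n):
        for j in np.where(np.isnan(X[i]))[0]:
            # Candidate donors: other rows with column j observed.
            donors = [d for d in range(n) if d != i and not np.isnan(X[d, j])]
            dists = []
            for d in donors:
                both = ~np.isnan(X[i]) & ~np.isnan(X[d])
                # Mean squared difference over mutually observed features.
                dists.append(np.mean((X[i, both] - X[d, both]) ** 2))
            order = np.argsort(dists)[:k]
            nearest = np.asarray(donors)[order]
            weights = 1.0 / (np.asarray(dists)[order] + 1e-6)
            X_filled[i, j] = np.average(X[nearest, j], weights=weights)
    return X_filled
```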
Could it share some library code from http://scikit-learn.org/stable/modules/neighbors.html?
+1 |
+1 |
+1, also random forest imputation would be great: https://cran.r-project.org/web/packages/missForest/missForest.pdf
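For context, missForest iterates random-forest regressions per column until the imputations stabilize. Below is a compressed sketch of that idea for continuous features only; the fixed iteration count (instead of missForest's convergence criterion) and the mean initialization are simplifications, not the package's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, n_iter=5):
    """missForest-style loop: start from column means, then repeatedly
    re-fit a forest per incomplete column and overwrite its missing cells."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_imp = np.where(mask, np.nanmean(X, axis=0), X)  # mean-initialize holes
    for _ in range(n_iter):
        for j in np.where(mask.any(axis=0))[0]:
            obs = ~mask[:, j]
            other = np.delete(np.arange(X.shape[1]), j)
            rf = RandomForestRegressor(n_estimators=100, random_state=0)
            rf.fit(X_imp[np.ix_(obs, other)], X_imp[obs, j])
            X_imp[mask[:, j], j] = rf.predict(X_imp[np.ix_(mask[:, j], other)])
    return X_imp
```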
+1, I would be happy to submit a patch if someone can point me in the direction of how-tos and coding guidelines for sklearn.
See http://scikit-learn.org/stable/developers/contributing.html, but also note that there's something somewhat in the works at #4844. Maybe you could volunteer to help or take that over.
[UPDATED after comments 1, 2] How about something like this?
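The snippet itself did not survive above. Judging from the replies that follow, the idea was roughly: for each column containing missing values, fit an estimator on the rows that miss no values and predict the holes, with a second pass that reuses the first-pass estimates. Below is a rough reconstruction under those assumptions, not the author's actual code; it is similar in structure to the missForest sketch above, but with a generic estimator, and the mean bootstrap of the prediction inputs is an added assumption to keep the sketch runnable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def impute_per_column(X, estimator=None, n_iter=2):
    """Fit one model per incomplete column on the complete rows, predict
    the missing entries, then repeat using the filled matrix."""
    if estimator is None:
        estimator = KNeighborsRegressor(n_neighbors=5)
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    complete = ~mask.any(axis=1)
    # Bootstrap the holes with column means so prediction inputs are valid.
    X_imp = np.where(mask, np.nanmean(X, axis=0), X)
    train_rows = complete  # first pass: train on fully observed rows only
    for _ in range(n_iter):
        for j in np.where(mask.any(axis=0))[0]:
            other = np.delete(np.arange(X.shape[1]), j)
            estimator.fit(X_imp[np.ix_(train_rows, other)], X_imp[train_rows, j])
            X_imp[mask[:, j], j] = estimator.predict(
                X_imp[np.ix_(mask[:, j], other)])
        train_rows = np.ones(len(X), dtype=bool)  # later passes use all rows
    return X_imp
```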
I think the point of KNN imputation is that it can be done relatively efficiently, even incorporating samples that have NaNs in them. Your solution is more generic, though one could also consider strategies that incorporate all training data (not just those missing no values) by using a default imputation strategy over it, or indeed iteratively updating the estimate as in the strategies implemented at https://github.com/hammerlab/fancyimpute/.
@essicolo Hello, I think there may be a tiny issue in your code. For instance, for the array below:

x11, x12, na,  na
x21, x22, x23, na

When you consider filling the na at the first row, third column, the second row should not be removed, since we will only use the first two columns as features and the third column as target, so the na at the second row, fourth column will not matter.
Good comments. I added an iteration loop (updated in the previous comment). In the first loop, the imputation is done with the initial information; in the second, with the values imputed in the first loop. However, in the example I wrote, the number of iterations doesn't affect the outcome. I can't figure out why.
Hi all -- as you might have noticed, I recently added a pull request that implements a kNN imputation algorithm. I would very much welcome and appreciate any feedback/suggestions/criticism you might have regarding the code. Also, it would be super if anybody wanted to join in, as the development is not yet complete.
You might also want to consider how the R package caret does things: https://github.com/topepo/caret/blob/master/pkg/caret/R/preProcess.R#L738-L754
Caret internally uses the function …
I'll definitely take a look. Thanks!