imputation by knn #2989


Closed
Tracked by #36
chitcode opened this issue Mar 22, 2014 · 33 comments · Fixed by #12852

Comments

@chitcode

Adding a new strategy='knn' to the sklearn.preprocessing.Imputer class for imputing missing values using the kNN method.

@larsmans
Member

How would that work?

@jnothman
Member

I assume it's seeking something like http://www.mathworks.com.au/help/bioinfo/ref/knnimpute.html:

knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.

Only I think we're talking about rows, not columns, in our data.
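
For illustration, a minimal sketch of that idea applied to rows (knn_impute_value is a hypothetical helper, not an existing scikit-learn function): the missing entry is replaced by an inverse-distance-weighted mean over the k nearest complete rows, with distances computed only on the columns observed in the incomplete row.

import numpy as np

def knn_impute_value(X, row, col, k=3):
    # Columns observed in the incomplete row (excludes `col`, which is NaN there).
    observed = ~np.isnan(X[row])
    # Candidate neighbours: rows with no missing values at all.
    complete = np.flatnonzero(~np.isnan(X).any(axis=1))
    # Euclidean distance restricted to the observed columns.
    dists = np.linalg.norm(X[complete][:, observed] - X[row, observed], axis=1)
    order = np.argsort(dists)[:k]
    weights = 1.0 / (dists[order] + 1e-8)  # inverse-distance weights
    return np.average(X[complete[order], col], weights=weights)

# Example: fill the missing third value of the first row from its 2 nearest rows.
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.1],
              [0.9, 1.8, 2.9],
              [6.0, 7.0, 8.0]])
X[0, 2] = knn_impute_value(X, row=0, col=2, k=2)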

@sskarkhanis

When will the 'knn' imputation be incorporated?

@chitcode
Author

This is an extract from the book "Applied Predictive Modeling" by Max Kuhn and Kjell Johnson:

One popular technique for imputation is a K-nearest neighbor model. A new sample is imputed by finding the samples in the training set “closest” to it and averages these nearby points to fill in the value. Troyanskaya et al. (2001) examine this approach for high-dimensional data with small sample sizes. One advantage of this approach is that the imputed data are confined to be within the range of the training set values. One disadvantage is that the entire training set is required every time a missing value needs to be imputed. Also, the number of neighbors is a tuning parameter, as is the method for determining “closeness” of two points. However, Troyanskaya et al. (2001) found the nearest neighbor approach to be fairly robust to the tuning parameters, as well as the amount of missing data.

@sskarkhanis

Hello Chitcode,

Thanks for your reply. I'm aware of the Max Kuhn book and the reference, but I'm not sure how it is relevant to the implementation of a 'knn' strategy in sklearn.preprocessing.Imputer.

I just wanted to know if and when this will be incorporated. I'm a newbie to Python and sklearn and have been a bit frustrated with not finding alternatives beyond mean/median imputing.

I found this link for imputing in Python via MICE, but it doesn't seem to be complete yet:
http://gsocfrankcheng.blogspot.ca/
What alternatives would you suggest for imputing beyond mean/median? I'm new to Python, so I'm not sure I could, or would want to, write a full-fledged kNN/random forest imputation myself.
Cheers

@ogrisel
Member

ogrisel commented Feb 27, 2015

> Thanks for your reply. I'm aware of the Max Kuhn book and the reference, but I'm not sure how it is relevant to the implementation of a 'knn' strategy in sklearn.preprocessing.Imputer.

Well, it sounds like a good reference for someone interested in implementing this.

> I just wanted to know if and when this will be incorporated.

As far as I know nobody is working on an implementation at the moment.

> What alternatives would you suggest for imputing beyond mean/median?

You could train alternative models such as Random Forests to predict the missing values instead of KNN.

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 27, 2015 via email

@sskarkhanis

Thanks for the replies. I'm not yet good enough at Python programming to write the kNN/RF method myself.
I tried to call R packages from Python with rpy2 to get the imputation done, but couldn't get it to work yet.
I use many of the packages listed here when I do ML in R:
Link: http://www.stefvanbuuren.nl/mi/Software.html
I hope there will be similar techniques for imputing in Python in the future.

@ogrisel: I have seen your YouTube tutorial on scikit-learn. Do you always use or recommend the mean/median approach, or do you write custom imputation functions yourself each time?

@ogrisel
Member

ogrisel commented Feb 27, 2015

@skdatascientist please do not hijack this GitHub issue to ask for assistance or recommendations. This is not a discussion group but an issue tracker for a software project. Now, to end this side-discussion: I am no data scientist myself, so I cannot recommend anything. If I had to deal with missing data in my own projects, I would probably implement both KNN- and RF-based imputation, measure the cross-validation score of my final model with both strategies, and compare against a median-imputation baseline.
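
As a sketch of that comparison, using the modern scikit-learn API (SimpleImputer and the impute module postdate this comment and are assumed here), one could score the same pipeline under different imputation strategies:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

for strategy in ("mean", "median"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy),
                         RandomForestClassifier(random_state=0))
    print(strategy, cross_val_score(pipe, X, y, cv=5).mean())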

@sskarkhanis

My apologies, my intention wasn't to 'hijack'. Thanks for your replies.

@amueller
Member

amueller commented Jun 8, 2015

I think kNN imputation is a common method, and I think it is worth implementing it directly rather than through a general estimator. If we used a general estimator, you would have to train a separate model for each missing value, right? That seems very expensive with a RandomForestRegressor. With KNN, I could build the data structure once and reuse it over and over again.
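
The "once-built data structure" point can be sketched with sklearn.neighbors: the index over the complete rows is built a single time and then queried for every missing entry (only an illustration, assuming all missing values sit in the same column):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [0.9, 1.9, 2.9],
              [5.0, 6.0, np.nan],
              [5.1, 6.1, 7.1]])
missing_rows = np.flatnonzero(np.isnan(X[:, 2]))
complete = X[~np.isnan(X).any(axis=1)]

# Build the neighbour index once over the observed columns...
nn = NearestNeighbors(n_neighbors=2).fit(complete[:, :2])

# ...then reuse it for every row that needs imputation.
for r in missing_rows:
    _, idx = nn.kneighbors(X[r, :2].reshape(1, -1))
    X[r, 2] = complete[idx[0], 2].mean()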

@amueller
Member

amueller commented Jun 8, 2015

@tw991 was interested in working on this, and I didn't want to set him up for something that we won't include.

@GaelVaroquaux
Member

+1000. Quite clearly very useful and standard.


@amueller
Member

amueller commented Jun 8, 2015

I just noticed that the current imputer is ignorant of classes. Should it be? y should be optional and used if given, right? I'm not sure how important that is for kNN, but for mean/median strategies it seems fairly important.

@GaelVaroquaux
Member

It does seem to me that it should not be. That said, I can see situations where one might want that (rare classes, regression settings).

How about an option to control that, and maybe a deprecation cycle for the default value?


@jnothman
Member

jnothman commented Jun 9, 2015

I don't understand how one uses classes in e.g. mean imputation. Sure one can model a different mean for each class, but how does that apply at test time?


@GaelVaroquaux
Member

> I don't understand how one uses classes in e.g. mean imputation. Sure one can model a different mean for each class, but how does that apply at test time?

Good point! That's another situation where it might be useful to break the equivalence between fit_transform and fit + transform (we were discussing with @amueller that for data resampling / downsampling / upsampling such a choice might be useful).
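
To make that concrete, a hypothetical sketch of a class-aware mean imputer (ClassMeanImputer is not an existing scikit-learn class): fit_transform can use per-class means because y is at hand, while transform on unseen data has to fall back to the global means, which is exactly where the fit_transform / fit + transform equivalence breaks.

import numpy as np

class ClassMeanImputer:
    """Hypothetical sketch: per-class means when y is available, global means otherwise."""

    def fit(self, X, y=None):
        self.global_means_ = np.nanmean(X, axis=0)
        if y is not None:
            self.class_means_ = {c: np.nanmean(X[y == c], axis=0) for c in np.unique(y)}
        return self

    def transform(self, X):
        # At prediction time y is unknown, so only the global means can be used.
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.global_means_[cols]
        return X

    def fit_transform(self, X, y=None):
        # With y in hand, each missing value is filled with the mean of its own class.
        self.fit(X, y)
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        for r, c in zip(rows, cols):
            means = self.class_means_[y[r]] if y is not None else self.global_means_
            X[r, c] = means[c]
        return X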

@jnothman
Member

jnothman commented Jun 9, 2015

Is it appropriate to use a different imputation strategy at train and test time?


@GaelVaroquaux
Member

I don't see why not. It does not break independence between train and test. If I were to try to solve a real world problem, I don't think that I would mind.


@imanojkumar

You must read the article about kNN imputation: "Missing value estimation methods for DNA microarrays", Olga Troyanskaya et al., Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.

@movelikeriver

Could this share some of the library code from http://scikit-learn.org/stable/modules/neighbors.html? It would also need to support discrete data for the distance calculation.

@jachymb

jachymb commented Feb 29, 2016

+1

@brucechou1983

+1

@bobcolner

+1, also random forest imputation would be great. https://cran.r-project.org/web/packages/missForest/missForest.pdf

@ankitagarwal

ankitagarwal commented Nov 20, 2016

+1, I would be happy to submit a patch if someone can point me to the how-tos and coding guidelines for sklearn.

@jnothman
Member

See http://scikit-learn.org/stable/developers/contributing.html but also note that there's something somewhat in the works at #4844. Maybe you could volunteer to help or take that over.

@essicolo

essicolo commented Jan 24, 2017

[UPDATED after comments 1, 2] How about something like this?

import numpy as np
import random
from sklearn import datasets
from sklearn import neighbors

def impute(mat, learner, n_iter=3):
    mat = np.array(mat)
    mat_isnan = np.isnan(mat)        
    w = np.where(np.isnan(mat))
    ximp = mat.copy()
    for i in range(0, len(w[0])):
        n = w[0][i] # row where the nan is
        p = w[1][i] # column where the nan is
        col_isnan = mat_isnan[n, :] # empty columns in row n
        train = np.delete(mat, n, axis = 0) # remove row n to obtain a training set
        train_nonan = train[~np.apply_along_axis(np.any, 1, np.isnan(train)), :] # remove rows where there is a nan in the training set
        target = train_nonan[:, p] # vector to be predicted
        feature = train_nonan[:, ~col_isnan] # matrix of predictors
        learn = learner.fit(feature, target) # learner
        ximp[n, p] = learn.predict(mat[n, ~col_isnan].reshape(1, -1)) # predict and replace
    for iter in range(0, n_iter):
        for i in random.sample(range(0, len(w[0])), len(w[0])):
            n = w[0][i] # row where the nan is
            p = w[1][i] # column where the nan is
            train = np.delete(ximp, n, axis = 0) # remove row n to obtain a training set
            target = train[:, p] # vector to be predicted
            feature = np.delete(train, p, axis=1) # matrix of predictors
            learn = learner.fit(feature, target) # learner
            ximp[n, p] = learn.predict(np.delete(ximp[n,:], p).reshape(1, -1)) # predict and replace
    
    return ximp

# Impute with learner in the iris data set
iris = datasets.load_iris()
mat = iris.data.copy()

# throw some nans
mat[0,2] = np.NaN
mat[0,3] = np.NaN
mat[1,3] = np.NaN
mat[11,1] = np.NaN
mat = mat[range(30), :]

# impute
impute(mat=mat, learner=neighbors.KNeighborsRegressor(n_neighbors=3), n_iter=10)

@jnothman
Member

I think the point of KNN imputation is that it can be done relatively efficiently, even incorporating samples that have NaNs in them. Your solution is more generic, though one could also consider strategies that incorporate all training data (not just the samples with no missing values) by using a default imputation strategy over it, or indeed iteratively updating the estimate, as in the strategies implemented at https://github.com/hammerlab/fancyimpute/.
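
For reference, this is roughly how it eventually landed: current scikit-learn (0.22+) ships sklearn.impute.KNNImputer, whose neighbour search uses a NaN-aware Euclidean distance computed over the coordinates both rows have observed, so incomplete rows can still act as donors. A minimal usage example:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Distances ignore coordinates that are missing in either row, so the third
# row (which itself has a NaN) can still serve as a neighbour.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))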

@wqp89324

@essicolo Hello, I think there may be a tiny issue in your code, for instance, for the array below:

x11, x12, na, na
x21, x22, x23, na
x31, x32, x33, x34

When filling the NaN in the first row, third column, the second row should not be removed, since we only use the first two columns as features and the third column as the target, so the NaN in the second row, fourth column does not matter.

@essicolo

essicolo commented Mar 10, 2017

@jnothman , @wqp89324

Good comments. I added an iteration loop (updated in the previous comment). In the first loop, the imputation is done with the initial information. In the second, iterated loop, only the row and column of the NaN are discarded; this step is performed in a random order.

However, in the example I wrote, the number of iterations doesn't affect the outcome. I can't figure out why.

@ashimb9
Contributor

ashimb9 commented Jun 24, 2017

Hi all -- as you might have noticed, I recently added a pull request that implements a kNN imputation algorithm. I would very much welcome and appreciate any feedback/suggestions/criticism you might have regarding the code. Also, it would be super if anybody wanted to join in, as the development is not yet complete.
With regard to the progress, I am happy to report that the implementation works fine in cases where the same data matrix is passed to fit() and transform(). And by "works fine", I mean in comparison with the R package Impute that this implementation is inspired by. Some other things remain, and I hope to complete them in the coming days. Anyway, I look forward to hearing back from the community. Thanks! :)

@gwerbin

gwerbin commented Jul 6, 2017

You might also want to consider how the R package Caret does things: https://github.com/topepo/caret/blob/master/pkg/caret/R/preProcess.R#L738-L754

Caret internally uses the function RANN::nn2. I'm not sure if there's a comparable kNN implementation in Python.
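
For what it's worth, the closest counterpart in Python is probably scikit-learn's own neighbour search, which offers the same kind of tree-backed queries as RANN::nn2:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(100, 4)
nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
distances, indices = nn.kneighbors(X[:3])  # 5 nearest rows for the first 3 samples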

@ashimb9
Contributor

ashimb9 commented Jul 7, 2017

I'll definitely take a look. Thanks!
