imputation by knn #2989


Closed
Tracked by #36
chitcode opened this issue Mar 22, 2014 · 33 comments · Fixed by #12852

Comments

@chitcode

Adding a new strategy='knn' to the sklearn.preprocessing.Imputer class for imputing missing values using the kNN method.

@larsmans
Member

How would that work?

@jnothman
Member

I assume it's seeking something like http://www.mathworks.com.au/help/bioinfo/ref/knnimpute.html:

knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.

Only I think we're talking about rows, not columns, in our data.
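
For illustration, a minimal sketch of that idea applied to rows (knn_impute_value is a hypothetical helper, not an existing scikit-learn function): the missing entry is replaced by an inverse-distance-weighted mean over the k nearest complete rows, with distances computed only on the columns observed in the incomplete row.

import numpy as np

def knn_impute_value(X, row, col, k=3):
    # Columns observed in the incomplete row (excludes `col`, which is NaN there).
    observed = ~np.isnan(X[row])
    # Candidate neighbours: rows with no missing values at all.
    complete = np.flatnonzero(~np.isnan(X).any(axis=1))
    # Euclidean distance restricted to the observed columns.
    dists = np.linalg.norm(X[complete][:, observed] - X[row, observed], axis=1)
    order = np.argsort(dists)[:k]
    weights = 1.0 / (dists[order] + 1e-8)  # inverse-distance weights
    return np.average(X[complete[order], col], weights=weights)

# Example: fill the missing third value of the first row from its 2 nearest rows.
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.1],
              [0.9, 1.8, 2.9],
              [6.0, 7.0, 8.0]])
X[0, 2] = knn_impute_value(X, row=0, col=2, k=2)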

@sskarkhanis

When will the 'knn' imputation be incorporated?

@chitcode
Author

This is an extract from the book "Applied Predictive Modeling" by Max Kuhn and Kjell Johnson:

One popular technique for imputation is a K-nearest neighbor model. A new sample is imputed by finding the samples in the training set “closest” to it and averages these nearby points to fill in the value. Troyanskaya et al. (2001) examine this approach for high-dimensional data with small sample sizes. One advantage of this approach is that the imputed data are confined to be within the range of the training set values. One disadvantage is that the entire training set is required every time a missing value needs to be imputed. Also, the number of neighbors is a tuning parameter, as is the method for determining “closeness” of two points. However, Troyanskaya et al. (2001) found the nearest neighbor approach to be fairly robust to the tuning parameters, as well as the amount of missing data.

@sskarkhanis

Hello Chitcode,

Thanks for your reply. I'm aware of the Max Kuhn book and the reference, but I'm not sure how it is relevant to the implementation of a 'knn' strategy in sklearn.preprocessing.Imputer.

I just wanted to know if and when this will be incorporated. I'm a newbie to Python and sklearn and have been a bit frustrated with not finding alternatives beyond mean/median imputing.

I found this link for imputing in Python via MICE, but it doesn't seem to be complete yet:
http://gsocfrankcheng.blogspot.ca/
What alternatives would you suggest for imputing beyond mean/median? I'm new to Python, so I'm not sure I could, or would want to, write a full-fledged kNN/random forest imputation myself.
Cheers

@ogrisel
Member

ogrisel commented Feb 27, 2015

> Thanks for your reply. I'm aware of the Max Kuhn book and the reference, but I'm not sure how it is relevant to the implementation of a 'knn' strategy in sklearn.preprocessing.Imputer.

Well, it sounds like a good reference for someone interested in implementing this.

> I just wanted to know if and when this will be incorporated.

As far as I know nobody is working on an implementation at the moment.

> What alternatives would you suggest for imputing beyond mean/median?

You could train alternative models such as Random Forests to predict the missing values instead of KNN.

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 27, 2015 via email

@sskarkhanis

Thanks for the replies. I'm not yet good enough at Python programming to write the kNN/RF method myself.
I tried to call R packages from Python with rpy2 to get the imputation done, but couldn't get it to work yet.
I use many of the packages listed here when I do ML in R:
Link: http://www.stefvanbuuren.nl/mi/Software.html
I hope there will be similar techniques for imputing in Python in the future.

@ogrisel: I have seen your YouTube tutorial on scikit-learn. Do you always use or recommend the mean/median approach, or do you write custom imputation functions yourself each time?

@ogrisel
Member

ogrisel commented Feb 27, 2015

@skdatascientist please do not hijack this GitHub issue to ask for assistance or recommendations. This is not a discussion group but an issue tracker for a software project. Now, to end this side-discussion: I am no data scientist myself, so I cannot recommend anything. If I had to deal with missing data in my own projects, I would probably implement both KNN- and RF-based imputation, measure the cross-validation score of my final model with both strategies, and compare against a median-imputation baseline.
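
As a sketch of that comparison, using the modern scikit-learn API (SimpleImputer and the impute module postdate this comment and are assumed here), one could score the same pipeline under different imputation strategies:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

for strategy in ("mean", "median"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy),
                         RandomForestClassifier(random_state=0))
    print(strategy, cross_val_score(pipe, X, y, cv=5).mean())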

@sskarkhanis

My apologies, my intention wasn't to 'hijack'. Thanks for your replies.

@amueller
Member

amueller commented Jun 8, 2015

I think kNN imputation is a common method, and I think it is worth implementing it directly rather than through a general estimator. If we used a general estimator, you would have to train a separate model for each missing value, right? That seems very expensive with a RandomForestRegressor. With KNN, I could build the data structure once and reuse it over and over again.
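
The "once-built data structure" point can be sketched with sklearn.neighbors: the index over the complete rows is built a single time and then queried for every missing entry (only an illustration, assuming all missing values sit in the same column):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [0.9, 1.9, 2.9],
              [5.0, 6.0, np.nan],
              [5.1, 6.1, 7.1]])
missing_rows = np.flatnonzero(np.isnan(X[:, 2]))
complete = X[~np.isnan(X).any(axis=1)]

# Build the neighbour index once over the observed columns...
nn = NearestNeighbors(n_neighbors=2).fit(complete[:, :2])

# ...then reuse it for every row that needs imputation.
for r in missing_rows:
    _, idx = nn.kneighbors(X[r, :2].reshape(1, -1))
    X[r, 2] = complete[idx[0], 2].mean()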

@amueller
Member

amueller commented Jun 8, 2015

@tw991 was interested in working on this, and I didn't want to set him up for something that we won't include.

@GaelVaroquaux
Member

+1000. Quite clearly very useful and standard.


@amueller
Member

amueller commented Jun 8, 2015

I just noticed that the current imputer is ignorant of classes. Should it be? y should be optional and used if given, right? I'm not sure how important that is for kNN, but for mean/median strategies it seems fairly important.

@GaelVaroquaux
Member

It does seem to me that it should not be. That said, I can see situations where one might want that (rare classes, regression settings).

How about an option to control that, and maybe a deprecation cycle for the default value?


@jnothman
Member

jnothman commented Jun 9, 2015

I don't understand how one uses classes in e.g. mean imputation. Sure one can model a different mean for each class, but how does that apply at test time?


@GaelVaroquaux
Member

> I don't understand how one uses classes in e.g. mean imputation. Sure one can model a different mean for each class, but how does that apply at test time?

Good point! That's another situation where it might be useful to break the equivalence between fit_transform and fit + transform (we were discussing with @amueller that for data resampling / downsampling / upsampling such a choice might be useful).
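
To make that concrete, a hypothetical sketch of a class-aware mean imputer (ClassMeanImputer is not an existing scikit-learn class): fit_transform can use per-class means because y is at hand, while transform on unseen data has to fall back to the global means, which is exactly where the fit_transform / fit + transform equivalence breaks.

import numpy as np

class ClassMeanImputer:
    """Hypothetical sketch: per-class means when y is available, global means otherwise."""

    def fit(self, X, y=None):
        self.global_means_ = np.nanmean(X, axis=0)
        if y is not None:
            self.class_means_ = {c: np.nanmean(X[y == c], axis=0) for c in np.unique(y)}
        return self

    def transform(self, X):
        # At prediction time y is unknown, so only the global means can be used.
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.global_means_[cols]
        return X

    def fit_transform(self, X, y=None):
        # With y in hand, each missing value is filled with the mean of its own class.
        self.fit(X, y)
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        for r, c in zip(rows, cols):
            means = self.class_means_[y[r]] if y is not None else self.global_means_
            X[r, c] = means[c]
        return X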

@jnothman
Member

jnothman commented Jun 9, 2015

Is it appropriate to use a different imputation strategy at train and test time?


@GaelVaroquaux
Member

I don't see why not. It does not break independence between train and test. If I were to try to solve a real world problem, I don't think that I would mind.


@imanojkumar

You must read the article about kNN imputation: "Missing value estimation methods for DNA microarrays", Olga Troyanskaya et al., Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.

@movelikeriver

Could this share some of the library code from http://scikit-learn.org/stable/modules/neighbors.html? It would also need to support discrete data for the distance calculation.

@jachymb

jachymb commented Feb 29, 2016

+1

@brucechou1983

+1

@bobcolner

+1, also random forest imputation would be great. https://cran.r-project.org/web/packages/missForest/missForest.pdf

@ankitagarwal

ankitagarwal commented Nov 20, 2016

+1, I would be happy to submit a patch if someone can point me to the how-tos and coding guidelines for sklearn.

@jnothman
Member

See http://scikit-learn.org/stable/developers/contributing.html but also note that there's something somewhat in the works at #4844. Maybe you could volunteer to help or take that over.

@essicolo

essicolo commented Jan 24, 2017

[UPDATED after comments 1, 2] How about something like this?

import numpy as np
import random
from sklearn import datasets
from sklearn import neighbors

def impute(mat, learner, n_iter=3):
    mat = np.array(mat)
    mat_isnan = np.isnan(mat)        
    w = np.where(np.isnan(mat))
    ximp = mat.copy()
    for i in range(0, len(w[0])):
        n = w[0][i] # row where the nan is
        p = w[1][i] # column where the nan is
        col_isnan = mat_isnan[n, :] # empty columns in row n
        train = np.delete(mat, n, axis = 0) # remove row n to obtain a training set
        train_nonan = train[~np.apply_along_axis(np.any, 1, np.isnan(train)), :] # remove rows where there is a nan in the training set
        target = train_nonan[:, p] # vector to be predicted
        feature = train_nonan[:, ~col_isnan] # matrix of predictors
        learn = learner.fit(feature, target) # learner
        ximp[n, p] = learn.predict(mat[n, ~col_isnan].reshape(1, -1)) # predict and replace
    for iter in range(0, n_iter):
        for i in random.sample(range(0, len(w[0])), len(w[0])):
            n = w[0][i] # row where the nan is
            p = w[1][i] # column where the nan is
            train = np.delete(ximp, n, axis = 0) # remove row n to obtain a training set
            target = train[:, p] # vector to be predicted
            feature = np.delete(train, p, axis=1) # matrix of predictors
            learn = learner.fit(feature, target) # learner
            ximp[n, p] = learn.predict(np.delete(ximp[n,:], p).reshape(1, -1)) # predict and replace
    
    return ximp

# Impute with learner in the iris data set
iris = datasets.load_iris()
mat = iris.data.copy()

# throw some nans
mat[0,2] = np.NaN
mat[0,3] = np.NaN
mat[1,3] = np.NaN
mat[11,1] = np.NaN
mat = mat[range(30), :]

# impute
impute(mat=mat, learner=neighbors.KNeighborsRegressor(n_neighbors=3), n_iter=10)

@jnothman
Member

I think the point of KNN imputation is that it can be done relatively efficiently, even incorporating samples that have NaNs in them. Your solution is more generic, though one could also consider strategies that incorporate all training data (not just the samples with no missing values) by using a default imputation strategy over it, or indeed iteratively updating the estimate, as in the strategies implemented at https://github.com/hammerlab/fancyimpute/.
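
For reference, this is roughly how it eventually landed: current scikit-learn (0.22+) ships sklearn.impute.KNNImputer, whose neighbour search uses a NaN-aware Euclidean distance computed over the coordinates both rows have observed, so incomplete rows can still act as donors. A minimal usage example:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Distances ignore coordinates that are missing in either row, so the third
# row (which itself has a NaN) can still serve as a neighbour.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))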

@wqp89324

@essicolo Hello, I think there may be a tiny issue in your code, for instance, for the array below:

x11, x12, na, na
x21, x22, x23, na
x31, x32, x33, x34

When filling the NaN in the first row, third column, the second row should not be removed, since we only use the first two columns as features and the third column as the target, so the NaN in the second row, fourth column does not matter.

@essicolo

essicolo commented Mar 10, 2017

@jnothman , @wqp89324

Good comments. I added an iteration loop (updated in the previous comment). In the first loop, the imputation is done with the initial information. In the second, iterated loop, only the row and column of the NaN are discarded; this step is performed in a random order.

However, in the example I wrote, the number of iterations doesn't affect the outcome. I can't figure out why.

@ashimb9
Contributor

ashimb9 commented Jun 24, 2017

Hi all -- as you might have noticed, I recently added a pull request that implements a kNN imputation algorithm. I would very much welcome and appreciate any feedback/suggestions/criticism you might have regarding the code. Also, it would be super if anybody wanted to join in, as the development is not yet complete.
With regard to the progress, I am happy to report that the implementation works fine in cases where the same data matrix is passed to fit() and transform(). And by "works fine", I mean in comparison with the R package Impute that this implementation is inspired by. Some other things remain, and I hope to complete them in the coming days. Anyway, I look forward to hearing back from the community. Thanks! :)

@gwerbin

gwerbin commented Jul 6, 2017

You might also want to consider how the R package Caret does things: https://github.com/topepo/caret/blob/master/pkg/caret/R/preProcess.R#L738-L754

Caret internally uses the function RANN::nn2. I'm not sure if there's a comparable kNN implementation in Python.
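
For what it's worth, the closest counterpart in Python is probably scikit-learn's own neighbour search, which offers the same kind of tree-backed queries as RANN::nn2:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(100, 4)
nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
distances, indices = nn.kneighbors(X[:3])  # 5 nearest rows for the first 3 samples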

@ashimb9
Contributor

ashimb9 commented Jul 7, 2017

I'll definitely take a look. Thanks!
