
[MRG] Added k-Nearest Neighbor imputation for missing data #9212


Closed
wants to merge 106 commits into from
Commits (106)
287eb1e
Added k-Nearest Neighbor imputation of missing data
ashimb9 Jun 24, 2017
cd6d3a2
Fixed issue with passing separate matrices in fit() and transform()
ashimb9 Jun 25, 2017
3fc9596
Retrieved fitted data with self.statistics_ rather than passing it as…
ashimb9 Jun 25, 2017
d707dcd
Modified metrics to enable euclidean distance calculation with missin…
ashimb9 Jul 13, 2017
b4b5ae9
Changes to ensure Python 2.x compatibility
ashimb9 Jul 18, 2017
04ed4a0
Fixed pep8 issues
ashimb9 Jul 18, 2017
a6d8ef6
Addressed comments from review
ashimb9 Jul 19, 2017
e4f8612
Docstring example issues
ashimb9 Jul 19, 2017
daf247f
Formatting fixes on docstring
ashimb9 Jul 19, 2017
10f5adb
And yet more fixes
ashimb9 Jul 19, 2017
22cf9ef
Addressed review comments (Part 2)
ashimb9 Jul 23, 2017
2482c8a
Changed nan-mask from int8 to int32
ashimb9 Jul 23, 2017
66527cd
Addressed review comments (#3)
ashimb9 Jul 24, 2017
a968b1e
Pep8 fix
ashimb9 Jul 24, 2017
356c8e8
Comment edit on test_pairwise
ashimb9 Jul 24, 2017
d6aeaf3
Addressed review comments #4
ashimb9 Jul 25, 2017
e8ccdee
replaced or with in
ashimb9 Jul 25, 2017
4a8309b
Changed allow_nans assignment
ashimb9 Jul 25, 2017
5cbc156
One more or to in
ashimb9 Jul 25, 2017
a31c43a
Addressed review comments #5
ashimb9 Jul 31, 2017
eacb19d
Edited comments
ashimb9 Jul 31, 2017
d4049e2
Merge branch 'naneuclid' into knnimpute
ashimb9 Jul 31, 2017
cfb7c97
KNN Imputation with masked_euclidean and sklearn.neighbors
ashimb9 Aug 3, 2017
aa8547a
fixed array base check
ashimb9 Aug 3, 2017
009efa9
Fix column mean to nanmean
ashimb9 Aug 3, 2017
70f294a
Added weight support and cleaned the code
ashimb9 Aug 6, 2017
a54c162
Added inf check
ashimb9 Aug 6, 2017
c412e3b
Changed error message
ashimb9 Aug 6, 2017
ffe6774
Added test suite and example. Expanded docstring description
ashimb9 Aug 8, 2017
c2d6a6c
Changes to preprocessing __init__
ashimb9 Aug 8, 2017
9a19677
Added KNNImputer exception for NaN and inf in estimator_checks
ashimb9 Aug 8, 2017
a6a0a2f
Moved _check_weights() to fit()
ashimb9 Aug 9, 2017
4fbbe40
Addressed review comments - 1
ashimb9 Aug 18, 2017
29bdccb
Make NearestNeighbor import local to fit
ashimb9 Aug 18, 2017
6bb5471
Updated doc/modules/preprocessing.rst
ashimb9 Aug 18, 2017
e393cb0
More circular import fixes
ashimb9 Aug 18, 2017
6e5ec30
pep8 fixes
ashimb9 Aug 18, 2017
dd027f9
Minor comment updates
ashimb9 Aug 18, 2017
f33bff4
Addressed review comments (part 2)
ashimb9 Aug 20, 2017
2e1ea48
Fixed pyflakes issues
ashimb9 Aug 20, 2017
1098499
Added test for callable weights and updated comments.
ashimb9 Sep 3, 2017
a698120
Pep8 fixes
ashimb9 Sep 3, 2017
95e0f56
Comment, doc, and pep8 fixes
ashimb9 Sep 15, 2017
215c8c9
Docstring changes
ashimb9 Sep 15, 2017
fab313b
Changes to unit tests as per review comments
ashimb9 Sep 15, 2017
b2d5640
Tests moved to test_imputation
ashimb9 Sep 15, 2017
cd90614
Addressed review comments
ashimb9 Sep 19, 2017
2c9993a
test changes
ashimb9 Sep 19, 2017
473b191
Test changes part 2
ashimb9 Sep 19, 2017
de587b3
Fixed weight matrix shape issue
ashimb9 Sep 21, 2017
3d58616
Minor changes
ashimb9 Sep 21, 2017
5873d17
Fixed degenerate donor issue. Added tests
ashimb9 Sep 22, 2017
fd11002
Further test updates
ashimb9 Sep 22, 2017
2f41aa2
minor test fix
ashimb9 Sep 23, 2017
135056c
more minor changes
ashimb9 Sep 24, 2017
8c7190e
Moved weight_matrix inside if-weighted block
ashimb9 Sep 24, 2017
9616c2b
Addressed Review Comments
ashimb9 Dec 12, 2017
7e8f900
Fixed plot_missing example
ashimb9 Dec 12, 2017
df9dba7
Fixed Error Msg
ashimb9 Dec 12, 2017
d26724a
Modified missing check for sparse matrix
ashimb9 Dec 12, 2017
2b327da
Test update
ashimb9 Dec 12, 2017
1704672
Fixed nan check on sparse
ashimb9 Dec 17, 2017
a1cc41d
Review Comments Addressed (partial)
ashimb9 Dec 17, 2017
1417f3e
Fix merge conflict
ashimb9 Dec 19, 2017
34f68a5
Updated doc module
ashimb9 Dec 19, 2017
508270c
Added support for using only neighbors with non-missing features
ashimb9 Jan 26, 2018
0562054
Test update
ashimb9 Jan 26, 2018
24943ec
Import Numpy code for np.unique for older versions
ashimb9 Jan 26, 2018
a449c5b
Remove version check
ashimb9 Jan 26, 2018
a485db9
Minor fix
ashimb9 Jan 26, 2018
6058548
Added strategy to only use neighbors with non-nan value
ashimb9 Mar 28, 2018
1abbce8
Sync with upstream and merge with master
ashimb9 Mar 31, 2018
0b67233
Edit import path in test file
ashimb9 Mar 31, 2018
3e08209
Error fixes with imports and examples
ashimb9 Mar 31, 2018
851ab3c
Added use_complete docstring
ashimb9 Mar 31, 2018
7a0647f
Changed comments and fixed docstring
ashimb9 Mar 31, 2018
b17906f
Added more doctest fix and min neighbor check
ashimb9 Mar 31, 2018
bd6eb69
fix docs
ashimb9 Mar 31, 2018
2ea131b
Increase col_max_missing threshold for example plot
ashimb9 Mar 31, 2018
b1d9397
Lower missing rate in demo since tests are failing
ashimb9 Mar 31, 2018
d7cbdfb
Remove redundant check and changes in plot
ashimb9 Mar 31, 2018
1c9d858
Handling insufficient neighbors scenario
ashimb9 Mar 31, 2018
01722f1
Removed k actual neighbors algo
ashimb9 Apr 7, 2018
36d1d72
Addressed Comments
ashimb9 Apr 22, 2018
95f15ff
Merge branch 'master' into knnimpute
ashimb9 Apr 22, 2018
8e82d0d
Sync with upstream and merge
ashimb9 Apr 22, 2018
f463b15
Sync and merge
ashimb9 Apr 22, 2018
8a16e28
Minor bug fixes
ashimb9 Apr 28, 2018
a93827c
Removing flotsam
ashimb9 Apr 28, 2018
5de5b60
Minor bug fixes
ashimb9 Apr 29, 2018
eddf18f
Merge to upstream
ashimb9 May 26, 2018
2058186
Revert changes to sklearn/neighbors
jnothman Sep 30, 2018
69f2b7f
Merge branch 'master' into knnimpute
jnothman Sep 30, 2018
202cd37
Revert changes to deprecated file
jnothman Sep 30, 2018
6414081
COSMIT _MASKED_METRICS -> _NAN_METRICS
jnothman Sep 30, 2018
2825fcc
'NaN' no longer stands for NaN
jnothman Sep 30, 2018
745fa2d
Fix missing_values validation
jnothman Oct 3, 2018
44f0210
Attempt to reinstate neighbors changes
jnothman Oct 3, 2018
82d5d20
Fix up test failures
jnothman Oct 3, 2018
d8b23e6
Fix flake8 issues in example
jnothman Oct 3, 2018
c682361
Default force_all_finite to True rather than False
jnothman Oct 4, 2018
1912611
Fix example usage
jnothman Oct 4, 2018
607ff7f
Fix masked_euclidean testing in nearest neighbors
jnothman Oct 4, 2018
87677e7
Fix missing_values in masked_euclidean_distances
jnothman Oct 4, 2018
39e1da8
Can't subtract list and set in Py2
jnothman Oct 4, 2018
e1afa12
Merge branch 'master' into knnimpute
jnothman Jan 17, 2019
50 changes: 50 additions & 0 deletions doc/modules/impute.rst
@@ -16,6 +16,14 @@
values. However, this comes at the price of losing data which may be valuable
i.e., to infer them from the known part of the data. See the :ref:`glossary`
entry on imputation.

Imputer transformers can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See
:ref:`sphx_glr_auto_examples_plot_missing_values.py`.


Simple univariate imputation
============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
the statistics (mean, median or most frequent) of each column in which the
@@ -75,6 +83,48 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
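As a sketch of that pipeline pattern (illustrative only — the downstream ``LinearRegression`` estimator and the toy data are arbitrary choices, not from the PR):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Column means are learned in fit() and reused on later transform() calls,
# so imputation statistics come only from the training data seen by fit().
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
X = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, np.nan], [3.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
pipe.fit(X, y)
print(pipe.predict(X).shape)  # (4,)
```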

.. _knnimpute:

Nearest neighbors imputation
============================

The :class:`KNNImputer` class provides imputation for completing missing
values using the k-Nearest Neighbors approach. Each sample's missing values
are imputed using values from ``n_neighbors`` nearest neighbors found in the
training set. Note that if a sample has more than one feature missing, then
Review comment (Member): This sentence is a bit cryptic to me. Maybe say "If multiple features are missing, then the sets of neighbors used for imputation can be different" or something like that?

the sets of ``n_neighbors`` donors used for imputation can differ across
the features being imputed.
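Neighbors here are found with a missingness-aware distance. A minimal sketch of such a metric (a hypothetical helper, assuming the ``n_features / n_observed`` rescaling used by masked/NaN-euclidean metrics):

```python
import numpy as np

def masked_euclidean(a, b):
    # Euclidean distance over coordinates observed in BOTH samples,
    # rescaled by n_features / n_observed so pairs with different
    # amounts of missingness remain comparable.
    valid = ~np.isnan(a) & ~np.isnan(b)
    sq = np.sum((a[valid] - b[valid]) ** 2)
    return np.sqrt(sq * a.shape[0] / valid.sum())

a = np.array([1.0, 2.0, np.nan])
b = np.array([3.0, 4.0, 3.0])
print(masked_euclidean(a, b))  # sqrt((4 + 4) * 3/2) ≈ 3.4641
```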

Each missing feature is then imputed with the average, either weighted or
unweighted, of the values that these neighbors (the *donors*) have for that
feature. When fewer than ``n_neighbors`` donors are available, the training
set average for that feature is used
Review comment (Member): I find this sentence a bit cryptic. Maybe define donors somewhere?

for imputation. The number of donors available for a given feature can be
smaller than ``n_neighbors``: it depends on the overall sample size and on
how many samples are excluded from the nearest neighbor calculation for
having too many missing features (as controlled by ``row_max_missing``).
For more information on the methodology, see ref. [#]_.

The following snippet demonstrates how to replace missing values,
encoded as ``np.nan``, using the mean feature value of the two nearest
neighbors of the rows that contain the missing values::

    >>> import numpy as np
    >>> from sklearn.impute import KNNImputer
    >>> nan = np.nan
    >>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
    >>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
    >>> imputer.fit_transform(X)
    array([[1. , 2. , 4. ],
           [3. , 4. , 3. ],
           [5.5, 6. , 5. ],
           [8. , 8. , 7. ]])
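For illustration, the result above can be reproduced with a plain NumPy sketch of the procedure (a simplified, hypothetical re-implementation — the actual estimator relies on the masked euclidean metric and ``sklearn.neighbors``):

```python
import numpy as np

def knn_impute_sketch(X, n_neighbors=2):
    """Uniform-weight kNN imputation (simplified sketch, not the estimator)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    mask = np.isnan(X)
    n_samples, n_features = X.shape
    for i in range(n_samples):
        if not mask[i].any():
            continue
        # NaN-aware euclidean distance to every other row, rescaled by
        # n_features / n_observed so partial overlaps stay comparable
        dists = np.full(n_samples, np.inf)
        for j in range(n_samples):
            if j == i:
                continue
            valid = ~mask[i] & ~mask[j]
            if valid.any():
                sq = np.sum((X[i, valid] - X[j, valid]) ** 2)
                dists[j] = np.sqrt(sq * n_features / valid.sum())
        order = np.argsort(dists)
        for col in np.where(mask[i])[0]:
            # Donors: nearest rows that actually observe this feature,
            # so the donor set may differ per imputed feature.
            donors = [j for j in order if not mask[j, col]][:n_neighbors]
            out[i, col] = X[donors, col].mean()
    return out

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
print(knn_impute_sketch(X))  # row 0 -> 4.0 and row 2 -> 5.5, as above
```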

.. [#] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor
Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value
estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001
Pages 520-525.

.. _missing_indicator:

Marking imputed values
1 change: 1 addition & 0 deletions doc/modules/preprocessing.rst
@@ -672,6 +672,7 @@ Imputation of missing values

Tools for imputing missing values are discussed at :ref:`impute`.


.. _polynomial_features:

Generating polynomial features
22 changes: 17 additions & 5 deletions examples/plot_missing_values.py
@@ -14,18 +14,20 @@
The median is a more robust estimator for data with high magnitude variables
which could dominate results (otherwise known as a 'long tail').

With ``KNNImputer``, missing values can be imputed using the weighted
or unweighted mean of the desired number of nearest neighbors.

In addition to using an imputing method, we can also keep track of which
values were missing using :func:`sklearn.impute.MissingIndicator`, since
missingness itself might carry some information.
"""
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
@@ -72,10 +74,18 @@ def get_results(dataset):
                                        scoring='neg_mean_squared_error',
                                        cv=5)

    # Estimate the score after kNN-imputation of the missing values
    knn_estimator = make_pipeline(
        KNNImputer(missing_values=0, col_max_missing=0.99),
        RandomForestRegressor(random_state=0, n_estimators=100))
    knn_impute_scores = cross_val_score(knn_estimator, X_missing, y_missing,
                                        scoring='neg_mean_squared_error',
                                        cv=5)

    return ((full_scores.mean(), full_scores.std()),
            (zero_impute_scores.mean(), zero_impute_scores.std()),
            (mean_impute_scores.mean(), mean_impute_scores.std()),
            (knn_impute_scores.mean(), knn_impute_scores.std()),
            )


results_diabetes = np.array(get_results(load_diabetes()))
@@ -91,8 +101,10 @@ def get_results(dataset):

x_labels = ['Full data',
            'Zero imputation',
            'Mean Imputation',
            'KNN Imputation',
            ]
colors = ['r', 'g', 'b', 'orange', 'black']

# plot diabetes results
plt.figure(figsize=(12, 6))