[MRG] Added k-Nearest Neighbor imputation for missing data #9212
Closed
106 commits
287eb1e Added k-Nearest Neighbor imputation of missing data (ashimb9)
cd6d3a2 Fixed issue with passing seperate matrices in fit() and transform() (ashimb9)
3fc9596 Retreived fitted data with self.statistics_ rather than passing it as… (ashimb9)
d707dcd Modified metrics to enable euclidean distance calculation with missin… (ashimb9)
b4b5ae9 Changes to ensure Python 2.x compatibility (ashimb9)
04ed4a0 Fixed pep8 issues (ashimb9)
a6d8ef6 Addressed comments from review (ashimb9)
e4f8612 Docstring example issues (ashimb9)
daf247f Formatting fixes on docstring (ashimb9)
10f5adb And yet more fixes (ashimb9)
22cf9ef Addressed review comments (Part 2) (ashimb9)
2482c8a Changed nan-mask from int8 to int32 (ashimb9)
66527cd Addressed review comments (#3) (ashimb9)
a968b1e Pep8 fix (ashimb9)
356c8e8 Comment edit on test_pairwise (ashimb9)
d6aeaf3 Addressed review comments #4 (ashimb9)
e8ccdee replaced or with in (ashimb9)
4a8309b Changed allow_nans assignment (ashimb9)
5cbc156 One more or to in (ashimb9)
a31c43a Addressed review comments #5 (ashimb9)
eacb19d Edited comments (ashimb9)
d4049e2 Merge branch 'naneuclid' into knnimpute (ashimb9)
cfb7c97 KNN Imputation with masked_euclidean and sklearn.neighbors (ashimb9)
aa8547a fixed array base check (ashimb9)
009efa9 Fix column mean to nanmean (ashimb9)
70f294a Added weight support and cleaned the code (ashimb9)
a54c162 Added inf check (ashimb9)
c412e3b Changed error message (ashimb9)
ffe6774 Added test suite and example. Expanded docstring description (ashimb9)
c2d6a6c Changes to preprocessing __init__ (ashimb9)
9a19677 Added KNNImputer exception for NaN and inf in estimator_checks (ashimb9)
a6a0a2f Moved _check_weights() to fit() (ashimb9)
4fbbe40 Addressed review comments - 1 (ashimb9)
29bdccb Make NearestNeighbor import local to fit (ashimb9)
6bb5471 Updated doc/modules/preprocessing.rst (ashimb9)
e393cb0 More circular import fixes (ashimb9)
6e5ec30 pep8 fixes (ashimb9)
dd027f9 Minor comment updates (ashimb9)
f33bff4 Addressed review comments (part 2) (ashimb9)
2e1ea48 Fixed pyflex issues (ashimb9)
1098499 Added test for callable weights and updated comments. (ashimb9)
a698120 Pep8 fixes (ashimb9)
95e0f56 Comment, doc, and pep8 fixes (ashimb9)
215c8c9 Docstring changes (ashimb9)
fab313b Changes to unit tests as per review comments (ashimb9)
b2d5640 Tests moved to test_imputation (ashimb9)
cd90614 Addressed review comments (ashimb9)
2c9993a test changes (ashimb9)
473b191 Test changes part 2 (ashimb9)
de587b3 Fixed weight matrix shape issue (ashimb9)
3d58616 Minor changes (ashimb9)
5873d17 Fixed degenerate donor issue. Added tests (ashimb9)
fd11002 Further test updates (ashimb9)
2f41aa2 minor test fix (ashimb9)
135056c more minor changes (ashimb9)
8c7190e Moved weight_matrix inside if-weighted block (ashimb9)
9616c2b Addressed Review Comments (ashimb9)
7e8f900 Fixed plot_missing example (ashimb9)
df9dba7 Fixed Error Msg (ashimb9)
d26724a Modified missing check for sparse matrix (ashimb9)
2b327da Test update (ashimb9)
1704672 Fixed nan check on sparse (ashimb9)
a1cc41d Review Comments Addressed (partial) (ashimb9)
1417f3e Fix merge conflit (ashimb9)
34f68a5 Updated doc module (ashimb9)
508270c Added support for using only neighbors with non-missing features (ashimb9)
0562054 Test update (ashimb9)
24943ec Import Numpy code for np.unique for older versions (ashimb9)
a449c5b Remove version check (ashimb9)
a485db9 Minor fix (ashimb9)
6058548 Added strategy to only use neighbors with non-nan value (ashimb9)
1abbce8 Sync with upstream and merge with master (ashimb9)
0b67233 Edit import path in test file (ashimb9)
3e08209 Error fixes with imports and examples (ashimb9)
851ab3c Added use_complete docstring (ashimb9)
7a0647f Changed comments and fixed docstring (ashimb9)
b17906f Added more doctest fix and min neighbor check (ashimb9)
bd6eb69 fix docs (ashimb9)
2ea131b Increase col_max_missing threshold for example plot (ashimb9)
b1d9397 Lower missing rate in demo since tests are failing (ashimb9)
d7cbdfb Remove redundant check and changes in plot (ashimb9)
1c9d858 Handling insufficient neighbors scenario (ashimb9)
01722f1 Removed k actual neighbors algo (ashimb9)
36d1d72 Addressed Comments (ashimb9)
95f15ff Merge branch 'master' into knnimpute (ashimb9)
8e82d0d Sync with upstream and merge (ashimb9)
f463b15 Sync and merge (ashimb9)
8a16e28 Minor bug fixes (ashimb9)
a93827c Removing flotsam (ashimb9)
5de5b60 Minor bug fixes (ashimb9)
eddf18f Merge to upstream (ashimb9)
2058186 Revert changes to sklearn/neighbors (jnothman)
69f2b7f Merge branch 'master' into knnimpute (jnothman)
202cd37 Revert changes to deprecated file (jnothman)
6414081 COSMIT _MASKED_METRICS -> _NAN_METRICS (jnothman)
2825fcc 'NaN' no longer stands for NaN (jnothman)
745fa2d Fix missing_values validation (jnothman)
44f0210 Attempt to reinstate neighbors changes (jnothman)
82d5d20 Fix up test failures (jnothman)
d8b23e6 Fix flake8 issues in example (jnothman)
c682361 Default force_all_finite to True rather than False (jnothman)
1912611 Fix example usage (jnothman)
607ff7f Fix masked_euclidean testing in nearest neighbors (jnothman)
87677e7 Fix missing_values in masked_euclidean_distances (jnothman)
39e1da8 Can't subtract list and set in Py2 (jnothman)
e1afa12 Merge branch 'master' into knnimpute (jnothman)
@@ -16,6 +16,14 @@ values. However, this comes at the price of losing data which may be valuable
i.e., to infer them from the known part of the data. See the :ref:`glossary`
entry on imputation.

Imputer transformers can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See
:ref:`sphx_glr_auto_examples_plot_missing_values.py`.


Simple univariate imputation
============================

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
the statistics (mean, median or most frequent) of each column in which the

@@ -75,6 +83,48 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.

.. _knnimpute:

Nearest neighbors imputation
===============================

The :class:`KNNImputer` class provides imputation for completing missing
values using the k-Nearest Neighbors approach. Each sample's missing values
are imputed using values from ``n_neighbors`` nearest neighbors found in the
training set. Note that if a sample has more than one feature missing, then
the sample can potentially have multiple sets of ``n_neighbors``
donors depending on the particular feature being imputed.

Each missing feature is then imputed as the average, either weighted or
unweighted, of these neighbors. Where the number of donor neighbors is less
than ``n_neighbors``, the training set average for that feature is used

[Review comment] I find this sentence a bit cryptic. Maybe define donors somewhere?

for imputation. The total number of samples in the training set is, of course,
always greater than or equal to the number of nearest neighbors available for
imputation, depending on both the overall sample size as well as the number of
samples excluded from nearest neighbor calculation because of too many missing
features (as controlled by ``row_max_missing``).
For more information on the methodology, see ref. [#]_.

The following snippet demonstrates how to replace missing values,
encoded as ``np.nan``, using the mean feature value of the two nearest
neighbors of the rows that contain the missing values::

    >>> import numpy as np
    >>> from sklearn.impute import KNNImputer
    >>> nan = np.nan
    >>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
    >>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
    >>> imputer.fit_transform(X)
    array([[1. , 2. , 4. ],
           [3. , 4. , 3. ],
           [5.5, 6. , 5. ],
           [8. , 8. , 7. ]])

.. [#] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor
    Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value
    estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001
    Pages 520-525.

.. _missing_indicator:

Marking imputed values
[Review comment] This sentence is a bit cryptic to me. Maybe say "If multiple features are missing, the sets of neighbors used for imputation can be different" or something like that?
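
To make the "donors" wording concrete, here is a minimal pure-Python sketch of the behaviour the documentation describes. It is not the implementation in this PR: the names `nan_euclidean` and `knn_impute` are hypothetical, only uniform weights are covered, `row_max_missing` is ignored, and the distance rescaling (total coordinates divided by coordinates present in both rows) is an assumption about how the masked Euclidean metric treats missing entries. For a given missing feature, a donor is any row in which that feature is observed, which is why two missing features in the same row can end up with different neighbor sets.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both rows.

    The squared distance is rescaled by (total coords / shared coords);
    this scaling is an assumption of the sketch. Rows that share no
    observed coordinate are treated as infinitely far apart.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(len(a) / len(shared) * squared)


def knn_impute(X, n_neighbors=2):
    """Impute NaNs with the unweighted mean of the nearest donors."""
    n_features = len(X[0])
    # Column means over observed values, used as the fallback.
    col_means = []
    for j in range(n_features):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        col_means.append(sum(observed) / len(observed) if observed
                         else math.nan)

    imputed = [list(row) for row in X]
    for i, row in enumerate(X):
        for j in range(n_features):
            if not math.isnan(row[j]):
                continue
            # Donors for feature j: rows where feature j is observed
            # and that are at a finite distance from the current row.
            donors = [(nan_euclidean(row, other), other[j])
                      for other in X if not math.isnan(other[j])]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            if len(donors) < n_neighbors:
                # Too few donors: fall back to the training-set mean.
                imputed[i][j] = col_means[j]
            else:
                nearest = donors[:n_neighbors]
                imputed[i][j] = sum(v for _, v in nearest) / n_neighbors
    return imputed


nan = float("nan")
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
imputed = knn_impute(X, n_neighbors=2)
```

On the documentation's example the sketch gives 4.0 for the first row (donors are rows 2 and 3) and 5.5 for the third row (donors are rows 2 and 4), matching the doctest output. The donor sets differ because each is chosen among the rows that have the feature being imputed.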