
[WIP] Allow NaN/Inf in univariate feature selectors #15434


Closed
Changes from all commits
43 commits
03eb769
removed check for nan/inf in SelectorMixin.transform, RFE.fit, and RFE…
Jul 19, 2018
03042a2
removed extra blank line
Jul 19, 2018
d6ffba7
added tests for changes to RFE/RFECV and SelectorMixin
Jul 19, 2018
da4c5f4
pep8 changes and pytest.mark.parametrize for nan_inf tests
Jan 8, 2019
d3601f4
added force_all_finite=False for univariate feature selection fit
Jan 8, 2019
27e1399
added tests for nan and inf being allowed in univariate select
Jan 8, 2019
2b06f2a
added test for nan and inf being allowed in SelectorMixin transform
Jan 8, 2019
fb7d7d6
updated checks to not run check_estimators_nan_inf on feature selectors
Jan 8, 2019
4d4662c
pep8 corrections
Jan 8, 2019
eadb804
created dummy_score function for use by all tests needing a fake scorer
Jan 8, 2019
da866eb
fixed travis test error with RFECV not having cv specified
Jan 8, 2019
9f2ff90
Merge branch 'master' into allow-nan-inf-in-feature-selection
Jan 8, 2019
f7c2fb8
allowed no NaN in pickle check for estimators in ALLOW_NAN
Jan 9, 2019
288cecc
added support for nan/inf in VarianceThreshold.fit, added tests for it
Jan 11, 2019
67fa9e7
added whats new documentation for feature selection allowing NaN/Inf
Jan 17, 2019
9be3870
added Notes for modified feature selectors about allowing NaN/Inf
Jan 17, 2019
c9e8fcc
fixed pep8 error
Jan 17, 2019
34e2710
shortened line length for whats new documentation
Jan 17, 2019
816b0a3
changed VarianceThreshold to only allow nan, not inf. Updated tests f…
Jan 23, 2019
c4619eb
fixed phrasing in whats_new for #11635
Jan 23, 2019
195ed13
added test for all-NaN feature in VarianceThreshold
Jan 24, 2019
87c771a
Merge branch 'master' into allow-nan-inf-in-feature-selection
adpeters Jan 30, 2019
a79155d
Merge branch 'master' into allow-nan-inf-in-feature-selection
adpeters Feb 5, 2019
620726a
Merge branch 'master' into allow-nan-inf-in-feature-selection
adpeters Feb 22, 2019
fe60c94
Merge branch 'master' into HEAD
jnothman Apr 24, 2019
cca7acc
Fix use of check_array
jnothman Apr 24, 2019
017a0db
Use check_X_y
jnothman Apr 24, 2019
0efc7f0
updated SelectFromModel allow_nan tag to inherit from its estimator
May 23, 2019
35878e8
Merge branch 'master' into allow-nan-inf-in-feature-selection
May 23, 2019
4119fcd
moved documentation for #11635 to v0.22
May 23, 2019
9f20985
fixed formatting
May 23, 2019
fc1a975
removed test_transform_accepts_nan_inf from test_from_model.py since …
May 23, 2019
f710408
updated test_transform_accepts_nan_inf to handle NaN tag
May 23, 2019
2b55934
fixed formatting
May 23, 2019
1adc65c
fixed formatting
May 23, 2019
055d4ea
set allow_nan tag to false for univariate feature selectors since the…
adpeters Oct 29, 2019
61d8277
merge with master
adpeters Oct 29, 2019
4499960
updated VarianceThreshold.fit() to handle nan when threshold is 0
adpeters Oct 30, 2019
92bc14f
removed test that no longer applies now that allow_nan tag is False f…
adpeters Oct 30, 2019
e1add7f
added pytest import since it was accidentally removed
adpeters Oct 30, 2019
4921954
removed duplicate numpy import
adpeters Oct 30, 2019
6783182
updated assert_equals to assert
adpeters Oct 30, 2019
4d26ec7
update docs, added FIXME where we need to figure out how to determine…
adpeters Nov 1, 2019
20 changes: 19 additions & 1 deletion doc/whats_new/v0.22.rst
@@ -69,7 +69,6 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.


:mod:`sklearn.base`
...................

@@ -311,6 +310,25 @@ Changelog
:mod:`sklearn.feature_selection`
................................

- |Enhancement| Updated the following :mod:`feature_selection` estimators to allow
NaN/Inf values in ``transform`` and ``fit``:
:class:`feature_selection.RFE`, :class:`feature_selection.RFECV`,
:class:`feature_selection.SelectFromModel`,
and :class:`feature_selection.VarianceThreshold` (which accepts NaN but
not Inf). Note that if the underlying estimator of the feature selector
does not allow NaN/Inf, fitting will still raise an error, but the feature
selectors themselves no longer enforce this restriction unnecessarily.
:issue:`11635` by :user:`Alec Peters <adpeters>`.

- |Enhancement| Updated univariate :mod:`feature_selection` estimators to allow
NaN/Inf values in ``transform`` and ``fit``. This includes
:class:`feature_selection.GenericUnivariateSelect`,
:class:`feature_selection.SelectFdr`, :class:`feature_selection.SelectFpr`,
:class:`feature_selection.SelectFwe`, :class:`feature_selection.SelectKBest`,
and :class:`feature_selection.SelectPercentile`. Note that if the underlying
score function of the feature selector does not allow NaN/Inf, fitting will
still raise an error, but the feature selectors themselves no longer enforce
this restriction unnecessarily. :pr:`15434` by :user:`Alec Peters <adpeters>`.

- |Fix| Fixed a bug where :class:`feature_selection.VarianceThreshold` with
`threshold=0` did not remove constant features due to numerical instability,
by using range rather than variance in this case.
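
A minimal sketch of the user-facing change described above, assuming this
branch is installed; VarianceThreshold is used since it needs no NaN-tolerant
estimator or score function:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Assumes this branch: fit() no longer rejects NaN. A partially-NaN feature
# is kept (nan-aware variance), an all-NaN feature is dropped; numpy emits
# an all-NaN-slice RuntimeWarning for the middle column.
X = np.array([[1.0, np.nan, 1.0],
              [2.0, np.nan, 1.0],
              [3.0, np.nan, 1.0]])
sel = VarianceThreshold().fit(X)
print(sel.get_support(indices=True))  # expected: [0]
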
4 changes: 3 additions & 1 deletion sklearn/feature_selection/_base.py
@@ -71,7 +71,9 @@ def transform(self, X):
X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
"""
X = check_array(X, dtype=None, accept_sparse='csr')
tags = self._get_tags()
X = check_array(X, dtype=None, accept_sparse='csr',
force_all_finite=not tags.get('allow_nan', True))
mask = self.get_support()
if not mask.any():
warn("No features were selected: either the data is"
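
The pattern above (deriving validation strictness from estimator tags rather
than hard-coding it) can be sketched standalone; the `tags` dict below stands
in for the result of `self._get_tags()`:

import numpy as np
from sklearn.utils import check_array

# Stand-in for self._get_tags(); allow_nan=True maps to
# force_all_finite=False, which skips the NaN/Inf check entirely.
# (check_array also accepts force_all_finite='allow-nan' to permit NaN
# but not Inf, as VarianceThreshold does further down.)
tags = {'allow_nan': True}
X = np.array([[1.0, np.nan], [2.0, 3.0]])
X = check_array(X, dtype=None, accept_sparse='csr',
                force_all_finite=not tags.get('allow_nan', True))
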
8 changes: 8 additions & 0 deletions sklearn/feature_selection/_from_model.py
@@ -131,6 +131,10 @@ class SelectFromModel(MetaEstimatorMixin, SelectorMixin, BaseEstimator):
threshold_ : float
The threshold value used for feature selection.

Notes
-----
Allows NaN/Inf in the input if the underlying estimator does as well.

Examples
--------
>>> from sklearn.feature_selection import SelectFromModel
@@ -249,3 +253,7 @@ def partial_fit(self, X, y=None, **fit_params):
self.estimator_ = clone(self.estimator)
self.estimator_.partial_fit(X, y, **fit_params)
return self

def _more_tags(self):
estimator_tags = self.estimator._get_tags()
return {'allow_nan': estimator_tags.get('allow_nan', True)}
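
The delegation added above can be exercised directly; a rough sketch against
the private `_get_tags` API of this era, where NaNFriendly is a hypothetical
estimator written for illustration:

from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectFromModel

class NaNFriendly(BaseEstimator):
    # Hypothetical estimator that declares NaN support via its tags.
    def fit(self, X, y=None):
        return self

    def _more_tags(self):
        return {'allow_nan': True}

model = SelectFromModel(estimator=NaNFriendly())
print(model._get_tags()['allow_nan'])  # True, inherited from the estimator
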
14 changes: 11 additions & 3 deletions sklearn/feature_selection/_rfe.py
@@ -103,6 +103,10 @@ class RFE(SelectorMixin, MetaEstimatorMixin, BaseEstimator):
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

Notes
-----
Allows NaN/Inf in the input if the underlying estimator does as well.

See also
--------
RFECV : Recursive feature elimination with built-in cross-validated
@@ -150,7 +154,8 @@ def _fit(self, X, y, step_score=None):
# and is used when implementing RFECV
# self.scores_ will not be calculated when calling _fit through fit

X, y = check_X_y(X, y, "csc", ensure_min_features=2)
X, y = check_X_y(X, y, "csc", ensure_min_features=2,
force_all_finite=False)
# Initialization
n_features = X.shape[1]
if self.n_features_to_select is None:
@@ -326,7 +331,7 @@ def predict_log_proba(self, X):
return self.estimator_.predict_log_proba(self.transform(X))

def _more_tags(self):
return {'poor_score': True}
return {'poor_score': True, 'allow_nan': True}


class RFECV(RFE):
@@ -421,6 +426,8 @@ class RFECV(RFE):
``ceil((n_features - min_features_to_select) / step) + 1``,
where step is the number of features removed at each iteration.

Allows NaN/Inf in the input if the underlying estimator does as well.

Examples
--------
The following example shows how to retrieve the a-priori not known 5
@@ -479,7 +486,8 @@ def fit(self, X, y, groups=None):
train/test set. Only used in conjunction with a "Group" :term:`cv`
instance (e.g., :class:`~sklearn.model_selection.GroupKFold`).
"""
X, y = check_X_y(X, y, "csr", ensure_min_features=2)
X, y = check_X_y(X, y, "csr", ensure_min_features=2,
force_all_finite=False)

# Initialization
cv = check_cv(self.cv, y, is_classifier(self.estimator))
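
With the finiteness check relaxed and the `allow_nan` tag set, RFE defers NaN
handling entirely to the wrapped estimator. A rough sketch with a hypothetical
NaN-tolerant ranker; NaNRanker is illustrative, not a real scikit-learn
estimator:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import RFE

class NaNRanker(BaseEstimator):
    # Hypothetical estimator: ranks features by nan-aware variance.
    def fit(self, X, y):
        self.feature_importances_ = np.nanvar(X, axis=0)
        return self

X = np.random.RandomState(0).rand(20, 5)
X[0, 0] = np.nan  # previously made RFE.fit raise ValueError
y = np.zeros(20)
rfe = RFE(NaNRanker(), n_features_to_select=2).fit(X, y)
print(rfe.support_)
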
28 changes: 27 additions & 1 deletion sklearn/feature_selection/_univariate_selection.py
@@ -338,7 +338,9 @@ def fit(self, X, y):
-------
self : object
"""
X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
tags = self._get_tags()
X, y = check_X_y(X, y, ['csr', 'csc'],
force_all_finite=not tags.get('allow_nan', True),
multi_output=True)

if not callable(self.score_func):
raise TypeError("The score function should be a callable, %s (%s) "
@@ -361,6 +363,10 @@ def fit(self, X, y):
def _check_params(self, X, y):
pass

# FIXME: how do we determine the tags when they depend on the
# underlying score_func, which does not have tags?
def _more_tags(self):
return {'allow_nan': False}


######################################################################
# Specific filters
@@ -405,6 +411,8 @@ class SelectPercentile(_BaseFilter):
Ties between features with equal scores will be broken in an unspecified
way.

Allows NaN/Inf in the input if the underlying score_func does as well.

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
@@ -489,6 +497,8 @@ class SelectKBest(_BaseFilter):
Ties between features with equal scores will be broken in an unspecified
way.

Allows NaN/Inf in the input if the underlying score_func does as well.

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
@@ -568,6 +578,10 @@ class SelectFpr(_BaseFilter):
>>> X_new.shape
(569, 16)

Notes
-----
Allows NaN/Inf in the input if the underlying score_func does as well.

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
@@ -634,6 +648,10 @@ class SelectFdr(_BaseFilter):
----------
https://en.wikipedia.org/wiki/False_discovery_rate

Notes
-----
Allows NaN/Inf in the input if the underlying score_func does as well.

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
@@ -699,6 +717,10 @@ class SelectFwe(_BaseFilter):
pvalues_ : array-like of shape (n_features,)
p-values of feature scores.

Notes
-----
Allows NaN/Inf in the input if the underlying score_func does as well.

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
@@ -765,6 +787,10 @@ class GenericUnivariateSelect(_BaseFilter):
>>> X_new.shape
(569, 20)

Notes
-----
Allows NaN/Inf in the input if the underlying score_func does as well.

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
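
The FIXME above is the open design question: `allow_nan` would need to
reflect the `score_func`, which carries no tags. For reference, score
functions follow the `(X, y) -> (scores, pvalues)` contract; a NaN-tolerant
one could look like this hypothetical sketch (whether NaN ever reaches it
depends on how the tag question is resolved):

import numpy as np

def nan_aware_score(X, y):
    # Hypothetical NaN-tolerant scorer: one score per feature, ignoring NaN.
    scores = np.nanvar(X, axis=0)
    pvalues = np.ones_like(scores)  # placeholder p-values
    return scores, pvalues

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, 1.0],
              [3.0, 6.0, 2.0]])
scores, pvalues = nan_aware_score(X, [0, 1, 1])
print(scores)  # approximately [0.667, 0.25, 0.667]
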
18 changes: 14 additions & 4 deletions sklearn/feature_selection/_variance_threshold.py
@@ -29,6 +29,10 @@ class VarianceThreshold(SelectorMixin, BaseEstimator):
variances_ : array, shape (n_features,)
Variances of individual features.

Notes
-----
Allows NaN in the input.

Examples
--------
The following dataset has integer features, two of which are the same
@@ -61,24 +65,27 @@ def fit(self, X, y=None):
-------
self
"""
X = check_array(X, ('csr', 'csc'), dtype=np.float64)
X = check_array(X, ('csr', 'csc'), dtype=np.float64,
force_all_finite='allow-nan')

if hasattr(X, "toarray"): # sparse matrix
_, self.variances_ = mean_variance_axis(X, axis=0)
if self.threshold == 0:
mins, maxes = min_max_axis(X, axis=0)
peak_to_peaks = maxes - mins
else:
self.variances_ = np.var(X, axis=0)
self.variances_ = np.nanvar(X, axis=0)
if self.threshold == 0:
peak_to_peaks = np.ptp(X, axis=0)

if self.threshold == 0:
# Use peak-to-peak to avoid numeric precision issues
# for constant features
self.variances_ = np.minimum(self.variances_, peak_to_peaks)
compare_arr = np.array([self.variances_, peak_to_peaks])
self.variances_ = np.nanmin(compare_arr, axis=0)

if np.all(self.variances_ <= self.threshold):
if np.all(~np.isfinite(self.variances_) |
(self.variances_ <= self.threshold)):
msg = "No feature in X meets the variance threshold {0:.5f}"
if X.shape[0] == 1:
msg += " (X contains only one sample)"
@@ -90,3 +97,6 @@ def _get_support_mask(self):
check_is_fitted(self)

return self.variances_ > self.threshold

def _more_tags(self):
return {'allow_nan': True}
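
A small sketch of the arithmetic in the `threshold == 0` branch above, with
values chosen for illustration; `np.ptp` is not nan-aware, which is exactly
why `np.nanmin` is used to combine the two reductions:

import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [1.0, 3.0, np.nan],
              [1.0, np.nan, np.nan]])
variances = np.nanvar(X, axis=0)    # [0.0, 0.25, nan] (all-NaN slice warns)
peak_to_peaks = np.ptp(X, axis=0)   # [0.0, nan, nan]
combined = np.nanmin([variances, peak_to_peaks], axis=0)  # [0.0, 0.25, nan]
print(combined > 0)  # [False True False]: constant and all-NaN features drop
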
19 changes: 17 additions & 2 deletions sklearn/feature_selection/tests/test_feature_select.py
@@ -24,6 +24,12 @@


##############################################################################

# dummy scorer to test other functionality
def dummy_score(X, y):
return X[0], X[0]


# Test the score functions

def test_f_oneway_vs_scipy_stats():
@@ -471,7 +477,6 @@ def test_selectkbest_tiebreaking():
# Prior to 0.11, SelectKBest would return more features than requested.
Xs = [[0, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 0]]
y = [1]
dummy_score = lambda X, y: (X[0], X[0])
for X in Xs:
sel = SelectKBest(dummy_score, k=1)
X1 = ignore_warnings(sel.fit_transform)([X], y)
@@ -488,7 +493,6 @@ def test_selectpercentile_tiebreaking():
# Test if SelectPercentile selects the right n_features in case of ties.
Xs = [[0, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 0]]
y = [1]
dummy_score = lambda X, y: (X[0], X[0])
for X in Xs:
sel = SelectPercentile(dummy_score, percentile=34)
X1 = ignore_warnings(sel.fit_transform)([X], y)
@@ -667,3 +671,14 @@ def test_mutual_info_regression():
gtruth = np.zeros(10)
gtruth[:2] = 1
assert_array_equal(support, gtruth)


def test_univariate_nan_inf_allowed_in_fit():
X, y = make_regression(n_samples=100, n_features=10, n_informative=2,
shuffle=False, random_state=0, noise=10)

univariate_filter = GenericUnivariateSelect(dummy_score,
mode='percentile')
X[0] = np.NaN
X[1] = np.Inf
univariate_filter.fit(X, y)
38 changes: 38 additions & 0 deletions sklearn/feature_selection/tests/test_from_model.py
@@ -14,6 +14,22 @@
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.base import BaseEstimator


class NaNTag(BaseEstimator):
def _more_tags(self):
return {'allow_nan': True}


class NoNaNTag(BaseEstimator):
def _more_tags(self):
return {'allow_nan': False}


class NaNTagRandomForest(RandomForestClassifier):
def _more_tags(self):
return {'allow_nan': True}


iris = datasets.load_iris()
data, y = iris.data, iris.target
rng = np.random.RandomState(0)
@@ -320,3 +336,25 @@ def test_threshold_without_refitting():
# Set a higher threshold to filter out more features.
model.threshold = "1.0 * mean"
assert X_transform.shape[1] > model.transform(data).shape[1]


def test_transform_accepts_nan_inf():
# Test that transform doesn't check for np.inf and np.nan values.
clf = NaNTagRandomForest(n_estimators=100, random_state=0)

model = SelectFromModel(estimator=clf)
model.fit(data, y)

data[0] = np.NaN
data[1] = np.Inf
model.transform(data)


def test_allow_nan_tag_comes_from_estimator():
allow_nan_est = NaNTag()
model = SelectFromModel(estimator=allow_nan_est)
assert model._get_tags()['allow_nan'] is True

no_nan_est = NoNaNTag()
model = SelectFromModel(estimator=no_nan_est)
assert model._get_tags()['allow_nan'] is False
23 changes: 23 additions & 0 deletions sklearn/feature_selection/tests/test_rfe.py
@@ -2,6 +2,7 @@
Testing Recursive feature elimination
"""

import pytest
import numpy as np
from numpy.testing import assert_array_almost_equal, assert_array_equal
from scipy import sparse
@@ -369,3 +370,25 @@ def test_rfe_cv_groups():
)
est_groups.fit(X, y, groups=groups)
assert est_groups.n_features_ > 0


@pytest.mark.parametrize("cv", [
None,
5
])
def test_rfe_allow_nan_inf_in_x(cv):
iris = load_iris()
X = iris.data
y = iris.target

# add nan and inf value to X
X[0][0] = np.NaN
X[0][1] = np.Inf

clf = MockClassifier()
if cv is not None:
rfe = RFECV(estimator=clf, cv=cv)
else:
rfe = RFE(estimator=clf)
rfe.fit(X, y)
rfe.transform(X)
12 changes: 12 additions & 0 deletions sklearn/feature_selection/tests/test_variance_threshold.py
@@ -46,3 +46,15 @@ def test_zero_variance_floating_point_error():
msg = "No feature in X meets the variance threshold 0.00000"
with pytest.raises(ValueError, match=msg):
VarianceThreshold().fit(X)


def test_variance_nan():
arr = np.array(data, dtype=np.float64)
# add single NaN and feature should still be included
arr[0, 0] = np.NaN
# make all values in feature NaN and feature should be rejected
arr[:, 1] = np.NaN

for X in [arr, csr_matrix(arr), csc_matrix(arr), bsr_matrix(arr)]:
sel = VarianceThreshold().fit(X)
assert_array_equal([0, 3, 4], sel.get_support(indices=True))