Thanks to visit codestin.com
Credit goes to github.com

Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions doc/whats_new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,9 @@ Enhancements
A ``TypeError`` will be raised for any other kwargs. :issue:`8028`
by :user:`Alexander Booth <alexandercbooth>`.

- Added ability to use sparse matrices in :func:`feature_selection.f_regression`
with ``center=True``. :issue:`8065` by :user:`Daniel LeJeune <acadiansith>`.

Bug fixes
.........

Expand Down
6 changes: 6 additions & 0 deletions sklearn/feature_selection/tests/test_feature_select.py
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,12 @@ def test_f_regression():
assert_true((pv[:5] < 0.05).all())
assert_true((pv[5:] > 1.e-4).all())

# with centering, compare with sparse
F, pv = f_regression(X, y, center=True)
F_sparse, pv_sparse = f_regression(sparse.csr_matrix(X), y, center=True)
assert_array_almost_equal(F_sparse, F)
assert_array_almost_equal(pv_sparse, pv)

# again without centering, compare with sparse
F, pv = f_regression(X, y, center=False)
F_sparse, pv_sparse = f_regression(sparse.csr_matrix(X), y, center=False)
Expand Down
20 changes: 15 additions & 5 deletions sklearn/feature_selection/univariate_selection.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,17 +266,27 @@ def f_regression(X, y, center=True):
f_classif: ANOVA F-value between label/feature for classification tasks.
chi2: Chi-squared stats of non-negative features for classification tasks.
"""
if issparse(X) and center:
raise ValueError("center=True only allowed for dense data")
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64)
n_samples = X.shape[0]

# compute centered values
# note that E[(x - mean(x))*(y - mean(y))] = E[x*(y - mean(y))], so we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is applicable only when y is mean centered correct? In which case it would be E[x*y]? (Sorry if I'm misunderstanding)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is applicable even when y is not centered.
Yet you are right, here we compute E[x*(y - mean(y)) by first centering y and then computing E[x*y].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes sorry my math is a bit rusty... Thanks heaps for the clarification (online and offline)!

# need not center X
if center:
y = y - np.mean(y)
X = X.copy('F') # faster in fortran
X -= X.mean(axis=0)
if issparse(X):
X_means = X.mean(axis=0).getA1()
else:
X_means = X.mean(axis=0)
# compute the scaled standard deviations via moments
X_norms = np.sqrt(row_norms(X.T, squared=True) -
n_samples * X_means ** 2)
else:
X_norms = row_norms(X.T)

# compute the correlation
corr = safe_sparse_dot(y, X)
corr /= row_norms(X.T)
corr /= X_norms
corr /= norm(y)

# convert to p-value
Expand Down