

[MRG+1] Incorrect implementation of noise_variance_ in PCA._fit_truncated #9108


Merged · 17 commits · Aug 6, 2017
3 changes: 3 additions & 0 deletions doc/whats_new.rst
@@ -239,6 +239,9 @@ Decomposition, manifold learning and clustering
``singular_values_``, like in :class:`decomposition.IncrementalPCA`.
:issue:`7685` by :user:`Tommy Löfstedt <tomlof>`

- Fixed the implementation of ``noise_variance_`` in :class:`decomposition.PCA`.
  :issue:`9108` by `Hanmin Qin <https://github.com/qinhanmin2014>`_.

- :class:`decomposition.NMF` now faster when ``beta_loss=0``.
:issue:`9277` by :user:`hongkahjun`.

9 changes: 8 additions & 1 deletion sklearn/decomposition/pca.py
@@ -201,6 +201,9 @@ class PCA(_BasePCA):
    explained_variance_ : array, shape (n_components,)
        The amount of variance explained by each of the selected components.

        Equal to the n_components largest eigenvalues
        of the covariance matrix of X.

        .. versionadded:: 0.18

    explained_variance_ratio_ : array, shape (n_components,)
@@ -232,6 +235,9 @@ class PCA(_BasePCA):
        http://www.miketipping.com/papers/met-mppca.pdf. It is required to
        compute the estimated data covariance and score samples.

        Equal to the average of (min(n_features, n_samples) - n_components)
        smallest eigenvalues of the covariance matrix of X.

    References
    ----------
    For n_components == 'mle', this class uses the method of `Thomas P. Minka:
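
For context, the probabilistic PCA model of Tipping & Bishop defines the maximum-likelihood noise variance as the average of the eigenvalues that the truncated model discards. Writing q = n_components and lambda_1 >= ... >= lambda_d for the eigenvalues of the sample covariance matrix of X:

    sigma^2 = (1 / (d - q)) * sum_{j = q+1}^{d} lambda_j

Since an n_samples x n_features data matrix yields at most min(n_samples, n_features) nonzero covariance eigenvalues, the average has to run over min(n_features, n_samples) - n_components terms, which is exactly the correction this PR applies to _fit_truncated below.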
@@ -494,9 +500,10 @@ def _fit_truncated(self, X, n_components, svd_solver):
         self.explained_variance_ratio_ = \
             self.explained_variance_ / total_var.sum()
         self.singular_values_ = S.copy()  # Store the singular values.
-        if self.n_components_ < n_features:
+        if self.n_components_ < min(n_features, n_samples):
lesteve (Member) commented:

Could you add a test to make sure that noise_variance_ is 0 when either n_components > n_features or n_components > n_samples?

qinhanmin2014 (Member, Author) replied:

@lesteve Thanks. Will add in a few days.

lesteve (Member) replied:

I have just pushed a change to do this.

To be honest, I think we should look at the validation code that validates n_components vs n_samples and n_features and use a helper function where appropriate, also in incremental_pca.py, as noted in https://github.com/scikit-learn/scikit-learn/pull/9303/files#r131103304. A rough sketch of what such a helper could look like follows below.
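
A hypothetical sketch of such a helper (the name _check_n_components and the error messages are illustrative only, not existing scikit-learn API):

    def _check_n_components(n_components, n_samples, n_features, svd_solver):
        # At most min(n_samples, n_features) components are meaningful.
        limit = min(n_samples, n_features)
        if svd_solver == 'arpack':
            # arpack cannot compute the full decomposition.
            if not 0 < n_components < limit:
                raise ValueError("n_components=%r must satisfy 0 < "
                                 "n_components < min(n_samples, n_features)"
                                 "=%r with svd_solver='arpack'"
                                 % (n_components, limit))
        elif not 0 <= n_components <= limit:
            raise ValueError("n_components=%r must be between 0 and "
                             "min(n_samples, n_features)=%r"
                             % (n_components, limit))

A shared helper along these lines could then be called from both pca.py and incremental_pca.py instead of duplicating the checks.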

             self.noise_variance_ = (total_var.sum() -
                                     self.explained_variance_.sum())
+            self.noise_variance_ /= min(n_features, n_samples) - n_components
         else:
             self.noise_variance_ = 0.
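
As a sanity check of the corrected behaviour, here is a minimal sketch (not part of the PR, and assuming a scikit-learn release that includes this fix) comparing the fitted attributes against the eigenvalue spectrum of the sample covariance matrix:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    n_components = 3

    # svd_solver='arpack' goes through _fit_truncated, the code path fixed here.
    pca = PCA(n_components=n_components, svd_solver='arpack',
              random_state=0).fit(X)

    # Eigenvalues of the sample covariance matrix, in decreasing order;
    # np.cov uses the same 1 / (n_samples - 1) scaling as PCA.
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

    # explained_variance_ holds the n_components largest eigenvalues ...
    assert np.allclose(pca.explained_variance_, evals[:n_components])
    # ... and noise_variance_ is now the average of the discarded ones.
    assert np.isclose(pca.noise_variance_, evals[n_components:].mean())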

44 changes: 44 additions & 0 deletions sklearn/decomposition/tests/test_pca.py
@@ -529,6 +529,50 @@ def test_pca_score3():
    assert_true(ll.argmax() == 1)


def test_pca_score_with_different_solvers():
    digits = datasets.load_digits()
    X_digits = digits.data

    pca_dict = {svd_solver: PCA(n_components=30, svd_solver=svd_solver,
                                random_state=0)
                for svd_solver in solver_list}

    for pca in pca_dict.values():
        pca.fit(X_digits)
        # Sanity check for the noise_variance_. For more details see
        # https://github.com/scikit-learn/scikit-learn/issues/7568
        # https://github.com/scikit-learn/scikit-learn/issues/8541
        # https://github.com/scikit-learn/scikit-learn/issues/8544
        assert np.all((pca.explained_variance_ - pca.noise_variance_) >= 0)

    # Compare scores with different svd_solvers
    score_dict = {svd_solver: pca.score(X_digits)
                  for svd_solver, pca in pca_dict.items()}
    assert_almost_equal(score_dict['full'], score_dict['arpack'])
    assert_almost_equal(score_dict['full'], score_dict['randomized'],
                        decimal=3)


def test_pca_zero_noise_variance_edge_cases():
    # ensure that noise_variance_ is 0 in edge cases
    # when n_components == min(n_samples, n_features)
    n, p = 100, 3

    rng = np.random.RandomState(0)
    X = rng.randn(n, p) * .1 + np.array([3, 4, 5])
    # arpack raises ValueError for n_components == min(n_samples,
    # n_features)
    svd_solvers = ['full', 'randomized']

    for svd_solver in svd_solvers:
        pca = PCA(svd_solver=svd_solver, n_components=p)
        pca.fit(X)
        assert pca.noise_variance_ == 0

        pca.fit(X.T)
        assert pca.noise_variance_ == 0


def test_svd_solver_auto():
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(1000, 50))