
[MRG+1] SPCA centering and fixing scaling issue #11585


Merged
merged 10 commits into from
Jul 20, 2018

Conversation

@summer-bebop (Contributor) commented Jul 17, 2018

Reference Issues/PRs

fixes #9394

What does this implement/fix? Explain your changes.

@Andrewww reported a scaling issue with SPCA: the transform scaling depended on the number of samples. Investigating this issue, we found several disturbing things:

  • In the transform of SPCA, the last lines consisted of a peculiar normalization sequence. These lines were the cause of the issue.
  • SPCA with alpha=0 and ridge_alpha=0 and the standard PCA gave inconsistent results.
  • The data were not centered during the fit.
  • The data were not centered during the transform either.
  • The components were not normalized. I'm still not sure that they should be; from what I saw in this paper I'm inclined to say yes, but I have not read it thoroughly yet and have not cross-checked sources.

This PR aims to fix all of that. With all the fixes implemented, the scaling issue disappears, and the transforms from PCA and SPCA(alpha=0, ridge_alpha=0) give exactly the same results.
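The fixed behaviour described above can be sketched in plain NumPy (the components and the `transform` helper here are hypothetical stand-ins for the estimator, not the actual scikit-learn code): centering with the training mean and projecting, with no rescaling by the number of samples, makes the transform of a subset match the corresponding rows of the full transform.

```python
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(0)
X_train = rng.randn(100, 5)

# Fit phase (sketch): store the per-feature training mean, as the fix does.
mean_ = X_train.mean(axis=0)

# Hypothetical components (rows), rescaled to unit L2 norm as discussed above.
V = rng.randn(3, 5)
V /= np.linalg.norm(V, axis=1)[:, None]

def transform(X):
    # Center with the *training* mean, then project. Nothing here depends
    # on how many rows are passed at once, so per-row results are stable.
    return (X - mean_) @ V.T

full = transform(X_train)
subset = transform(X_train[:10])
assert_allclose(full[:10], subset)
```

This is only an illustration of the centering/normalization idea; the actual fix lives in `SparsePCA.fit` and `SparsePCA.transform`.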

Any other comments?

TODO :

  • Decide on the normalization strategy (read more thoroughly)
  • Implement the fixes inside a deprecation path
  • Implement the fixes for MiniBatchSparsePCA
  • Implement scaling test (from the bug report)
  • Implement PCA vs SPCA(0, 0) test (to check the new behaviour is correct)

@summer-bebop (Contributor, Author):

@agramfort Ready for first review (if tests pass)

@summer-bebop summer-bebop changed the title [WIP] SPCA centering and fixing scaling issue [MRG] SPCA centering and fixing scaling issue Jul 17, 2018
@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean
@massich (Contributor) commented Jul 17, 2018:

I would show in the docstring that normalize_components is optional and what its default is,
like here:

bootstrap_features : boolean, optional (default=False)

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
Contributor:

If normalize_components=False should not be used but is the default for backward compatibility, the usual deprecation cycle should be added.

Contributor:

My fault. You already have the deprecation warning. You should add, in the docstring, when this was added and when it will be removed.

@massich (Contributor) commented Jul 17, 2018:

class SparsePCA(BaseEstimator, TransformerMixin):
    """Sparse Principal Components Analysis (SparsePCA)
 
...
    Parameters
    ----------
...
    normalize_components : boolean, optional (default=False) 
        False : Use a version of Sparse PCA without ...

        .. versionadded:: 0.20
        .. deprecated:: 0.22
       ``normalize_components`` was added and set to ``False`` for backward
       compatibility. It will be set to ``True`` from 0.22 onwards.

Contributor:

Now I'm not sure if we do this, so feel free to contradict me.

Member:

Yes, technically we add the parameter, then change its value, then remove it. It's long!

Member:

If it is a bug fix, we should just change the behavior, shouldn't we?

@massich (Contributor) left a comment:

Would normalize_components be removed in version 0.22 and kept set to True?

Or would that be a second deprecation cycle, removed in 0.24?

If it is removed in 0.22, that should be stated in the deprecation messages.

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
Contributor:

Now I'm not sure if we do this, so feel free to contradict me.

random_state=rng, normalize_components=True)
results_train = spca_lars.fit_transform(Y)
results_test = spca_lars.transform(Y[:10])
assert_array_equal(results_train[0], results_test[0])
Member:

This is probably too stringent (and it's failing on travis). You probably want to use assert_array_almost_equal.

Member:

assert_allclose in fact
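The distinction the reviewers are making matters for floating-point results: `assert_array_equal` requires bitwise equality, while `assert_allclose` checks agreement within a relative tolerance. A minimal illustration:

```python
import numpy as np
from numpy.testing import assert_allclose

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

# assert_array_equal(a, b) would fail: 0.1 + 0.2 is 0.30000000000000004,
# not exactly equal to the float literal 0.3.
assert a[0] != b[0]

# assert_allclose passes, since the values agree to within rounding error.
assert_allclose(a, b, rtol=1e-12)
```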

self.mean_ = np.mean(X, axis=0)
X = X - self.mean_
else:
warnings.warn("normalize_components should be set to True. "
Member:

I would be more explicit: "normalize_components=False is a backward-compatible setting that implements a non-standard definition of sparse PCA. This compatibility mode will be removed in 0.22."

@GaelVaroquaux (Member) left a comment:

Aside from the small comments I made (including fixing travis), there needs to be an entry added to whats_new. This is an important change.

@glemaitre (Member):

I am confused here. Do we actually want to keep the buggy behaviour?

@GaelVaroquaux (Member) commented Jul 17, 2018 via email

@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
Member:

Use indent

* If False, ...
* If True, ...

@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
Member:

centring -> centering

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
Member:

If it is a bug fix we should just change the behavior, isn't it?

and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
True : Use a version of Sparse PCA with components normalization
and data centring
Member:

centering

Member:

add a full stop.

@@ -77,6 +84,11 @@ class SparsePCA(BaseEstimator, TransformerMixin):
n_iter_ : int
Number of iterations run.

mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.

Member:

remove this blank line

mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.

Equal to `X.mean(axis=0)`.
Member:

double backticks

@summer-bebop (Contributor, Author):

All right. I'm off for tonight, but I'll take care of it tomorrow evening if that's all right. I'll take all the comments into account and update the what's new. Are we still OK with the double deprecation path we chose initially (0.20: deprecate the default option False; 0.22: deprecate the param and change the default to True; 0.24: delete the param)?

@summer-bebop (Contributor, Author):

@GaelVaroquaux @glemaitre @massich Is travis supposed to fail on deprecation warnings? Because, except for some last changes, I think that's the last step.

@GaelVaroquaux (Member):

Yes, travis is supposed to fail on uncaught deprecation warnings. You should catch them in the test, using warnings.simplefilter or @pytest.mark.filterwarnings
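The pattern being suggested can be sketched with the standard `warnings` module (the warning text and `legacy_fit` helper are illustrative, not the estimator's actual message); in a pytest suite one would instead decorate the test with `@pytest.mark.filterwarnings(...)` or assert the warning with `pytest.warns`:

```python
import warnings

def legacy_fit():
    # Stand-in for an estimator call that emits the deprecation warning.
    warnings.warn("normalize_components=False is deprecated",
                  DeprecationWarning)

# Catch the warning explicitly so it cannot escape and fail the CI run.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_fit()

assert len(caught) == 1
assert issubclass(caught[0].category, DeprecationWarning)
assert "normalize_components" in str(caught[0].message)
```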

@agramfort (Member):

The maths are correct (I think).

@FollowKenny you need to make sure now that normalize_components is set to True in all examples and documentation pages so no deprecation warning pops up.

@summer-bebop (Contributor, Author):

I used git grep "SparsePCA(" and it only showed examples/decomposition/plot_faces_decomposition.py. Am I missing something, or is it safe to assume this is the only doc update?
Tests running.

@@ -237,5 +237,6 @@ def test_pca_vs_spca():
def test_spca_deprecation_warning(spca):
rng = np.random.RandomState(0)
Y, _, _ = generate_toy_data(3, 10, (8, 8), random_state=rng)
with pytest.warns(DeprecationWarning, match="normalize_components"):
spca(normalize_components=False).fit(Y)
warn_message ="normalize_components"
Member:

why not X ? Y is a bit confusing

Contributor (Author):

Indeed, but all the other tests use Y (and are equally confusing), so I kept the local convention. Should I change this?

@GaelVaroquaux GaelVaroquaux changed the title [MRG] SPCA centering and fixing scaling issue [MRG+1] SPCA centering and fixing scaling issue Jul 20, 2018
@GaelVaroquaux (Member) left a comment:

LGTM.

+1 for merge

@GaelVaroquaux GaelVaroquaux added this to the 0.20 milestone Jul 20, 2018
@glemaitre (Member) left a comment:

Couple of nitpicks before merging

@@ -66,6 +66,20 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean, optional (default=False)
- if False, Use a version of Sparse PCA without components
Member:

if False, Use -> If False, use

normalization and without data centering. This is likely a bug and
even though it's the default for backward compatibility,
this should not be used.
- if True, Use a version of Sparse PCA with components normalization
Member:

if True, Use -> If True, use

and data centering.

.. versionadded:: 0.20
.. deprecated:: 0.22
Member:

I would add a line in between the version added and deprecated.


.. versionadded:: 0.20
.. deprecated:: 0.22
``normalize_components`` was added and set to ``False`` for
Member:

You should start the line under the "d" of deprecated of the previous line.
Refer to: http://www.sphinx-doc.org/en/stable/markup/para.html#directive-deprecated for an example
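Following the Sphinx convention the reviewer links to, the directive body is indented to align under the directive name. A sketch of what the resulting docstring fragment might look like (wording follows the suggestion earlier in this thread):

```rst
normalize_components : boolean, optional (default=False)
    ...

    .. versionadded:: 0.20

    .. deprecated:: 0.22
       ``normalize_components`` was added and set to ``False`` for
       backward compatibility. It will be set to ``True`` from 0.22
       onwards.
```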

from sklearn.utils import check_random_state

import pytest
Member:

Put this import in line 5 above import numpy as np

@@ -116,6 +136,17 @@ def fit(self, X, y=None):
"""
random_state = check_random_state(self.random_state)
X = check_array(X)

if self.normalize_components:
self.mean_ = np.mean(X, axis=0)
Member:

I would almost call X.mean(axis=0), but it is true that we don't support sparse matrices.

random_state=0).fit(Y).transform(Y)
random_state=0,
normalize_components=norm_comp)\
.fit(Y).transform(Y)
Member:

avoid that \. Just create an instance and then call fit_transform on a second line. It is more readable.

finally:
joblib_par.multiprocessing = _mp
else: # we can efficiently use parallelism
U2 = MiniBatchSparsePCA(n_components=3, n_jobs=2, alpha=alpha,
random_state=0).fit(Y).transform(Y)
random_state=0,
normalize_components=norm_comp)\
Member:

Same comment as before.

model.fit(rng.randn(5, 4))
assert_array_equal(model.components_, V_init)
if norm_comp:
assert_array_equal(model.components_,
Member:

It seems to be a float comparison. Shall we use assert_allclose?

Contributor (Author):

All right, but in the else branch I'm leaving assert_array_equal, as this comes from the previous test and it should really be equal, since nothing is done on the array in this case.

assert_array_equal(model.components_,
V_init / np.linalg.norm(V_init, axis=1)[:, None])
else:
assert_array_equal(model.components_, V_init)
Member:

It seems to be a float comparison. Shall we use assert_allclose?
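The check under discussion compares the fitted components to an explicitly row-normalized copy of `V_init`. A minimal NumPy sketch of that normalization (values are illustrative):

```python
import numpy as np
from numpy.testing import assert_allclose

V_init = np.array([[3.0, 4.0],
                   [0.0, 2.0]])

# Row-wise L2 normalization, matching the quoted test's expression:
# V_init / np.linalg.norm(V_init, axis=1)[:, None]
V_norm = V_init / np.linalg.norm(V_init, axis=1)[:, None]

# Every row now has unit norm; this is a float computation, hence the
# suggestion to use assert_allclose rather than assert_array_equal.
assert_allclose(np.linalg.norm(V_norm, axis=1), [1.0, 1.0])
```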

Contributor (Author):

ah I did not see that you suggested this change too hence my previous comment...

@glemaitre (Member):

Waiting for the CI to be green

@glemaitre glemaitre merged commit 6eb1983 into scikit-learn:master Jul 20, 2018
@glemaitre (Member):

@FollowKenny Thanks a lot!!!!

@summer-bebop (Contributor, Author):

Awesome! @agramfort @GaelVaroquaux @glemaitre @massich Thanks for your help!

@@ -182,6 +183,14 @@ Decomposition, manifold learning and clustering
This applies to the dictionary and sparse code.
:issue:`6374` by :user:`John Kirkham <jakirkham>`.

- :class:`decomposition.SparsePCA` now exposes ``normalize_components``. When
set to True, the train and test data are centered with the train mean
repsectively during the fit phase and the transform phase. This fixes the
Member:

respectively

@amueller (Member):

I'm surprised we do a backward deprecation for a bug. We haven't really done that in the past, and I'm -0 on it. @jnothman, do you have an opinion?
I'd rather remove the deprecation (does that count as talking you out of it, @GaelVaroquaux?).
My main argument would be consistency with how we usually do things, which is to break behavior if we consider it a bug. I haven't looked at the issue, though from the description it sounds like a bug.

@GaelVaroquaux (Member):

Sure, if you want to remove the deprecation, it does count as talking me out of it.

It's really a change in behavior, but the previous behavior had no statistical meaning whatsoever.

@GaelVaroquaux (Member):

@FollowKenny : thanks a lot for this work. It was hard and important.


Successfully merging this pull request may close these issues.

SparsePCA incorrectly scales results in .transform()
7 participants