[MRG+1] SPCA centering and fixing scaling issue #11585
Conversation
@agramfort Ready for first review (if tests pass)
sklearn/decomposition/sparse_pca.py
Outdated
@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean
I would show in the docstring that normalize_components is optional and what its default is, like here:
scikit-learn/sklearn/ensemble/bagging.py
Line 477 in bcd6ff3
bootstrap_features : boolean, optional (default=False)
sklearn/decomposition/sparse_pca.py
Outdated
normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
If normalize_components=False should not be used but is the default for backward compatibility, the usual deprecation cycle should be added.
My fault. You already have the deprecation warning. You should add, in the docstring, when this was added and when it will be removed.
class SparsePCA(BaseEstimator, TransformerMixin):
"""Sparse Principal Components Analysis (SparsePCA)
...
Parameters
----------
...
normalize_components : boolean, optional (default=False)
False : Use a version of Sparse PCA without ...
.. versionadded:: 0.20
.. deprecated:: 0.22
``normalize_components`` was added and set to ``False`` for backward compatibility. It would be set to ``True`` from 0.22 onwards.
Now I'm not sure if we do this, so feel free to contradict me.
Yes, technically we add the parameter, then change its value, then remove it. It's long!
If it is a bug fix, we should just change the behavior, shouldn't we?
Would normalize_components be removed in version 0.22 and kept set to True? Or would that be a second deprecation cycle, with removal in 0.24?
If removed in 0.22, it should be stated in the deprecation messages.
random_state=rng, normalize_components=True)
results_train = spca_lars.fit_transform(Y)
results_test = spca_lars.transform(Y[:10])
assert_array_equal(results_train[0], results_test[0])
This is probably too stringent (and it's failing on travis). You probably want to use assert_array_almost_equal.
assert_allclose in fact
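A minimal sketch of the suggested assertion, reusing the names from the quoted test above (spca_lars, Y, results_train and results_test come from that diff and are assumed to already be defined):

from numpy.testing import assert_allclose

results_train = spca_lars.fit_transform(Y)
results_test = spca_lars.transform(Y[:10])
# compare within a floating-point tolerance instead of requiring exact equality
assert_allclose(results_train[0], results_test[0])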
sklearn/decomposition/sparse_pca.py
Outdated
self.mean_ = np.mean(X, axis=0)
X = X - self.mean_
else:
warnings.warn("normalize_components should be set to True. "
I would be more explicit: "normalize_components=False is a backward-compatible setting that implements a non-standard definition of sparse PCA. This compatibility mode will be removed in 0.22."
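A sketch of how that wording could be emitted in the branch quoted above (using DeprecationWarning here is an assumption, matching the deprecation test shown later in the thread):

import warnings

# inside SparsePCA.fit, in the else branch of the quoted diff
warnings.warn("normalize_components=False is a backward-compatible setting "
              "that implements a non-standard definition of sparse PCA. "
              "This compatibility mode will be removed in 0.22.",
              DeprecationWarning)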
Aside from the small comments I made (including fixing travis), there needs to be an entry added to whats_new. This is an important change.
I am confused here. Do we actually want to keep the buggy behaviour?
I am confused here. Do we actually want to keep the buggy behaviour?
Just in case people were relying on it.
I can be talked out of this opinion.
sklearn/decomposition/sparse_pca.py
Outdated
@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
Use indent
* If False, ...
* If True, ...
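A sketch of how the indented entry could read, combining this suggestion with the wording already in the diff and the default noted earlier (the exact phrasing is illustrative, not the PR's final text):

normalize_components : boolean, optional (default=False)
    - If False, use a version of Sparse PCA without components
      normalization and without data centering. This is likely a bug;
      even though it is the default for backward compatibility,
      it should not be used.
    - If True, use a version of Sparse PCA with components
      normalization and data centering.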
sklearn/decomposition/sparse_pca.py
Outdated
@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
centring -> centering
sklearn/decomposition/sparse_pca.py
Outdated
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
True : Use a version of Sparse PCA with components normalization
and data centring
centering
add a full stop.
sklearn/decomposition/sparse_pca.py
Outdated
@@ -77,6 +84,11 @@ class SparsePCA(BaseEstimator, TransformerMixin):
n_iter_ : int
Number of iterations run.

mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.

remove this blank line
sklearn/decomposition/sparse_pca.py
Outdated
mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.

Equal to `X.mean(axis=0)`.
Double backticks.
All right. I'm off for tonight but I'll take care of it tomorrow evening if that's all right. I'll take all the comments into account and update the what's new. Are we still OK with the double deprecation path we chose initially (0.20: deprecate the default option False; 0.22: deprecate the param and change the default to True; 0.24: delete the param)?
@GaelVaroquaux @glemaitre @massich Is travis supposed to fail on deprecation warnings? Because, except for some last changes, I think that's the last step.
Yes, travis is supposed to fail on uncaught deprecation warnings. You should catch them in the test, using warnings.simplefilter or @pytest.mark.filterwarnings.
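For the second option, a minimal sketch of what such a mark could look like (the test name and data are illustrative, and it assumes a scikit-learn version that already exposes the normalize_components parameter):

import numpy as np
import pytest
from sklearn.decomposition import SparsePCA


@pytest.mark.filterwarnings("ignore:normalize_components:DeprecationWarning")
def test_spca_legacy_setting():
    # the mark silences the expected deprecation warning so the build stays green
    X = np.random.RandomState(0).randn(20, 8)
    SparsePCA(n_components=3, normalize_components=False).fit(X)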
The maths are correct (I think). @FollowKenny you need to make sure now that normalize_components is set to True in all examples and documentation pages so no deprecation warning pops up.
I used git grep "SparsePCA(" and it only showed examples/decomposition/plot_faces_decomposition.py. Am I missing something, or is it safe to assume this is the only doc update?
@@ -237,5 +237,6 @@ def test_pca_vs_spca():
def test_spca_deprecation_warning(spca):
rng = np.random.RandomState(0)
Y, _, _ = generate_toy_data(3, 10, (8, 8), random_state=rng)
with pytest.warns(DeprecationWarning, match="normalize_components"):
spca(normalize_components=False).fit(Y)
warn_message ="normalize_components"
Why not X? Y is a bit confusing.
Indeed, but all the other tests use Y (and are equally confusing), so I kept the local convention. Should I change this?
LGTM.
+1 for merge
Couple of nitpicks before merging
sklearn/decomposition/sparse_pca.py
Outdated
@@ -66,6 +66,20 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.

normalize_components : boolean, optional (default=False)
- if False, Use a version of Sparse PCA without components
if False, Use -> If False, use
sklearn/decomposition/sparse_pca.py
Outdated
normalization and without data centering. This is likely a bug and
even though it's the default for backward compatibility,
this should not be used.
- if True, Use a version of Sparse PCA with components normalization
if True, Use -> If True, use
sklearn/decomposition/sparse_pca.py
Outdated
and data centering.

.. versionadded:: 0.20
.. deprecated:: 0.22
I would add a line in between the version added and deprecated.
sklearn/decomposition/sparse_pca.py
Outdated
.. versionadded:: 0.20
.. deprecated:: 0.22
``normalize_components`` was added and set to ``False`` for
You should start the line under the "d" of "deprecated" on the previous line.
Refer to http://www.sphinx-doc.org/en/stable/markup/para.html#directive-deprecated for an example.
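A sketch of the docstring fragment with the indentation that page describes, reusing the wording from the diff above:

.. versionadded:: 0.20

.. deprecated:: 0.22
   ``normalize_components`` was added and set to ``False`` for
   backward compatibility. It would be set to ``True`` from 0.22
   onwards.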
from sklearn.utils import check_random_state

import pytest
Put this import in line 5 above import numpy as np
sklearn/decomposition/sparse_pca.py
Outdated
@@ -116,6 +136,17 @@ def fit(self, X, y=None):
"""
random_state = check_random_state(self.random_state)
X = check_array(X)

if self.normalize_components:
self.mean_ = np.mean(X, axis=0)
I would almost suggest calling X.mean(axis=0), but it is true that we don't support sparse matrices.
random_state=0).fit(Y).transform(Y)
random_state=0,
normalize_components=norm_comp)\
.fit(Y).transform(Y)
Avoid that \. Just create an instance and then call fit_transform on a second line. It is more readable.
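A sketch of the suggested layout, reusing the estimator and variable names visible in the surrounding diff (MiniBatchSparsePCA, alpha, norm_comp, Y and U2 are assumed to be the ones defined in that test):

# build the estimator first, then fit and transform on a separate line,
# which avoids the trailing backslash
spca = MiniBatchSparsePCA(n_components=3, n_jobs=2, alpha=alpha,
                          random_state=0,
                          normalize_components=norm_comp)
U2 = spca.fit(Y).transform(Y)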
finally:
joblib_par.multiprocessing = _mp
else:  # we can efficiently use parallelism
U2 = MiniBatchSparsePCA(n_components=3, n_jobs=2, alpha=alpha,
random_state=0).fit(Y).transform(Y)
random_state=0,
normalize_components=norm_comp)\
Same comment as before.
model.fit(rng.randn(5, 4))
assert_array_equal(model.components_, V_init)
if norm_comp:
assert_array_equal(model.components_,
It seems to be a float comparison. Shall we use assert_allclose?
All right, but in the else branch I'm leaving assert_array_equal, as this comes from the previous test and it should really be equal since nothing is done to the array in this case.
assert_array_equal(model.components_,
V_init / np.linalg.norm(V_init, axis=1)[:, None])
else:
assert_array_equal(model.components_, V_init)
It seems to be a float comparison. Shall we use assert_allclose?
Ah, I did not see that you suggested this change too, hence my previous comment...
Waiting for the CI to be green.
@FollowKenny Thanks a lot!!!!
Awesome! @agramfort @GaelVaroquaux @glemaitre @massich Thanks for your help!
@@ -182,6 +183,14 @@ Decomposition, manifold learning and clustering
This applies to the dictionary and sparse code.
:issue:`6374` by :user:`John Kirkham <jakirkham>`.

- :class:`decomposition.SparsePCA` now exposes ``normalize_components``. When
set to True, the train and test data are centered with the train mean
repsectively during the fit phase and the transform phase. This fixes the
respectively
I'm surprised we do a backward deprecation for a bug. We haven't really done that in the past and I'm -0 on it. @jnothman you have an opinion?
Sure, if you want to remove the deprecation, it does count as talking me out of it. It's really a change in behavior, but the previous behavior had no statistical meaning whatsoever.
@FollowKenny: thanks a lot for this work. It was hard and important.
Reference Issues/PRs
fixes #9394
What does this implement/fix? Explain your changes.
@Andrewww reported a scaling issue with SPCA: the transform scaling depended on the number of samples. Investigating this issue, we found several disturbing things:
This PR aims to fix all that. With all the fixes implemented, the scaling issue disappears and the transforms from PCA and SPCA(alpha=0, ridge_alpha=0) give exactly the same results.
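A small sketch of how one could check that claim locally, assuming a scikit-learn build that includes this PR (the data, shapes and per-column sign alignment are illustrative, not the PR's own test):

import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)

pca = PCA(n_components=3).fit(X)
spca = SparsePCA(n_components=3, alpha=0, ridge_alpha=0,
                 normalize_components=True, random_state=0).fit(X)

T_pca, T_spca = pca.transform(X), spca.transform(X)
# components are defined only up to a sign, so align signs column-wise
T_pca *= np.sign(T_pca[0, :])
T_spca *= np.sign(T_spca[0, :])
print(np.max(np.abs(T_pca - T_spca)))  # expected to be ~0 with the fix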
Any other comments?
TODO: