[MRG+1] FIX: warns when invalid n_components in LinearDiscriminantAnalysis #11526
Conversation
I think a future warning sounds good.
amueller
left a comment
generally looks good, but the boundary case is unclear in the doc and comments.
sklearn/discriminant_analysis.py
Outdated
    -    n_components : int, optional
    -        Number of components (< n_classes - 1) for dimensionality reduction.
    +    n_components : int, optional (default=None)
    +        Number of components (< min(n_classes - 1, n_features)) for
<=, right? = is the default after all...
That's right, I forgot about the boundary case.
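A quick illustration of the boundary case (a hedged sketch with made-up toy data, not code from the PR): n_components equal to min(n_classes - 1, n_features) is valid and should fit without a warning.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 3 classes, 2 features -> min(n_classes - 1, n_features) = min(2, 2) = 2
X = np.array([[0., 0.], [1., 0.],    # class 0
              [0., 1.], [0., 2.],    # class 1
              [3., 3.], [4., 4.]])   # class 2
y = np.array([0, 0, 1, 1, 2, 2])

# The boundary value itself is allowed, hence "<=" in the docstring
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print(lda.transform(X).shape)  # (6, 2)
```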
sklearn/discriminant_analysis.py
Outdated
                            self.n_components)
        if self.n_components > max_components:
            warnings.warn(
                "n_components cannot be superior to min(n_features, "
I would say "larger than" not "superior". I'm not a native English speaker and the usage strikes me as odd. Even if it's correct, we have many users that are not native English speakers and will be thrown off by it.
Yes, that was a French-inspired mistake :p
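For context, the check under discussion boils down to something like this standalone sketch (paraphrased from the hunk above; _check_n_components is a hypothetical helper, not the PR's actual code):

```python
import warnings

def _check_n_components(n_components, n_features, n_classes):
    # Hypothetical helper mirroring the PR's logic: cap n_components at
    # min(n_features, n_classes - 1) and warn when the value is too large.
    max_components = min(n_features, n_classes - 1)
    if n_components is None:
        return max_components
    if n_components > max_components:
        warnings.warn("n_components cannot be larger than "
                      "min(n_features, n_classes - 1); using %d instead."
                      % max_components)
        return max_components
    return n_components

print(_check_n_components(10, n_features=4, n_classes=3))  # warns, prints 2
```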
         if self.n_components is None:
    -        self._max_components = len(self.classes_) - 1
    +        self._max_components = max_components
I feel the private variable should be called _n_components, not _max_components.
I agree. However, maybe calling it _max_components could still be justified for the case where some inputs of LDA are collinear? Indeed, scalings_ would then be truncated and have fewer components than _n_components (see issue #11528).
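A hedged sketch of that collinearity caveat (toy data; the printed shape depends on the solver's rank threshold, so treat it as illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three classes whose means (0,0), (2,2), (4,4) are exactly collinear, so
# the between-class scatter has rank 1 although
# min(n_classes - 1, n_features) = 2.
centers = np.array([[0., 0.], [2., 2.], [4., 4.]])
offsets = np.array([[0.5, 0.], [-0.5, 0.], [0., 0.5], [0., -0.5]])
X = np.vstack([c + offsets for c in centers])
y = np.repeat([0, 1, 2], len(offsets))

Z = LinearDiscriminantAnalysis().fit(X, y).transform(X)
print(Z.shape)  # likely (12, 1): scalings_ is truncated below _max_components
```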
    @pytest.mark.parametrize('n_features', [3, 5])
    @pytest.mark.parametrize('n_classes', [5, 3])
    def test_lda_dimension_warning(n_classes, n_features):
        RNG = check_random_state(0)
lowercase please. upper case is reserved for module level constants, right? (yes, X violates that, I know).
that's right, will do
    for n_components in [max_components + 1,
                         max(n_features, n_classes - 1) + 1]:
        # if n_components < min(n_classes - 1, n_features), raise warning
Should this be ">"?
yes, sorry, typo
    max_components = min(n_features, n_classes - 1)

    for n_components in [max_components - 1, None, max_components]:
        # if n_components < min(n_classes - 1, n_features), no warning
Should this be "<="?
yes, will do
        assert_no_warnings(lda.fit, X, y)

    for n_components in [max_components + 1,
                         max(n_features, n_classes - 1) + 1]:
I don't understand the second one?
I am not entirely sure about this one either. Since the first value tests just one unit above max_components, I thought I could also test a value higher than both n_features and n_classes - 1, to make sure the warning fires for any invalid value of n_components.
That's good for me. Maybe a small comment that explains this.
Done.
Alright, I'll include one.
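For reference, a hedged sketch of how the pieces above could fit together in the final test (the assertion style and warning category are assumptions; it targets the behavior introduced by this PR, while later scikit-learn versions raise an error for invalid n_components instead):

```python
import warnings

import numpy as np
import pytest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


@pytest.mark.parametrize('n_features', [3, 5])
@pytest.mark.parametrize('n_classes', [5, 3])
def test_lda_dimension_warning(n_classes, n_features):
    rng = np.random.RandomState(0)
    n_samples = 10
    X = rng.randn(n_samples, n_features)
    y = np.arange(n_samples) % n_classes   # every class is represented
    max_components = min(n_features, n_classes - 1)

    # n_components <= min(n_classes - 1, n_features): no warning expected
    for n_components in [max_components - 1, None, max_components]:
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        with warnings.catch_warnings():
            warnings.simplefilter("error")  # any warning becomes a failure
            lda.fit(X, y)

    # n_components > min(n_classes - 1, n_features): warning expected; the
    # second value exceeds both n_features and n_classes - 1, so the warning
    # must fire whichever term of the min() is the binding one
    for n_components in [max_components + 1,
                         max(n_features, n_classes - 1) + 1]:
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        with pytest.warns(Warning):         # exact category is PR-specific
            lda.fit(X, y)
```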
- fix doc for boundary case using inclusive inequalities
- fix typos
- fix style conventions
and fixes test warning assertion
sklearn/discriminant_analysis.py
Outdated
| "n_classes - 1) = min(%d, %d - 1) = %d components." | ||
| % (X.shape[1], len(self.classes_), max_components), | ||
| ChangedBehaviorWarning) | ||
| future_msg = ("In version 0.22, invalid values for " |
Maybe it would be useful to say in a few words what invalid means.
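For instance (purely illustrative wording, not the PR's final text), the message could spell the constraint out:

```python
future_msg = ("In version 0.22, setting n_components greater than "
              "min(n_features, n_classes - 1) will raise an error. "
              "Use None or a value <= min(n_features, n_classes - 1).")
```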
GaelVaroquaux
left a comment
Aside from the two minor comments that I made, this is good for me.
TomDLT
left a comment
LGTM, but you have a conflict.
You also need to add a whatsnew entry.
Thanks!
…alysis
# Conflicts:
#	sklearn/discriminant_analysis.py
#	sklearn/tests/test_discriminant_analysis.py
Sorry for the late reply. I resolved the conflict, added a what's new entry, and changed the …
Thanks!
…ysis (scikit-learn#11526)" This reverts commit 829d7bb.
Reference Issues/PRs
Fixes #10048.
Fixes #8956. (The second dimension of scalings_ will always be thresholded, not only for the svd solver; see #8956 (comment).)
What does this implement/fix? Explain your changes.
This PR:
- Raises a ChangedBehaviorWarning when the user sets n_components > min(n_features, n_classes - 1). In this case it sets _max_components (the number of first components to take) to min(n_features, n_classes - 1), so n_components is no longer taken into account. It does not throw an error, unlike PCA, so as not to break user code (cf. comment: LinearDiscriminantAnalysis doesn't reduce dimensionality during prediction #6355 (comment)). Should I maybe provide a FutureWarning and throw an error in the future?
- Tests the constraint n_components < min(n_features, n_classes - 1) (and not just n_components < n_classes - 1). I did not check the dimension explicitly (just the presence/absence of warnings), because the dimension can still be unexpected if input points are collinear. I was thinking of tackling this in another PR (raise a warning in that case and/or return the whole scalings_, including zeros, without truncation) (see #11528).

TODO:
- FutureWarning
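To make the new behavior concrete, a hedged end-to-end illustration (toy data; per the FutureWarning above, version 0.22 turns the warning into an error, which the try/except accounts for):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(30, 4)                 # n_features = 4
y = np.repeat([0, 1, 2], 10)         # n_classes = 3

# 10 > min(n_features, n_classes - 1) = min(4, 2) = 2, so this is invalid
lda = LinearDiscriminantAnalysis(n_components=10)
try:
    lda.fit(X, y)                    # under this PR: warns and caps at 2
except ValueError as exc:
    print("raised:", exc)            # scikit-learn >= 0.22 raises instead
else:
    print(lda.transform(X).shape)    # (30, 2)
```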