DOC Revisit SVM C scaling example #25115
Conversation
Thanks for the PR!
examples/svm/plot_svm_scale_c.py
Outdated
# Now, we can define a linear SVC with the `l1` penalty.
# L1-penalty case
# ---------------
# In the L1 case, theory says that prediction consistency (i.e. that under
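For context, a minimal sketch of what the `l1`-penalized linear SVC discussed in this diff might look like. This is illustrative code, not the example's actual source; the parameter values are assumptions.

```python
# Hypothetical sketch: an l1-penalized LinearSVC as discussed in this diff.
# With penalty="l1", LinearSVC requires loss="squared_hinge" and dual=False.
from sklearn.svm import LinearSVC

model_l1 = LinearSVC(
    penalty="l1",          # sparsity-inducing penalty
    loss="squared_hinge",  # the only loss supported with the l1 penalty
    dual=False,            # the l1 penalty is only available in the primal
    C=1.0,                 # illustrative value; the example sweeps over C
)
```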
While updating this example, have you come across a reference for the "theory says ..." claim?
(It would help resolve #4657)
maybe Theorem 5.1. in https://arxiv.org/pdf/0801.1095.pdf ?
> maybe Theorem 5.1. in arxiv.org/pdf/0801.1095.pdf ?
That theorem establishes an approximate equivalence between the Lasso and the Dantzig selector. I don't think it is related to the claim that "it is not possible for the learned estimator to predict as well as a model knowing the true distribution because of the bias of the L1", simply because it is not true that the L1 norm always introduces bias. I really think we should remove that claim.
OK, I played a bit with this. I agree with @glemaitre that if you don't scale C, the ramp-up is not aligned in the L1 case, yet the maximum is better aligned. If you instead rescale C in the L1 case by sqrt(1/n_samples), the curves are even more aligned.

The reason I tried this is that for the Lasso, asymptotic theory says that lambda should scale with 1/sqrt(n_samples). See e.g. Theorem 3 in https://arxiv.org/pdf/1402.1700.pdf, or https://arxiv.org/pdf/0801.1095.pdf, where the regularization parameter r is always assumed to be proportional to 1/sqrt(n_samples).

What I would suggest is to say that scaling C in the L1 case aligns the ramp-up, but the peak is what matters, and the current behavior with no scaling is pretty OK when it comes to aligning the peaks. My 2c.
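A rough sketch of the rescaling described above (my own illustrative code, not code from the PR): scale C by sqrt(1/n_samples) in the L1 case so the effective regularization strength follows the 1/sqrt(n_samples) asymptotics from the Lasso theory. The dataset and parameter values are assumptions.

```python
# Illustrative sketch of scaling C by sqrt(1 / n_samples) in the L1 case.
# Dataset shape, C_base, and subset sizes are assumptions for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5, random_state=0
)

C_base = 1.0
scores = {}
for n in (200, 1000):
    # the rescaling under discussion: C ~ C_base * sqrt(1 / n_samples)
    C_scaled = C_base * np.sqrt(1.0 / n)
    clf = LinearSVC(
        penalty="l1", loss="squared_hinge", dual=False, C=C_scaled, max_iter=10000
    )
    clf.fit(X[:n], y[:n])
    scores[n] = clf.score(X[:n], y[:n])
```

In the actual example, one would compare full validation curves over a grid of C values rather than single fits.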
Thanks @agramfort. We can change the example accordingly with a better narration :).
❤
examples/svm/plot_svm_scale_c.py
Outdated
# Now, we can define a linear SVC with the `l1` penalty.
# L1-penalty case
# ---------------
# In the L1 case, theory says that prediction consistency (i.e. that under given
Since we mention "theory says", I think we should refer to the article cited by Alex.
I am not sure the claim "theory says" is justified by the cited documents, mostly because "prediction consistency" and "model consistency" aren't standard terms in machine learning. I do cite the references in current lines 145 to 148, as they are more relevant at that level of the discussion.
I could still try to rephrase this paragraph to avoid such concepts and keep the underlying idea: L1 may set some coefficients to zero, reducing variance/increasing bias even in the limit where the sample size grows to infinity.
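To illustrate that underlying idea with a small sketch (my own code, not part of the PR): with a small C, the l1 penalty can drive the coefficients of uninformative features exactly to zero. All names and values below are assumptions for demonstration.

```python
# Illustrative sketch: l1 regularization zeroing out coefficients.
# A strongly regularized fit (small C) on data with few informative features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_samples=200, n_features=30, n_informative=3, n_redundant=0, random_state=0
)
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.01)
clf.fit(X, y)

# Count exact zeros: with strong l1 regularization, many uninformative
# features are dropped, reducing variance at the cost of some bias.
n_zero = int(np.sum(clf.coef_ == 0))
```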
Indeed, this would be nice.
Otherwise, LGTM.
LGTM. Thanks @ArturoAmorQ
Co-authored-by: ArturoAmorQ <[email protected]> Co-authored-by: Guillaume Lemaitre <[email protected]>
Reference Issues/PRs
Follow-up of #21776. See also #779.
What does this implement/fix? Explain your changes.
This example had room for improvement in terms of wording and clarity of scope. Hopefully this PR fixes it.
Any other comments?
This PR removes one of the two synthetic datasets used in the previous narrative, so a single sparse dataset is now used to demo both the L1 and L2 penalties.