
scale_params for linear models #779


Closed
GaelVaroquaux opened this issue Apr 17, 2012 · 11 comments

@GaelVaroquaux
Member

As discussed on the mailing list, the way the regularization parameter is scaled in linear models can be fragile to simple variations of the data, such as when the number of training samples varies. This is the case for libsvm, for which we tried to come up with a rescaling of the C parameter; that ends up being a burden, as the resulting API no longer closely matches the libsvm API.

The problem is more general than libsvm, and I propose that an optional 'scale_params' parameter be added to some linear models, to put the regularization parameter in a more dimensionless form. In the future, it can be added to other estimators.

For l1-penalized models, the way C should be scaled is fairly natural and given by the KKT conditions, as implemented in svm.l1_min_c (note that to convert the C parameter of SVMs/logreg to alpha in lasso and enet, you have to take something like alpha = n_samples/C). For l2-penalized models, there is no such abrupt change, and I suggest investigating the use of the l2 norm of Xy instead of the l_inf (max) norm used in svm.l1_min_c.
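The "natural" l1 scaling above can be sketched in a few lines of NumPy. This is a sketch, not scikit-learn's svm.l1_min_c (which handles the classification losses); the 1/(2·n_samples) squared-loss normalization is an assumption made for illustration:

```python
import numpy as np

# Toy data; any design matrix and target will do.
rng = np.random.RandomState(0)
n_samples = 50
X = rng.randn(n_samples, 10)
y = rng.randn(n_samples)

# KKT condition for the lasso (loss scaled by 1 / (2 * n_samples)):
# w = 0 is optimal as soon as alpha >= max_j |x_j . y| / n_samples.
alpha_max_l1 = np.max(np.abs(X.T @ y)) / n_samples

# The l2 analogue suggested above: replace the l_inf (max) norm of X^T y
# by its l2 norm.
alpha_ref_l2 = np.linalg.norm(X.T @ y) / n_samples
```

Since ||v||_inf <= ||v||_2 for any vector, the l2-based reference is always at least as large as the l1 critical penalty.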

Here is the battle plan:

  • Transform @amueller 's gist into a scikit-learn example https://gist.github.com/2354823
  • Implement scale_params for l1 models (logistic regression and lasso), add these models to the example, check that scale_params works
  • Factor out the logic of l1_min_c into a '_penalty_min_heuristic' function that works for l1 and l2, and use it to implement l1_min_c and l2_min_c
  • Implement scale_params for l2 models (SVM, ridge), add these models to the example, check that the scaling does work
  • Add a check in GridSearchCV that raises a warning if an estimator has a scale_params attribute set to False
  • Remove the scale_C parameter from SVMs

This should be it. And I heard that @jaquesgrobler was volunteered to do this :)
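The factoring step in the plan could look roughly like this. The name '_penalty_min_heuristic' comes from the bullet above, but its signature is an assumption, and the squared-loss correlation X^T y is used as a simple stand-in for the loss-specific quantity the real l1_min_c computes:

```python
import numpy as np

def _penalty_min_heuristic(X, y, norm=np.inf):
    """Hypothetical shared helper sketched from the plan above.

    norm=np.inf gives the l1-style bound from the KKT conditions;
    norm=2 is the l2 variant suggested in the issue. Uses the
    squared-loss correlation X^T y for simplicity.
    """
    den = np.linalg.norm(X.T @ y, ord=norm)
    # C-style scaling, following the alpha = n_samples / C convention above.
    return X.shape[0] / den

def l1_min_c_sketch(X, y):
    return _penalty_min_heuristic(X, y, norm=np.inf)

def l2_min_c_sketch(X, y):
    return _penalty_min_heuristic(X, y, norm=2)
```

Because the l_inf norm is never larger than the l2 norm, the l1 sketch always returns a C at least as large as the l2 one.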

@ogrisel
Member

ogrisel commented Apr 17, 2012

This affects not just linear models but also SVMs with kernels.

@amueller
Member

@GaelVaroquaux I am not so familiar with the workings of l1_min_c. Do I understand correctly that you suggest something more than scaling by n_samples? And that it is not really clear what to do in the l2 case?

@GaelVaroquaux
Member Author

@GaelVaroquaux I am not so familiar with the workings of l1_min_c.
Do I understand correctly that you suggest something more than
scaling by n_samples?

Yes, it would also account for the fact that if you scale (in a
squared-loss regression setting) X and y by a constant, you should also
scale the l1 penalty (as the loss term scales as the square, while the
l1 penalty only scales linearly).

For l1, what I propose to use is the maximum penalty for which the
model has non-zero coefficients. It is quite a natural scaling: above
this penalty, the model is completely sparse, and it is not useful to
grid search there. It will also scale with the number of samples, and
thus solve our current problem.
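This "completely sparse above the critical penalty" behavior is easy to verify with a throwaway coordinate-descent lasso. This is a sketch, not scikit-learn's solver; the 1/(2·n_samples) loss normalization is an assumption:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimal coordinate descent for
    (1 / (2 * n_samples)) * ||y - X w||^2 + alpha * ||w||_1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            w[j] = soft_threshold(rho / n_samples, alpha) / (col_sq[j] / n_samples)
    return w

rng = np.random.RandomState(0)
X = rng.randn(40, 8)
y = rng.randn(40)

alpha_max = np.max(np.abs(X.T @ y)) / X.shape[0]
w_above = lasso_cd(X, y, 1.01 * alpha_max)  # all coefficients exactly zero
w_below = lasso_cd(X, y, 0.5 * alpha_max)   # some coefficients survive
```

Just above alpha_max the soft threshold kills every update from the zero start, so the fit is fully sparse; just below it, at least the most correlated feature enters the model.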

And that it is not really clear what to do in the l2 case?

Indeed, it is not as clear.

Gael

@amueller
Member

Ok I just reread your mail and I think I understand the idea.

Basically, what "scaling" means would depend on the estimator and the setting of the other parameters (like the penalty).
I am not sure how general the problem really is. Which parameters should be scaled, conditioned on which other parameters? (This might already be unclear for rbf SVMs.)

Also, "do nothing" seems a reasonable heuristic for l2 SVMs. So using l1_min_c for l1 and using liblinear's method for l2 would be consistent with your proposal?

@GaelVaroquaux
Member Author

On Tue, Apr 17, 2012 at 02:31:32PM -0700, Andreas Mueller wrote:

Basically what "scaling" means would depend on the estimator and the
setting of the other parameters (like penalty).

Indeed.

I am not sure how general the problem really is.

It is very general. The question is: how do I go from one problem to
another while keeping as many things constant as possible: if I vary the
number of samples (drawn from the same distribution), how do I need to
vary the penalty? If I scale my features, how do I need to vary the
penalty?
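Both questions become concrete with the lasso's critical penalty, alpha_max = max|X^T y| / n_samples (the 1/n loss normalization is again an assumption): duplicating the samples leaves it unchanged, while scaling the features scales it linearly:

```python
import numpy as np

def alpha_max(X, y):
    # Critical lasso penalty under a 1 / (2 * n_samples) loss scaling.
    return np.max(np.abs(X.T @ y)) / X.shape[0]

rng = np.random.RandomState(0)
X = rng.randn(30, 6)
y = rng.randn(30)

base = alpha_max(X, y)

# Duplicating every sample: X^T y doubles, n_samples doubles -> unchanged.
doubled = alpha_max(np.vstack([X, X]), np.concatenate([y, y]))

# Scaling the features by c: the critical penalty scales by c as well.
scaled = alpha_max(10.0 * X, y)
```

A penalty expressed relative to such a reference stays meaningful when the sample count or feature scale changes, which is the point of scale_params.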

Which parameters should be scaled conditioned on which other
parameters? (This might already be unclear for rbf SVMs).

I suggest that we move slowly, and implement parameter scaling only
for the parameters for which we know it makes sense, and have a good
reason to do it.

Also, "do nothing" seems a reasonable heuristic for l2 SVMs.

I am not sure. It seems to me that if we choose the 'do nothing'
option, then if I increase the number of samples I run into the
problem that got us here. Ideally, the heuristic should be chosen by
looking at the optimality conditions of the optimization problem being
solved.

So using l1_min_c for l1 and using liblinears method for l2 would
be consistent with your proposal?

I don't get your question.

Gael

@amueller
Member

To the "general" point again: my question is, how do you decide which variables are independent and which are dependent? For the number of samples, that may be clear, but if you have two interacting parameters, which one would you keep the same and which one would you vary to fit the new setting?
For rbf SVMs: would C depend on gamma, or gamma on C?
To me, this looks like trying to be too clever.

About l2: I'm not sure it is possible to observe the same effect. Maybe if you have a simple problem with loads of noise. In the problems I'm usually facing, if I get 100x more data, it does not represent the same distribution any more: I have way too few data points to sample my space anywhere near densely enough.
Also, the problem doesn't fit in my RAM any more if I have 100x more data. (100 is a factor in C that will probably mess things up in the settings I deal with.)

My last question was me being confused and a bit provocative, as you didn't really say what to do for l2 ;)

You argued with scale invariance before. SVMs are usually not expected to be scale invariant. One could try to scale gamma in rbf SVMs to get that, but that would be pretty unexpected behavior to me (and to most people used to dealing with rbf SVMs). There are heuristics to choose gamma based on all kinds of estimates, but I wouldn't want this in scikit (edit: by default). And it would be specific to one kernel.

Having more samples would theoretically lead to a linear scaling of C. I guess I could come up with an example that shows that this would be the right thing. But as I said above, I highly doubt that this would help in practice.

@amueller
Member

On the other hand, if you just rename "scale_C" to have a more general name and don't attach too much semantics to it, I can live with that. For the l2 case, that would mean using "scale_C" (under a different name) and warning.

@mblondel
Member

We need to check whether scaling based on l1_min_c works or not with Andy's example. One problem I see with using it is that it will make it more difficult for users to decide which values of C they want to grid search.

@amueller
Member

Rescheduling for 0.13

@jaquesgrobler
Member

Ping on this one. Where exactly are we with this issue? I want to add it to my todo list.

@amueller modified the milestone: 0.15 on Sep 29, 2016
@amueller
Member

@GaelVaroquaux we decided to punt on this, right?
