
scale_params for linear models #779


Closed
GaelVaroquaux opened this issue Apr 17, 2012 · 11 comments

@GaelVaroquaux
Member

As discussed on the mailing list, the way the regularization parameter is scaled in linear models can be fragile to simple variations of the data, such as when the number of training samples varies. This is the case for libsvm, for which we tried to come up with a rescaling of the C parameter; that ends up being a burden, as the resulting API no longer closely matches the libsvm API.

The problem is more general than libsvm, and I propose that an optional 'scale_params' parameter be added to some linear models, to put the regularization parameter in a more dimensionless form. In the future, it can be added to other estimators.

For l1-penalized models, the way C should be scaled is fairly natural and given by the KKT conditions, as implemented in svm.l1_min_c (note that to convert the C parameter of SVMs/logreg to alpha in lasso and enet, you have to take something like alpha = n_samples/C). For l2-penalized models, there is no such abrupt change, and I suggest investigating the use of the l2 norm of Xy instead of the l_inf (max) norm used in svm.l1_min_c.
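The "natural" l1 scaling above can be sketched in a few lines of NumPy. This is a sketch, not scikit-learn's svm.l1_min_c (which handles the classification losses); the 1/(2·n_samples) squared-loss normalization is an assumption made for illustration:

```python
import numpy as np

# Toy data; any design matrix and target will do.
rng = np.random.RandomState(0)
n_samples = 50
X = rng.randn(n_samples, 10)
y = rng.randn(n_samples)

# KKT condition for the lasso (loss scaled by 1 / (2 * n_samples)):
# w = 0 is optimal as soon as alpha >= max_j |x_j . y| / n_samples.
alpha_max_l1 = np.max(np.abs(X.T @ y)) / n_samples

# The l2 analogue suggested above: replace the l_inf (max) norm of X^T y
# by its l2 norm.
alpha_ref_l2 = np.linalg.norm(X.T @ y) / n_samples
```

Since ||v||_inf <= ||v||_2 for any vector, the l2-based reference is always at least as large as the l1 critical penalty.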

Here is the battle plan:

  • Transform @amueller 's gist into a scikit-learn example https://gist.github.com/2354823
  • Implement scale_params for l1 models (logistic regression and lasso), add these models to the example, check that scale_params works
  • Factor out the logic of l1_min_c into a '_penalty_min_heuristic' function that works for l1 and l2, and use it to implement l1_min_c and l2_min_c
  • Implement scale_params for l2 models (SVM, ridge), add these models to the example, check that the scaling does work
  • Add a check in GridSearchCV that raises a warning if an estimator has a scale_params attribute set to False
  • Remove the scale_C parameter from SVMs

This should be it. And I heard that @jaquesgrobler was volunteered to do this :)
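The factoring step in the plan could look roughly like this. The name '_penalty_min_heuristic' comes from the bullet above, but its signature is an assumption, and the squared-loss correlation X^T y is used as a simple stand-in for the loss-specific quantity the real l1_min_c computes:

```python
import numpy as np

def _penalty_min_heuristic(X, y, norm=np.inf):
    """Hypothetical shared helper sketched from the plan above.

    norm=np.inf gives the l1-style bound from the KKT conditions;
    norm=2 is the l2 variant suggested in the issue. Uses the
    squared-loss correlation X^T y for simplicity.
    """
    den = np.linalg.norm(X.T @ y, ord=norm)
    # C-style scaling, following the alpha = n_samples / C convention above.
    return X.shape[0] / den

def l1_min_c_sketch(X, y):
    return _penalty_min_heuristic(X, y, norm=np.inf)

def l2_min_c_sketch(X, y):
    return _penalty_min_heuristic(X, y, norm=2)
```

Because the l_inf norm is never larger than the l2 norm, the l1 sketch always returns a C at least as large as the l2 one.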

@ogrisel
Member

ogrisel commented Apr 17, 2012

This affects not just linear models but also SVMs with kernels.

@amueller
Member

@GaelVaroquaux I am not so familiar with the workings of l1_min_c. Do I understand correctly that you suggest something more than scaling by n_samples? And that it is not really clear what to do in the l2 case?

@GaelVaroquaux
Member Author

@GaelVaroquaux I am not so familiar with the workings of l1_min_c.
Do I understand correctly that you suggest something more than
scaling by n_samples?

Yes, it would also account for the fact that if you scale (in a
squared-loss regression setting) X and y by a constant, you should also
scale the l1 penalty (as the loss term scales as the square, while the
l1 penalty only scales linearly).

For l1, what I propose to use is the maximum penalty for which the
model has non-zero coefficients. It is quite a natural scaling: above
this penalty, the model is completely sparse, and it is not useful to
grid search there. It will also scale with the number of samples, and
thus solve our current problem.
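This "completely sparse above the critical penalty" behavior is easy to verify with a throwaway coordinate-descent lasso. This is a sketch, not scikit-learn's solver; the 1/(2·n_samples) loss normalization is an assumption:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimal coordinate descent for
    (1 / (2 * n_samples)) * ||y - X w||^2 + alpha * ||w||_1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            w[j] = soft_threshold(rho / n_samples, alpha) / (col_sq[j] / n_samples)
    return w

rng = np.random.RandomState(0)
X = rng.randn(40, 8)
y = rng.randn(40)

alpha_max = np.max(np.abs(X.T @ y)) / X.shape[0]
w_above = lasso_cd(X, y, 1.01 * alpha_max)  # all coefficients exactly zero
w_below = lasso_cd(X, y, 0.5 * alpha_max)   # some coefficients survive
```

Just above alpha_max the soft threshold kills every update from the zero start, so the fit is fully sparse; just below it, at least the most correlated feature enters the model.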

And that it is not really clear what to do in the l2 case?

Indeed, it is not as clear.

Gael

@amueller
Member

Ok I just reread your mail and I think I understand the idea.

Basically, what "scaling" means would depend on the estimator and the setting of the other parameters (like the penalty).
I am not sure how general the problem really is. Which parameters should be scaled, conditioned on which other parameters? (This might already be unclear for rbf SVMs.)

Also, "do nothing" seems a reasonable heuristic for l2 SVMs. So using l1_min_c for l1 and using liblinear's method for l2 would be consistent with your proposal?

@GaelVaroquaux
Member Author

On Tue, Apr 17, 2012 at 02:31:32PM -0700, Andreas Mueller wrote:

Basically what "scaling" means would depend on the estimator and the
setting of the other parameters (like penalty).

Indeed.

I am not sure how general the problem really is.

It is very general. The question is: how do I go from one problem to
another while keeping as many things constant as possible: if I vary the
number of samples (drawn from the same distribution), how do I need to
vary the penalty? If I scale my features, how do I need to vary the
penalty?
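Both questions become concrete with the lasso's critical penalty, alpha_max = max|X^T y| / n_samples (the 1/n loss normalization is again an assumption): duplicating the samples leaves it unchanged, while scaling the features scales it linearly:

```python
import numpy as np

def alpha_max(X, y):
    # Critical lasso penalty under a 1 / (2 * n_samples) loss scaling.
    return np.max(np.abs(X.T @ y)) / X.shape[0]

rng = np.random.RandomState(0)
X = rng.randn(30, 6)
y = rng.randn(30)

base = alpha_max(X, y)

# Duplicating every sample: X^T y doubles, n_samples doubles -> unchanged.
doubled = alpha_max(np.vstack([X, X]), np.concatenate([y, y]))

# Scaling the features by c: the critical penalty scales by c as well.
scaled = alpha_max(10.0 * X, y)
```

A penalty expressed relative to such a reference stays meaningful when the sample count or feature scale changes, which is the point of scale_params.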

Which parameters should be scaled conditioned on which other
parameters? (This might already be unclear for rbf SVMs).

I suggest that we move slowly, and implement parameter scaling only
for the parameters for which we know it makes sense, and have a good
reason to do it.

Also, "do nothing" seems a reasonable heuristic for l2 SVMs.

I am not sure. It seems to me that if we choose the 'do nothing'
option, then if I increase the number of samples I run into the
problem that got us here. Ideally, the heuristic should be chosen by
looking at the optimality conditions of the optimization problem being
solved.

So using l1_min_c for l1 and using liblinears method for l2 would
be consistent with your proposal?

I don't get your question.

Gael

@amueller
Member

To the "general" point again: my question is, how do you decide which variables are independent and which are dependent? For the number of samples, that may be clear, but if you have two interacting parameters, which one would you keep the same and which one would you vary to fit the new setting?
For rbf SVMs: would C depend on gamma, or gamma on C?
To me, this looks like trying to be too clever.

About l2: I'm not sure it is possible to observe the same effect. Maybe if you have a simple problem with loads of noise. In the problems I'm usually facing, if I get 100x more data, it does not represent the same distribution any more: I have way too few data points to sample my space anywhere near densely enough.
Also, the problem doesn't fit in my RAM any more if I have 100x more data. (100 is a factor in C that will probably mess things up in the settings I deal with.)

My last question was me being confused and a bit provocative, as you didn't really say what to do for l2 ;)

You argued with scale invariance before. SVMs are usually not expected to be scale invariant. One could try to scale gamma in rbf SVMs to get that, but that would be pretty unexpected behavior to me (and to most people used to dealing with rbf SVMs). There are heuristics to choose gamma based on all kinds of estimates, but I wouldn't want this in scikit (edit: by default). And it would be specific to one kernel.

Having more samples would theoretically lead to a linear scaling of C. I guess I could come up with an example that shows that this would be the right thing. But as I said above, I highly doubt that this would help in practice.

@amueller
Member

On the other hand, if you just rename "scale_C" to have a more general name and don't attach too much semantics to it, I can live with that. For the l2 case, that would mean using "scale_C" (under a different name) and warning.

@mblondel
Member

We need to check whether scaling based on l1_min_c works or not with Andy's example. One problem I see with using it is that it will make it more difficult for users to decide which values of C they want to grid search.

@amueller
Member

Rescheduling for 0.13

@jaquesgrobler
Member

Ping on this one. Where exactly are we with this issue? I want to add it to my todo list.

@amueller modified the milestone: 0.15 on Sep 29, 2016
@amueller
Member

@GaelVaroquaux we decided to punt on this, right?
