[MRG]: Use coordinate_descent_gram when precompute is True | auto #3220
Conversation
Also I think there is a slight mistake in the docs in cd_fast. Should there be an extra `|`?
```python
        tol, positive)
else:
    model = cd_fast.enet_coordinate_descent(
        coef_, l1_reg, l2_reg, X, y, max_iter, tol, positive)
```
LGTM
Do you confirm a speed-up with n_samples >> n_features?
This term is constant, so we don't care.
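For context: the constant term is presumably the 0.5 * ||y||^2 left over when the least-squares loss is rewritten in Gram form. A minimal numpy check of that identity (my own illustration, not code from the PR):

```python
# Sketch (my own, not from the PR): the least-squares loss differs from
# its Gram-form rewriting only by the constant 0.5 * ||y||^2, so the Gram
# solver can drop that term without changing the argmin.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = rng.randn(50)
w = rng.randn(10)

Q = X.T.dot(X)   # Gram matrix X^T X
q = X.T.dot(y)   # X^T y

full = 0.5 * np.sum((y - X.dot(w)) ** 2)   # 0.5 * ||y - Xw||^2
gram = 0.5 * w.dot(Q).dot(w) - q.dot(w)    # Gram form, constant dropped
const = 0.5 * y.dot(y)                     # the term "we don't care" about

assert np.allclose(full, gram + const)
```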
@agramfort It seems to slow down for me.
This is weird, although possible. Is the dual gap the same at the end? What if n_samples is even bigger and n_features smaller?
@agramfort My laptop does give weird results sometimes, but I've tested it multiple times. Would you be able to check on your machine? I'll test the remaining cases.
I've changed the default arguments of precompute, based on the benchmarks run on the Rackspace machine that @ogrisel gave me.
@agramfort Please merge this if you have no objections.
I confirm that 'auto' is not doing the right thing when the model is trained with a single alpha: the overhead of computing the Gram matrix kills the benefit of fitting using the Gram. Now, is the conclusion still true when a 2d y with many targets is passed? The what's new page API section will have to be updated if we change the default arguments.
@agramfort I've updated the whats_new page.
The coordinate_descent_gram variant doesn't work for 2d y. I've also tested it for Lasso and LassoCV, and it does slow down.
By the way, the error in Travis has nothing to do with this PR; it seems to be a timeout.
@agramfort The only time I think it has a really slight advantage is when cv=3 or 4, and even then only of the order of 0.0x seconds. When cv is larger, since we need to recompute the Gram matrix for each fold, the advantage is lost again. Do you have any specific case that you want me to bench?
No, it is caused by a failing doctest that needs to take the change of this PR into account: https://travis-ci.org/scikit-learn/scikit-learn/jobs/26520998#L5635
In #3220 (comment), the following line should have caused a ValueError: …
Running this branch on my box:
I used … So Gram pre-computation seems to be benefiting the CV variant while not benefiting the original model with a fixed alpha. This is rather confusing to me.
For wide problems (n_samples < n_features):
I get similar results with the CV variant:
Both models find the same optimal value for alpha.
@ogrisel I get slightly varied results with 4 cores.
@ogrisel Both in the Rackspace cloud, on my box, and from your benches, we can be convinced that (please correct me if I'm wrong): …
Whatever the default may be, I would like to get this PR merged quickly so that I can continue work on the other PR.
The best strategy is probably data dependent, but what those experiments say is that the current heuristic implemented when … Maybe @agramfort or @mblondel could suggest ideas? In my opinion … Maybe you can try to check whether … You can also try to see the impact of correlated features with data generated with …
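A sketch of such an experiment, assuming the generator meant above is sklearn.datasets.make_regression (its effective_rank parameter yields correlated features); the problem sizes are arbitrary:

```python
# Hedged benchmark sketch: compare fitting with and without Gram
# pre-computation on correlated data. make_regression / effective_rank
# are my assumption of the data generator meant above.
from time import time

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=20000, n_features=200,
                       effective_rank=50,  # low rank => correlated features
                       noise=1.0, random_state=0)

for precompute in (True, False):
    model = ElasticNet(precompute=precompute)
    tic = time()
    model.fit(X, y)
    print("precompute=%r: %.3f s" % (precompute, time() - tic))
```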
Also please try to address: #3220 (comment)
And we can separate the two issues currently addressed in this PR:
1. …
2. …
Item 1 should not be controversial. Item 2 probably requires more investigation.
@ogrisel On a side note, is it possible that I'm seeing a drastic slowdown compared to your benchmarks because of the way I installed scikit-learn? I installed the dependencies: … and then just did …
I tried to see if …
Looks like that is indeed the case, except in the case of … @ogrisel Is there some better test than simply timeit?
@jaidevd @ogrisel Thanks. I'm getting similar results. @agramfort There seems to be just a small margin of speed gain in the case when n_samples >> n_features. What more can we do to get this verified?
No. You could rebuild ATLAS to tune it to your architecture (e.g. see: http://danielnouri.org/notes/2012/12/19/libblas-and-liblapack-issues-and-speed,-with-scipy-and-ubuntu/ ), but it's more likely that the absolute speed difference between our setups is explained by hardware (e.g. the size of the CPU caches) rather than software in this case. In any case you should not focus on absolute perf numbers, but rather on relative performance between methods on the same hardware.
timeit is fine. We just need to check that the standard deviation across runs is low enough. If not, it's worth benchmarking on larger problems.
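For what it's worth (my own suggestion, not from the thread): numpy can report which BLAS it was built against, and timeit.repeat exposes the spread across runs:

```python
# Check the BLAS/LAPACK numpy was linked against, then time repeated fits
# so the standard deviation across runs is visible. Sizes are arbitrary.
import timeit

import numpy as np

np.__config__.show()  # prints the BLAS/LAPACK build configuration

setup = """
import numpy as np
from sklearn.linear_model import ElasticNet
rng = np.random.RandomState(0)
X = rng.randn(10000, 100)
y = rng.randn(10000)
model = ElasticNet(precompute=True)
"""
runs = timeit.repeat("model.fit(X, y)", setup=setup, repeat=5, number=1)
print("best: %.3f s, std: %.3f s" % (min(runs), np.std(runs)))
```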
I played a bit more with noisy data generated with …
The build passes now. |
@MechCoder can you please address #3220 (comment)? Unexpected values for the precompute parameter should raise a ValueError.
a] Raise ValueError for invalid precompute b] Remove precompute for MultiTask ENet/LassoCV
@ogrisel Fixed. I also removed precompute from MultiTaskElasticNet/LassoCV since it is not being used.
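A minimal sketch of the validation this refers to; the helper name and error message are illustrative, not scikit-learn's actual code:

```python
import numpy as np

def _check_precompute(precompute):
    # Hypothetical helper, not scikit-learn's actual internals: accept
    # only True, False, "auto", or a precomputed Gram matrix.
    if isinstance(precompute, np.ndarray) or precompute in (True, False, "auto"):
        return precompute
    raise ValueError("precompute should be one of True, False, 'auto' "
                     "or array-like, got %r" % (precompute,))
```

With something like this, an unexpected value such as precompute="invalid" fails loudly instead of silently falling through to one of the two code paths.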
I am not that happy: I don't think we should keep an "auto" mode that is never useful and not used by default. I would rather deprecate it explicitly and add tests to check that the deprecation warnings work.
Why was the … In any case we cannot change the public API (removing parameters) without going through a deprecation cycle.
The initial goal of … We should respect that contract. If the … In any case we should not silently change the behavior of …
Also the number of targets might have an impact on whether or not …
MultiTaskElasticNet and MultiTaskLassoCV use a different objective function than ElasticNet and LassoCV.
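For reference, the two objectives as given in the scikit-learn documentation (my transcription; $\rho$ denotes l1_ratio):

```latex
% Single-target ElasticNet:
\min_w \; \frac{1}{2 n_{\mathrm{samples}}} \lVert y - Xw \rVert_2^2
        + \alpha \rho \lVert w \rVert_1
        + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2

% MultiTaskElasticNet (W has one column per target); the l1 penalty is
% replaced by the mixed l2/l1 norm, which couples the targets:
\min_W \; \frac{1}{2 n_{\mathrm{samples}}} \lVert Y - XW \rVert_{\mathrm{Fro}}^2
        + \alpha \rho \lVert W \rVert_{2,1}
        + \frac{\alpha (1 - \rho)}{2} \lVert W \rVert_{\mathrm{Fro}}^2,
\qquad \lVert W \rVert_{2,1} = \sum_i \sqrt{\sum_j W_{ij}^2}
```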
Is this true even if it has not been part of a public release? These were added by me recently.
Err, yes, I had meant that. Sorry for being ambiguous. I have updated my PR to use the Gram variant when precompute="auto" and n_samples > n_features, or when precompute is True.
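A sketch of that dispatch rule (illustrative helper names, not the exact scikit-learn internals):

```python
import numpy as np

def _use_gram(precompute, n_samples, n_features):
    # Hypothetical helper mirroring the rule stated above: Gram variant
    # when precompute is True, or when "auto" and the problem is tall.
    if precompute == "auto":
        return n_samples > n_features
    return precompute is True

def _fit_one_alpha(X, y, precompute):
    n_samples, n_features = X.shape
    if _use_gram(precompute, n_samples, n_features):
        Q = np.dot(X.T, X)  # Gram matrix, reusable across alphas / CV folds
        q = np.dot(X.T, y)
        # ... would call cd_fast.enet_coordinate_descent_gram(..., Q, q, ...)
    else:
        pass
        # ... would call cd_fast.enet_coordinate_descent(..., X, y, ...)
```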
As mentioned before, ElasticNet and LassoCV raise errors for multiple targets. If we need to fit multiple targets, we either need to do …
and then use these individually, or directly use MultiTaskElasticNet or MultiTaskLassoCV, which do not have a Gram variant. By the way, I'm already much behind my GSoC timeline. Is that ok?
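Presumably the elided snippet splits y into columns; a hedged sketch of that approach, fitting one independent single-target model per column:

```python
# One way to handle multiple targets without the MultiTask estimators,
# as suggested above: fit one single-target model per column of Y.
import numpy as np

from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
Y = rng.randn(200, 3)  # three targets

models = [ElasticNetCV().fit(X, Y[:, k]) for k in range(Y.shape[1])]
coefs = np.column_stack([m.coef_ for m in models])  # (n_features, n_targets)
```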
On Wed, Jun 04, 2014 at 07:48:52AM -0700, Manoj Kumar wrote:
If it hasn't been released, it's not a problem.
Yes, it's not the end of the world, but I agree that we need to keep in mind …
We won't hurry any merge to master because of the GSoC timeline, especially as we are about to cut the 0.15 branch. I need to find more time to review those changes in deeper detail, but I don't have the bandwidth to do so now, unfortunately.
@MechCoder could you please split this PR into independent PRs for:
1. …
2. …
3. …
It seems to me that those 3 changes are independent of one another. I am not satisfied with the current state of item 1, so it will likely take longer to merge, while the other 2 items should be less controversial as they are seemingly bugfixes.
Thanks!
This PR does the following:
1] Bench to show that precompute="auto" offers only a very slight advantage.
2] Remove precompute from MultiTaskElasticNet/Lasso CV.
3] Use the Gram variant when precompute="auto" or True.