
Conversation

@AlexanderFabisch
Member

... the second part of #2584.

Here is an example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.learning_curve import validation_curve

digits = load_digits()
X, y = digits.data, digits.target

# Sweep the SVC's gamma over five logarithmically spaced values.
param_range = np.logspace(-5, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    cv=10, scoring="accuracy", n_jobs=4)

# Plot training and cross-validation accuracy against gamma
# on a logarithmic x-axis.
plt.semilogx(param_range, train_scores)
plt.semilogx(param_range, test_scores)
plt.show()

@coveralls

Coverage Status

Coverage remained the same when pulling d081597 on AlexanderFabisch:validation_curves into bf1635d on scikit-learn:master.

Member

I'd like to see a Returns section here

Member

Also, param_range should probably be an explicit dict rather than **kwargs.

For example, what if you want to do a learning curve for an estimator which has a parameter called cv? As written, you'd get a "duplicate keyword argument" error.
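For illustration, a minimal sketch of the clash, written against a hypothetical **kwargs signature (not what this PR implements):

import numpy as np
from sklearn.svm import SVC

# Hypothetical **kwargs-based signature, only to show the problem:
def validation_curve(estimator, X, y, cv=None, **param_range):
    return param_range

# Fine: "gamma" does not collide with the function's own arguments.
validation_curve(SVC(), None, None, cv=10, gamma=np.logspace(-5, -1, 5))

# Broken: an estimator parameter that happens to be named "cv" cannot be
# swept, because that keyword is already claimed by the function itself.
validation_curve(SVC(), None, None, cv=10, **{"cv": [3, 5, 10]})
# TypeError: validation_curve() got multiple values for keyword argument 'cv'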

Member

Another option is to have param_name and param_range passed explicitly. I think I would favor this.

@jakevdp
Member

jakevdp commented Jan 20, 2014

Looks pretty good in general! See my inline comments.

@AlexanderFabisch
Member Author

Thanks for the feedback.

@coveralls

Coverage Status

Coverage remained the same when pulling 9f117d4 on AlexanderFabisch:validation_curves into bf1635d on scikit-learn:master.

Member

I think this should inherit from BaseEstimator, so that the routines are tested using an actual estimator framework. Is there a reason not to do that?

@AlexanderFabisch
Member Author

Maybe we should then think about extending GridSearchCV to return train_scores? It would be just an additional flag because _fit_and_score can return train_scores now. We could still provide validation_curve as a convenience function that uses GridSearchCV to generate the result.
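As a rough sketch of what that convenience wrapper could look like (this uses the modern sklearn.model_selection API and the return_train_score flag, both of which were added long after this thread, so it is only an illustration of the idea):

import numpy as np
from sklearn.model_selection import GridSearchCV

def validation_curve_via_grid_search(estimator, X, y, param_name,
                                     param_range, cv=10):
    # A grid over a single parameter; return_train_score asks grid search
    # to record training scores alongside the validation scores.
    grid = GridSearchCV(estimator, {param_name: list(param_range)}, cv=cv,
                        return_train_score=True)
    grid.fit(X, y)
    # cv_results_ holds one mean score per parameter setting.
    return (grid.cv_results_["mean_train_score"],
            grid.cv_results_["mean_test_score"])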

Member

If this inherits from BaseEstimator, then get_params and set_params shouldn't have to be re-defined.

Member

And same for MockImprovingEstimator above, I think...
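For context, a minimal sketch of what such a mock could look like once it inherits from BaseEstimator (the class name and its parameter are illustrative, not the actual test code):

from sklearn.base import BaseEstimator

class MockEstimator(BaseEstimator):
    def __init__(self, param=1.0):
        # BaseEstimator derives get_params()/set_params() by introspecting
        # the constructor signature, so neither needs to be re-defined.
        self.param = param

    def fit(self, X, y):
        return self

    def score(self, X=None, y=None):
        # A fake score that varies with the parameter, so that a
        # validation curve over "param" has a shape to show.
        return 1.0 / (1.0 + abs(self.param))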

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling 476f8fe on AlexanderFabisch:validation_curves into bf1635d on scikit-learn:master.

@jnothman
Member

Maybe we should then think about extending GridSearchCV to return train_scores? It would be just an additional flag because _fit_and_score can return train_scores now. We could still provide validation_curve as a convenience function that uses GridSearchCV to generate the result.

See #1742, where @amueller returns training scores and times from grid search, as requested by @ogrisel. I guess it's a matter of balancing the cleanliness of the API against turning grid search into a complicated Swiss Army knife, which is only worth doing if that's what's most useful. At the moment we don't have an extensible way to output additional results from grid search.

In any case, even if it just calls grid search, there may be benefit in having a single-purpose utility like validation_curve.

@AlexanderFabisch
Member Author

OK, classes and functions should do what their names suggest. GridSearchCV should do a grid search for the best parameters and nothing else. :)

@jakevdp
Member

jakevdp commented Jan 21, 2014

I feel good about this PR. I think the possibility of extending GridSearch is interesting, but either way it is valuable to have this utility routine to compute validation scores.

The one last thing that would be helpful would be to have an example, and also some narrative documentation about model selection using this PR and the learning curves (I imagine something very similar to the lecture I gave in my class last quarter -- see "Overfitting, Underfitting, and Model Selection" at that link).

@AlexanderFabisch, is that something you want to take on? If not, I can plan to do it once this PR is merged.

@AlexanderFabisch
Member Author

I could do that. Where should I put that documentation? Is it enough to add an example and link to it from the docstring?

@jakevdp
Member

jakevdp commented Jan 21, 2014

I think that adding narrative docs (i.e. the website/user guide) would be best: as an example, this page has its source in this file. You could add a new doc file, doc/modules/learning_curve.rst, with explanations and examples of the new utilities in this module.

If you want to take this on, that would be great! Feel free to ping me if you come across any issues with the process of creating documentation.

@AlexanderFabisch
Member Author

Thanks for the tips. You should probably do a thorough proofreading at the end, since you have written more about this topic than I have.

@jakevdp
Member

jakevdp commented Jan 21, 2014

Thanks for the tips. You should probably do a thorough proofreading at the end, since you have written more about this topic than I have.

Definitely will do. Since it's all related, we may as well just keep working within this PR.

@AlexanderFabisch
Member Author

Here is a first draft. I have two problems at the moment:

  1. The link to validation_curve in doc/modules/learning_curve.rst does not work.
  2. The heading of the example plot_validation_curve.py is not shown correctly in doc/modules/learning_curve.rst.

Member

We can embed figures from the example plot here. Take a look at some of the other module documentation to see how this is done.

@jakevdp
Member

jakevdp commented Jan 22, 2014

This looks like a good start!

I think the reason the link is not working is that you need to label the document somehow... but I always need to re-learn Sphinx every time I use it!

One structural suggestion: when I'm presenting this material, I usually do validation curves first and move on to learning curves after that. I find that it flows better that way.

@coveralls

Coverage Status

Coverage remained the same when pulling 81faacc on AlexanderFabisch:validation_curves into 080887e on scikit-learn:master.

Member

This is a lot of text. I think it would be great to include some examples here showing over-fitting and under-fitting. My favorite demonstration of this is to use a 1D plot with a polynomial regression (we could use the new PolynomialFeatures preprocessor for this). A low-degree polynomial has high bias, and will under-fit; a high-degree polynomial has high variance, and will over-fit the data (I used an example like this in my astronomy text: figure here).
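A compressed sketch of that kind of demonstration (synthetic data and polynomial degrees chosen only to show the two regimes; the pipeline helper is from the current API, not necessarily this PR's vintage):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(size=20))
y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
x_plot = np.linspace(0, 1, 100)

# Degree 1 under-fits (high bias); degree 15 over-fits (high variance).
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x[:, np.newaxis], y)
    plt.plot(x_plot, model.predict(x_plot[:, np.newaxis]),
             label="degree %d" % degree)
plt.scatter(x, y, color="k", label="samples")
plt.legend(loc="best")
plt.show()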

@AlexanderFabisch
Member Author

I have added a similar example.

@coveralls

Coverage Status

Coverage remained the same when pulling e3253dd on AlexanderFabisch:validation_curves into 080887e on scikit-learn:master.

@jakevdp
Member

jakevdp commented Jan 23, 2014

I have added a similar example.

That's really nice! Have you built the documentation to make sure this renders and is linked correctly?

Member

less -> fewer

@coveralls

Coverage Status

Coverage remained the same when pulling b68bd36 on AlexanderFabisch:validation_curves into baa3c7e on scikit-learn:master.

@AlexanderFabisch
Member Author

I think it is much easier to read now (and shorter).

@jakevdp
Member

jakevdp commented Feb 2, 2014

I think this is great!
+1 for merge on my part. I'd like to get another dev to look at this, though: @jnothman or @ogrisel?

Member

Typo in the above.

The docs do not build. Possibly related to the typo above. Please check that 'make html' runs and produces a sensible doc.

Member Author

Actually, the docs build fine for me. I will try to fix this.

Member

Actually, the docs build fine for me. I will try to fix this.

OK. It crashes on my box :$

Member Author

I fixed the typo. What is the error message?

Member

I fixed the typo. What is the error message?

It still doesn't build:

reading sources... [ 35%] modules/classes
Exception occurred:
  File "/home/varoquau/dev/scikit-learn/doc/sphinxext/numpy_ext/docscrape.py", line 212, in parse_item_name
    raise ValueError("%s is not a item name" % text)
ValueError: :ref:`examples/plot_polynomial_regression.py is not a item name
The full traceback has been saved in /tmp/sphinx-err-RdTJcn.log, if you want to report the issue to the developers.

I am not convinced that you can put rst syntax in a numpy docstring "seealso" block. I suspect it only accepts symbol names.

Member Author

You are right. At least I learned something. :)

In addition, I fixed the attributes section of PolynomialFeatures.

@GaelVaroquaux
Member

No remarks other than the two above.

It is possible to plot confidence intervals with fill_between

@AlexanderFabisch
Member Author

learning_curve and validation_curve both return scores of all cv folds now.
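Continuing the example from the PR description, a sketch of how the per-fold scores can be drawn as shaded confidence bands with fill_between (the one-standard-deviation band width and the alpha are arbitrary choices; each band is colored to match its line):

import matplotlib.pyplot as plt

# train_scores and test_scores now have shape (n_params, n_folds), so the
# spread across folds can be shown as a band around the mean.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

plt.semilogx(param_range, train_mean, color="b", label="training score")
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std,
                 color="b", alpha=0.2)
plt.semilogx(param_range, test_mean, color="g", label="cross-validation score")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std,
                 color="g", alpha=0.2)
plt.legend(loc="best")
plt.show()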

@coveralls

Coverage Status

Coverage remained the same when pulling 4a4bfc5 on AlexanderFabisch:validation_curves into baa3c7e on scikit-learn:master.

@AlexanderFabisch
Member Author

@GaelVaroquaux @jakevdp I think this is (finally) finished now. :)

@coveralls

Coverage Status

Coverage remained the same when pulling 2fb8bc7 on AlexanderFabisch:validation_curves into baa3c7e on scikit-learn:master.

@jakevdp
Member

jakevdp commented Feb 5, 2014

Hey - I just built the documentation, and it looks good.

One thing I would recommend changing: in the learning curve plots, it's confusing that the shaded error is blue for both lines. It should be changed to match the color of the line it surrounds.

@AlexanderFabisch
Member Author

Good idea.

@jakevdp
Member

jakevdp commented Feb 5, 2014

Looks good to me. I'll wait until the Travis CI build is finished, and then merge. Great work on this!

@GaelVaroquaux
Member

Great! Thanks a lot @AlexanderFabisch !

jakevdp added a commit that referenced this pull request on Feb 5, 2014

jakevdp merged commit 5319994 into scikit-learn:master on Feb 5, 2014

@AlexanderFabisch
Member Author

Yeah!

That was a lot more than I thought I would do before I started. :)

Thanks for the great reviews.

@mblondel
Member

mblondel commented Feb 6, 2014

There was already a polynomial regression example:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#example-linear-model-plot-polynomial-interpolation-py

I think the new one is better, so maybe we can remove the old one.

@mblondel
Member

mblondel commented Feb 6, 2014

I agree with @jnothman that we're starting to have a myriad of little utility functions that do almost the same thing.

cross_validation_report from PR #2759 and validation_curve from this PR are almost the same. I think we could merge them if we allowed them to return an n_scorers x n_params x n_folds array as well as the training times.

@AlexanderFabisch
Member Author

I opened an issue for the duplicate example.

I don't know whether validation_curve should always return scores from multiple scorers; the additional dimension could be annoying for some use cases. validation_curve is meant for plotting, while cross_validation_report is a more general utility. I agree that we should minimize the number of such functions, but I am not sure it is a good idea to merge these two.

@mblondel
Member

mblondel commented Feb 6, 2014

The additional dimension could be annoying for some use cases

I agree. This is why in PR #2759, the additional axis is "flattened" when there is only one scorer:
https://github.com/mblondel/scikit-learn/blob/multiple_grid_search/sklearn/cross_validation.py#L1179

I think we can do the same for the parameter axis.
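i.e. something along these lines (a toy sketch of the squeezing convention, not the PR's actual code):

import numpy as np

# Full result: one entry per scorer, per parameter setting, per fold.
scores = np.zeros((1, 5, 10))  # (n_scorers, n_params, n_folds)

# Drop the scorer axis when there is only one scorer; the same rule
# could be applied to the parameter axis.
if scores.shape[0] == 1:
    scores = scores[0]  # -> (n_params, n_folds)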

@jnothman
Member

jnothman commented Feb 6, 2014

This is why in PR #2759, the additional axis is "flattened" when there is only one scorer... I think we can do the same for the parameter axis.

I had a similar idea in one version of my CVScorer, although it changed a bit when Olivier suggested we might want all sorts of information from each fold and wouldn't require that all succeed without error, etc. In a later refactor attempt, I have an interface for extensible cross-validation results at https://github.com/scikit-learn/scikit-learn/pull/2079/files#diff-d1a79b4a1b5f91f6b6c1829b212fca53R1251, where you can either call the scorer for a single parameter setting's results or search across multiple settings (i.e. an abstraction for validation_curve and BaseSearchCV._fit).

@jakevdp
Member

jakevdp commented Feb 6, 2014

There was already a polynomial regression example:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#example-linear-model-plot-polynomial-interpolation-py

I think the new one is better, so maybe we can remove the old one.

I disagree. I think the two examples emphasize different things and are used within the narrative docs in different ways, so we should keep them both.
