[MRG] Validation curves #2765
Conversation
sklearn/learning_curve.py
Outdated
I'd like to see a `Returns` section here.
Also, `param_range` should probably be an explicit dict rather than `**kwargs`. For example, what if you want to do a learning curve for an estimator which has a parameter called `cv`? As written, you'd get a "duplicate keyword argument" error.
Another option is to have `param_name` and `param_range` passed explicitly. I think I would favor this.
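The collision the reviewers describe can be sketched in plain Python; the function names and bodies below are illustrative stand-ins, not the real scikit-learn code:

```python
# Hypothetical sketch of the two API styles discussed above.

def validation_curve_kwargs(estimator, X, y, cv=3, **param_range):
    # The swept parameter arrives via **kwargs.  If the estimator itself
    # has a parameter literally named "cv", the caller cannot sweep it:
    # Python routes cv=... to the cv argument of this function instead.
    return param_range

def validation_curve_explicit(estimator, X, y, param_name, param_range, cv=3):
    # No collision: the swept parameter is named explicitly as a string,
    # so it never competes with this function's own keyword arguments.
    return {param_name: param_range}

# Sweeping a parameter named "cv" is only unambiguous in the explicit form:
print(validation_curve_explicit(None, None, None, "cv", [1, 2, 3]))
# -> {'cv': [1, 2, 3]}
```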
Looks pretty good in general! See my inline comments.
Thanks for the feedback.
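For reference, the explicit `param_name`/`param_range` signature is what scikit-learn ultimately adopted; in current releases the function lives in `sklearn.model_selection`. A minimal usage sketch (the iris data and an `SVC` are chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sweep gamma over six values; the parameter under study is named
# explicitly as a string, so nothing collides with cv=3 below.
param_range = np.logspace(-3, 2, 6)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=3)

print(train_scores.shape)  # one row per parameter value, one column per fold
```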
sklearn/tests/test_learning_curve.py
Outdated
I think this should inherit from `BaseEstimator`, so that the routines are tested using an actual estimator framework. Is there a reason not to do that? Maybe we should then think about extending `GridSearchCV` to return
sklearn/tests/test_learning_curve.py
Outdated
If this inherits from `BaseEstimator`, then `get_params` and `set_params` shouldn't have to be re-defined.
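A minimal sketch of the point being made, assuming nothing beyond the public `BaseEstimator`; the class name and its parameter are illustrative, not the actual mock from the PR:

```python
from sklearn.base import BaseEstimator

class MockEstimator(BaseEstimator):
    """Illustrative test double.

    Inheriting from BaseEstimator provides get_params/set_params
    automatically, derived from the __init__ signature, so the mock
    does not need to re-define them.
    """

    def __init__(self, param=1):
        self.param = param

    def fit(self, X, y):
        return self

    def score(self, X, y):
        return 0.5

est = MockEstimator()
print(est.get_params())   # -> {'param': 1}, inherited, not re-defined
est.set_params(param=10)  # inherited as well
print(est.param)          # -> 10
```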
And the same for `MockImprovingEstimator` above, I think...
See #1742, where @amueller returns training scores and times from grid search, as wished by @ogrisel. I guess it's a matter of balancing the cleanliness of the API and only making a complicated Swiss Army knife of grid search if that's what's most useful. At the moment we don't have an extensible way to output additional results from grid search. In any case, even if it just calls grid search, there may be a benefit to having a single-purpose utility like
OK, classes and functions should do what their names suggest.
I feel good about this PR. I think the possibility of extending

The one last thing that would be helpful would be to have an example, and also to add some narrative documentation about model selection using this PR and the learning curves (I imagine something very similar to the lecture I gave in my class last quarter -- see "Overfitting, Underfitting, and Model Selection" at that link). @AlexanderFabisch, is that something you want to take on? If not, I can plan to do it once this PR is merged.
I could do that. Where should I put that documentation? Is it enough to add an example and link that in the docstring?
I think that adding narrative docs (i.e. website/user guide) would be best; as an example, this page has its source in this file. It would probably be best to add a new doc file. If you want to take this on, that would be great! Feel free to ping me if you come across any issues with the process of creating documentation.
Thanks for the tips. You should probably do a thorough proofreading at the end, since you have probably written more about this topic than I have.
Definitely will do. Since it's all related, we may as well just keep working within this PR.
Here is a first draft. I have two problems at the moment:
doc/modules/learning_curve.rst
Outdated
We can embed figures from the example plot here. Take a look at some of the other module documentation to see how this is done.
This looks like a good start! I think the reason the link is not working is that you need to label the document somehow... but I always need to re-learn Sphinx every time I use it! One structural suggestion: when I'm presenting this material, I usually do validation curves first, and move on to learning curves after that. I find that it flows better that way.
doc/modules/learning_curve.rst
Outdated
This is a lot of text. I think it would be great to include some examples here showing over-fitting and under-fitting. My favorite demonstration of this is to use a 1D plot with a polynomial regression (we could use the new `PolynomialFeatures` preprocessor for this). A low-degree polynomial has high bias and will under-fit; a high-degree polynomial has high variance and will over-fit the data (I used an example like this in my astronomy text: figure here).
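A sketch of the kind of demonstration described above; the data, noise level, and degrees are made up for illustration and are not the example that ended up in the PR:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a smooth 1D function.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(size=30))[:, np.newaxis]
y = np.cos(1.5 * np.pi * X.ravel()) + rng.normal(scale=0.1, size=30)

# Training R^2 rises with the degree: degree 1 under-fits (high bias),
# while a high degree chases the noise (high variance, over-fitting).
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(degree, model.score(X, y))
```

A good training score at high degree says nothing about generalization, which is exactly why the cross-validated curves in this PR are needed.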
I have added a similar example.
That's really nice! Have you built the documentation to make sure this renders and is linked correctly?
doc/modules/learning_curve.rst
Outdated
less -> fewer
I think it is much easier to read now (and shorter).
sklearn/preprocessing/data.py
Outdated
Typo in the above.
The docs do not build, possibly related to the typo above. Please check that `make html` runs and produces a sensible doc.
Actually, my docs did build. I will try to fix this.
OK. It crashes on my box :$
I fixed the typo. What is the error message?
It still doesn't build:
    reading sources... [ 35%] modules/classes
    Exception occurred:
      File "/home/varoquau/dev/scikit-learn/doc/sphinxext/numpy_ext/docscrape.py", line 212, in parse_item_name
        raise ValueError("%s is not a item name" % text)
    ValueError: :ref:`examples/plot_polynomial_regression.py is not a item name

The full traceback has been saved in /tmp/sphinx-err-RdTJcn.log, if you want to report the issue to the developers.
I am not convinced that you can put reST syntax in a numpy docstring "seealso" block. I wonder whether it accepts only symbol names.
You are right. At least I learned something. :) In addition, I fixed the attributes section of `PolynomialFeatures`.
No remarks other than the two above.
It is possible to plot confidence intervals with `fill_between`.
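A minimal matplotlib sketch of such a confidence band, with made-up scores standing in for real learning-curve output; shading reuses the line's own color:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; this sketch only needs to draw
import matplotlib.pyplot as plt
import numpy as np

# Illustrative scores, not results from the PR.
train_sizes = np.array([50, 100, 200, 400])
mean = np.array([0.70, 0.78, 0.83, 0.85])
std = np.array([0.05, 0.04, 0.03, 0.02])

fig, ax = plt.subplots()
line, = ax.plot(train_sizes, mean, label="cross-validation score")
# Shade the band in the same color as the line it surrounds,
# rather than a fixed blue for every curve.
ax.fill_between(train_sizes, mean - std, mean + std,
                color=line.get_color(), alpha=0.2)
ax.set_xlabel("training examples")
ax.set_ylabel("score")
ax.legend()
```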
@GaelVaroquaux @jakevdp I think this is (finally) finished now. :)
Hey - I just built the documentation, and it looks good. One thing I would recommend changing: in the learning curve plots, it's confusing that the shaded error is blue for both lines. It should be changed to match the color of the line it surrounds.
Good idea.
Looks good to me. I'll wait until the Travis CI build is finished, and then merge. Great work on this!
Great! Thanks a lot @AlexanderFabisch!
Yeah! That was a lot more than I thought I would do before I started. :) Thanks for the great reviews.
There was already a polynomial regression example; I think the new one is better, so maybe we can remove the old one.
I agree with @jnothman that we're starting to have a myriad of little utility functions that do almost the same thing.
I opened an issue for the duplicate example. I don't know if
I agree. This is why in PR #2759 the additional axis is "flattened" when there is only one scorer; I think we can do the same for the parameter axis.
I had a similar idea in one version of my CVScorer, although it changed a
I disagree. I think the two examples emphasize different things and are used within the narrative docs in different ways, so we should keep them both.
... the second part of #2584.
Here is an example: