[MRG] Change dataset for test_classifier_chain_vs_independent_models #9255
Conversation
sklearn/tests/test_multioutput.py
```diff
-    X_test = X[2000:, :]
-    Y_train = Y[:2000, :]
-    Y_test = Y[2000:, :]
+    X, y = make_classification(n_samples=1000,
```
Does `generate_multilabel_dataset_with_correlations` not work as is?
@jnothman Thanks. I think we need a random state here to ensure a deterministic result. (In the previous test, the author's purpose was to ensure that different calculation methods get the same result, so a random state was not needed.) There are indeed some cases where ClassifierChain gets a worse result.
Ordinarily we tend to include a random state at all invocations in the tests to avoid occasional failures... (Although I wish that we had marked all tests that should work for any random state.) Might be cleanest to add a `random_state` parameter to that helper.
@jnothman Thanks. I also think that is good. For this test, the improvement of the model is around 1% with both the original dataset and the new dataset. Since the new dataset relies on a random state, I'm really worried about test failures if we don't fix one.
@adamklec, could you remind me whether there was any special motivation for this test being applied to yeast? We're having trouble with the mldata servers' unreliability, and could do without depending on them for tests to pass.
It is probably worth amending the author's function to avoid duplicated effort now and in the future.
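As an illustration, a minimal sketch of what an amended helper might look like. This is an assumption, not the test's actual code: the parameter values and the binary-encoding trick are illustrative; only the added `random_state` argument is the point here.

```python
import numpy as np
from sklearn.datasets import make_classification


def generate_multilabel_dataset_with_correlations(random_state=0):
    # Draw a multiclass problem, then binary-encode the class index so
    # that the resulting label columns are correlated by construction.
    X, y = make_classification(n_samples=1000,
                               n_features=100,
                               n_classes=16,
                               n_informative=10,
                               random_state=random_state)
    # 16 classes -> 4 correlated binary label columns.
    Y_multi = np.array([[int(bit) for bit in format(label, '04b')]
                        for label in y])
    return X, Y_multi
```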
```diff
-        chain = ClassifierChain(LogisticRegression(),
-                                order=np.array([0, 2, 4, 6, 8, 10,
-                                                12, 1, 3, 5, 7, 9,
-                                                11, 13]))
+        chain = ClassifierChain(LogisticRegression())
```
Why the change in orderings?
The specified ordering was chosen on the basis that the test only aimed to show that there exists some order for which the chain exceeds the independent models in performance. (Perhaps a comment should note this.) If the default order works, that's fine.
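For reference, a sketch of the comparison under discussion, using the default order instead of the explicit one. Names and the train/test split are illustrative and assume the helper sketched above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

X, Y = generate_multilabel_dataset_with_correlations(random_state=0)
X_train, Y_train = X[:600], Y[:600]
X_test, Y_test = X[600:], Y[600:]

# Independent per-label baseline.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_train, Y_train)
Y_pred_ovr = ovr.predict(X_test)

# Chain with the default order [0, 1, ..., n_labels - 1]; the old
# explicit order was merely one order known to beat the baseline.
chain = ClassifierChain(LogisticRegression())
chain.fit(X_train, Y_train)
Y_pred_chain = chain.predict(X_test)
```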
Thanks. It seems there's already such a comment at the beginning of the function. If there's anything more to clarify, please leave a comment.
LGTM, merging.
@jnothman Sorry for the delayed response. Your explanation of why I used the yeast dataset is correct. I wanted to write a test asserting that, in the presence of correlated classes, ClassifierChain outperforms independent models. For some reason I was having difficulty doing this using the built-in synthetic dataset generators.
Reference Issue
Fixes #9254
What does this implement/fix? Explain your changes.
I follow the author's method from the previous test to construct a dataset. I don't use the function provided by the author (`generate_multilabel_dataset_with_correlations`) because we need a random state to ensure a deterministic result; sometimes the better model doesn't get the better result.
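To illustrate why the fixed random state matters, here is a hedged sketch of the kind of assertion the test makes, reusing the variables from the comparison sketch above (`jaccard_similarity_score` is the 0.19-era metric name, later renamed `jaccard_score`):

```python
from sklearn.metrics import jaccard_similarity_score

# With label correlations present, the chain should beat the
# independent models. On an unseeded dataset this inequality can
# occasionally flip, which is why the PR fixes the random state.
assert (jaccard_similarity_score(Y_test, Y_pred_chain) >
        jaccard_similarity_score(Y_test, Y_pred_ovr))
```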
Any other comments?