DOC improved plot_semi_supervised_newsgroups.py example #31104
❌ Linting issues
This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling You can see the details of the linting issues under the
Hi @StefanieSenger. Could you please review the modifications I've made to this example and let me know if we need to change something?
Thanks a lot for your work, @elhambbi. It's going to upgrade the example a lot!
I especially like the part at the beginning that explains the intent of the example and summarises the approaches that get compared to each other.
I have a few small comments. It would also be nice to make the code more notebook-style and to introduce some sub-sections.
Would you mind doing that?
1. Supervised learning using 100% of labeled data (baseline)
- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision
The bullet points don't get rendered as intended. Can you please fix this?
You can see the rendering by building the documentation locally or by clicking "check the rendered docs" in the CI.
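For reference, a common reason nested bullets fail to render in reStructuredText is a missing blank line or misaligned indentation. The sketch below shows one layout that reST parses as a numbered list with nested bullets. The introductory sentence and the `DOCSTRING` name are hypothetical, and whether this was the actual cause in this PR is an assumption.

```python
# Hypothetical excerpt of an example's module docstring. The list wording
# comes from the snippet under review; the layout is an assumed fix.
DOCSTRING = """
The following approaches are compared:

1. Supervised learning using 100% of labeled data

   - Uses SGDClassifier with TF-IDF features
   - Represents the best possible performance with full supervision

2. Supervised learning using only 20% of labeled data
"""

# Nested bullets are separated from their parent item by a blank line and
# indented to align with the parent item's text (three spaces after "1.").
print(DOCSTRING)
```

Building the documentation locally remains the reliable way to confirm the result.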
ls_pipeline, X_train, y_train_semi, X_test, y_test
)

# Create the plot
I think that instead of having comments like `# Create the plot`, `# Define colors for each bar`, `# Customize plot`, or `# Add value labels on bars`, it would be nice to add 1-2 sentences on top of this part explaining what will happen in the next section, which would adhere more closely to the notebook style that we strive for with the examples. This is a good example to look at if you need some inspiration, @elhambbi: #26956
Hi @elhambbi, did you have some time to look into this?
Hi @StefanieSenger. I apologize, I've been too busy lately. I'll work on it over the weekend.
Hi @elhambbi, no rush. I just wanted to know whether this PR is still active. Take your time.
Hi @StefanieSenger. Sorry, I've been quite busy lately. I have modified the code. Please let me know if it needs further improvement. Thank you
Thank you @elhambbi!
I have only one small suggestion on the notebook style and some nitpicks. Overall it looks pretty good to me.
Feel free to make more notebook-style changes, but with these two blocks in place I'm pretty happy from my side and would then forward this to a maintainer to have a look.
The example uses the 20 newsgroups dataset, focusing on five categories.
The results demonstrate how semi-supervised methods can achieve better
performance than supervised learning with limited labeled data by
effectively utilizing unlabeled samples.

(Sorry, this is really a nit.) We don't need this line.
Hi @StefanieSenger. I made the changes, and the file is updated now. I closed the PR by mistake and then reopened it. The commit history is gone but the code is correct with all the changes we discussed. Thank you
@elhambbi thanks for the PR. Could you please push a fix to add the missing new line as reported by the linter? This would help the Continuous Integration proceed with building the HTML preview of the example edited by the PR, which makes it easier to review.
You can adjust the number of categories by giving their names to the dataset
loader or setting them to `None` to get all 20 of them.
1. Supervised learning using 100% of labeled data (baseline)
I wouldn't call this a baseline: if we do not have access to 100% labeled data, it will probably not be possible to reach the performance of this model. I would rather call it a "best case scenario".
- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision

2. Supervised learning using only 20% of labeled data
This on the other hand could be called a "baseline" to compare semi-supervised learning methods to.

# select a mask of 20% of the train dataset
y_mask = np.random.rand(len(y_train)) < 0.2
# Evaluate supervised model with 100% of training data
We should probably split each call to `eval_and_get_f1` into its own cell by using `# %%` delimiters. Then comments such as `# Evaluate supervised model with 100% of training data` should be rephrased as grammatically correct sentences or paragraphs instead of short inline code comments, as @StefanieSenger suggested above.
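As a rough illustration of that cell structure, the sketch below puts the label-masking step into `# %%` cells with prose introductions. The synthetic `y_train`, the seed, and the section title are assumptions made so the snippet runs on its own; it is not the code from the PR. It relies on scikit-learn's convention that semi-supervised estimators treat the label `-1` as "unlabeled".

```python
# %%
# Masking the training labels
# ---------------------------
# To simulate a setting with few labels, we keep roughly 20% of the training
# labels and mark the remaining samples as unlabeled with the value -1.
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
y_train = rng.integers(0, 5, size=1000)  # hypothetical stand-in for real labels

y_mask = rng.random(len(y_train)) < 0.2
y_train_semi = y_train.copy()
y_train_semi[~y_mask] = -1

# %%
# We then check which fraction of the training samples kept its label before
# fitting any model.
labeled_fraction = y_mask.mean()
print(f"{labeled_fraction:.1%} of the training samples are labeled")
```

Each `# %%` line starts a new cell, and the comment lines directly below it are rendered as text in the built gallery page, which is where the explanatory sentences would go.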
Thanks for your work, @elhambbi! I see this really taking shape, and the fact that a core developer is now reviewing this (not just me as a team member) shows the progress of this PR. (Please note, for expectation management: what I wrote in the "What comes next?" section of #30621 applies here. A PR like this is expected to be thoroughly reviewed, and the result will be something a new contributor can be very proud of once it is finished.) I will review more later.
Just a little comment on the commit history: it has disappeared because of force-pushing, which we try to avoid at all costs. PRs are easier to review when we can see the commit history. As someone coming back to this PR regularly, I can then rely on looking only at the newest changes instead of going through everything from the beginning. (I only have so much mental capacity and I jump a lot between very different topics and PRs.)
Reference Issues/PRs
Towards #30621 and with reference to PR #30882, the code for `plot_semi_supervised_newsgroups.py` is improved.
What does this implement/fix? Explain your changes.
Any other comments?
LabelSpreading, as a semi-supervised method, is not better than the fully supervised models in terms of F1 score. Should we keep it or is SelfTraining enough to demonstrate the superiority of semi-supervised methods when having limited data?