Add number of features used at each step to RFECV.cv_results_ #28670
Thanks for the PR @miguelcsilva.
I think it would make things even simpler if RFE exposed the list of n_features kept at each iteration as a fitted attribute array. This way you wouldn't have to recompute the list with this complicated logic when building the cv_results_ dict. It would also make the computation of the final number of features to select easier. The following
scikit-learn/sklearn/feature_selection/_rfe.py
Lines 761 to 767 in 02b7d44
scores = np.array(scores)
scores_sum = np.sum(scores, axis=0)
scores_sum_rev = scores_sum[::-1]
argmax_idx = len(scores_sum) - np.argmax(scores_sum_rev) - 1
n_features_to_select = max(
    n_features - (argmax_idx * step), self.min_features_to_select
)
could be replaced by
scores_sum = np.sum(scores, axis=0)
n_features_to_select = rfe.n_features_list_[np.argmax(scores_sum)]
where rfe.n_features_list_ is the new attribute, better name to be found.
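For illustration, the per-iteration feature counts that such an attribute would hold can be sketched as below. This is a minimal sketch, not the code in `_rfe.py`: the helper name `n_features_per_step` is hypothetical, and it assumes an integer `step` and that the last elimination is clipped so the count never drops below `min_features_to_select`.

```python
def n_features_per_step(n_features, step, min_features_to_select=1):
    """Return the number of features kept at each elimination iteration.

    Hypothetical helper for illustration only. Counts are listed in
    elimination order, from all features down to the minimum.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    counts = [n_features]
    # Remove `step` features per iteration, clipping at the minimum.
    while counts[-1] > min_features_to_select:
        counts.append(max(counts[-1] - step, min_features_to_select))
    return counts


counts_even = n_features_per_step(10, step=3)      # step divides evenly
counts_clipped = n_features_per_step(10, step=4)   # last step is clipped
```

With a list like this stored at fit time, the selected number of features is a single lookup by index rather than arithmetic on `step`.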
Thanks very much for the clean PR. In addition to @jeremiedbb's comment, here are a few more:
@ogrisel and @jeremiedbb, I think I have addressed most of your suggestions. Did as @jeremiedbb proposed and added a There are still two tests failing that I'm not quite sure how to get passing. The first is that on the below, when
The second error is on Finally, it seems like I also have an error on the changelog even though I added the changes to the changelog for v1.5. Also not sure if I'm doing something wrong here: should I have added it to v1.4 instead? It would be great if one of the maintainers could steer me in the right direction on the points above. Hopefully I can then learn this for next time.
Could you please add an entry to the changelog under I will try to review the rest later today.
Thanks @miguelcsilva. I directly pushed some changes to fix the CI issues.
It could be fixed by reversing the arrays so that in case of tie breaks the lowest number of features is selected. I pushed it back in my last commit. I still find it a lot more readable than before. And I added a comment to explain why we reverse.
Actually, I reverted making it a public attribute. I made it a fitted attribute accessible only by RFECV, like the
Since I made the attribute not public, this CI issue is thus no longer relevant.
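The tie-break point can be made concrete with a small sketch. The scores and feature counts below are invented for illustration, and the counts are assumed to be ordered from most features to fewest. `np.argmax` returns the first maximum, so indexing without reversing picks the larger feature count on a tie, while reversing both arrays first picks the smaller one, matching the original arithmetic.

```python
import numpy as np

# Hypothetical data: rows are CV folds, columns are RFE iterations,
# ordered from most features kept (10) to fewest (1).
n_features_list = np.array([10, 7, 4, 1])
scores = np.array(
    [
        [0.80, 0.90, 0.90, 0.70],
        [0.82, 0.92, 0.92, 0.72],
    ]
)
scores_sum = np.sum(scores, axis=0)  # iterations with 7 and 4 features tie

# Without reversing, argmax returns the first maximum: 7 features.
naive = n_features_list[np.argmax(scores_sum)]

# Reversing both arrays makes argmax hit the smallest count first: 4 features.
tie_aware = n_features_list[::-1][np.argmax(scores_sum[::-1])]

# Original arithmetic from the snippet quoted earlier, with step=3 and
# min_features_to_select=1 assumed for this example.
argmax_idx = len(scores_sum) - np.argmax(scores_sum[::-1]) - 1
legacy = max(10 - (argmax_idx * 3), 1)
```

Here `tie_aware` and `legacy` agree, whereas `naive` keeps more features than necessary for the same summed score.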
LGTM
Nitpicking about variable names, but LGTM, whatever your final decision.
Thanks for your help with this @jeremiedbb. Just read through the changes you made. Very insightful. Will try to keep it in mind for the next one.
LGTM. Thanks @miguelcsilva!
Reference Issues/PRs
Addresses the suggestion made here.
What does this implement/fix? Explain your changes.
Add a new key to the RFECV.cv_results_ dictionary. This key is named n_features and its value is a numpy array with the number of features used at each step of the recursive feature elimination process.
It also adds a new test that verifies: 1) the added array is correct; 2) the size of all arrays of this dict is the same.
Finally, it updates the docs here to make use of the simplified way to build the plot. See below plot for the rendered version of the new doc page:
Any other comments?
Tried to make the code roughly aligned with the current codebase logic, though I'm not sure I've fully been able to adhere to the repo spirit (especially on the type hints in the tests). So feel free to propose any changes/corrections.