RFECV docstring does not state how the cv_results_ attribute is ordered by #28580
Comments
Thanks for the report. Feel free to open a PR to improve the docstring.
@ArturoSbr are you working on this? Would be interested to contribute.
@ArturoSbr also suggested the following:
I think this would further help the usability of this attribute (e.g. by loading it into a dataframe for plotting). We could then update the following example to make it simpler: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html @miguelcsilva would you be interested in working on a separate PR to implement this?
Yes, thanks for that. I'll get started working on a separate PR for this.
@ogrisel while working on this I noticed the following unexpected behavior. If I initialize the RFECV with min_features_to_select larger than the number of available features:

    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification
    from pandas import DataFrame

    X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0, n_classes=2, random_state=0)
    rfecv = RFECV(
        estimator=LogisticRegression(random_state=0),
        min_features_to_select=21,
        cv=2,
        step=2,
        scoring="precision",
    )
    rfecv.fit(X=X, y=y)
    print(DataFrame(rfecv.cv_results_))

Although I do not know the code base too well, my suspicion is that we start from an array with the value for [...]. One option would be to raise an InvalidParameterError:

    if self.min_features_to_select > n_features:
        raise InvalidParameterError(
            "Minimum number of features to select cannot exceed maximum number of available features."
        )

Let me know if you think it is worth throwing an error here or if you have any other ideas. Also, do you think this is better treated in the same PR I'm currently working on, or in a new one?
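For anyone who wants to try the proposed check in isolation, here is a minimal sketch. It is not the scikit-learn implementation: `check_min_features` is a hypothetical helper, and the `InvalidParameterError` class below is a local stand-in defined only so the snippet runs without scikit-learn installed.

```python
class InvalidParameterError(ValueError):
    """Local stand-in for scikit-learn's exception of the same name (assumption)."""


def check_min_features(min_features_to_select: int, n_features: int) -> None:
    """Hypothetical helper: reject a minimum that exceeds the available features."""
    if min_features_to_select > n_features:
        raise InvalidParameterError(
            f"min_features_to_select={min_features_to_select} cannot exceed "
            f"the number of available features ({n_features})."
        )


# The configuration from the comment above: 20 features, minimum of 21.
try:
    check_min_features(min_features_to_select=21, n_features=20)
except InvalidParameterError as exc:
    print(f"rejected: {exc}")

# A valid configuration passes silently.
check_min_features(min_features_to_select=5, n_features=20)
```

The check has to run at fit time rather than at construction, since the number of features is only known once X is seen.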
Let's open a dedicated PR for this!
Describe the issue linked to the documentation
This StackOverflow post has more details regarding this small issue.
In essence, I noticed that the documentation for RFECV does not state how the cv_results_ attribute is ordered. Given that the process is recursive, some users (myself included) may assume that the dictionary is sorted in descending order (i.e., the first element corresponds to the models that used ALL features, then one step less, then two steps less, etc.). However, it seems to me that the dictionary is sorted in ascending order.
Suggest a potential alternative/fix
From my perspective, the easiest fix would be to add a few lines to the docstring. Something along the lines of:
As an alternative, the resulting dictionary could have an additional key named n_features (or something along those lines) that states how many features each element in the dictionary represents.
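Until such a key exists, the feature counts can be reconstructed by hand. The sketch below is a hypothetical helper (`rfe_feature_counts` is not part of scikit-learn) and rests on two assumptions that should be checked against the installed version: each elimination round removes `step` features without going below `min_features_to_select`, and the entries of cv_results_ are stored in ascending order of feature count, as observed above. Only integer `step` values are handled.

```python
def rfe_feature_counts(
    n_features: int, min_features_to_select: int = 1, step: int = 1
) -> list[int]:
    """Feature counts evaluated during elimination, in ascending order.

    Hypothetical helper, integer ``step`` only.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    if not 1 <= min_features_to_select <= n_features:
        raise ValueError("min_features_to_select must be in [1, n_features]")

    # Elimination starts from all features and drops `step` per round,
    # clamping the last round so it lands exactly on the minimum.
    counts = [n_features]
    while counts[-1] > min_features_to_select:
        counts.append(max(counts[-1] - step, min_features_to_select))

    # Reverse to match the assumed ascending order of cv_results_.
    return counts[::-1]


print(rfe_feature_counts(n_features=10, min_features_to_select=2, step=3))
# [2, 4, 7, 10]
```

Note that with `step` greater than 1 the counts are not evenly spaced from the minimum upward, so a plain `range(min_features_to_select, ...)` would mislabel the entries.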