RFECV docstring does not state how the cv_results_ attribute is ordered by #28580

Closed
ArturoSbr opened this issue Mar 6, 2024 · 6 comments · Fixed by #28646
Comments

@ArturoSbr
Contributor

Describe the issue linked to the documentation

This StackOverflow post has more details regarding this small issue.

In essence, I noticed that the documentation for RFECV does not state how the cv_results_ attribute is ordered.

Given that the process is recursive, some users (myself included) may assume that the dictionary is sorted in descending order (i.e., the first element corresponds to the models that used ALL features, the next to the models after one elimination step, then after two steps, etc.). However, it seems to me that the dictionary is sorted in ascending order.

Suggest a potential alternative/fix

From my perspective, the easiest fix would be to add a few lines to the docstring. Something along the lines of:

This dictionary is sorted by the number of features in ascending order (i.e., the first element represents the models that use the fewest features, while the last element represents the models that use all available features).

As an alternative, the resulting dictionary could have an additional key named n_features (or something along those lines) that states how many features each element in the dictionary represents.

@ArturoSbr ArturoSbr added Documentation Needs Triage Issue requires triage labels Mar 6, 2024
@adrinjalali adrinjalali removed the Needs Triage Issue requires triage label Mar 6, 2024
@adrinjalali
Member

Thanks for the report. Feel free to open a PR to improve the docstring.

@miguelcsilva
Contributor

miguelcsilva commented Mar 17, 2024

@ArturoSbr are you working on this? Would be interested to contribute.

@ogrisel
Member

ogrisel commented Mar 19, 2024

@ArturoSbr also suggested the following:

As an alternative, the resulting dictionary could have an additional key named n_features (or something along those lines) that states how many features each element in the dictionary represents.

I think this would further help the usability of this attribute (e.g. by loading it into a DataFrame to plot n_features vs mean_test_score with error bars).

We could then update the following example to make it simpler by using pd.DataFrame(rfecv.cv_results_) and not having to magically recompute the x values of the plot with `range(min_features_to_select, n_scores + min_features_to_select)`:

https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html
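Until such a key is available, the x values have to be rebuilt by hand, and the range(...) expression above only holds for step=1. The sketch below shows one way to recompute them for any integer step. The helper feature_counts is hypothetical (it is not part of scikit-learn) and assumes that features are dropped step at a time from the full set, never going below min_features_to_select.

```python
def feature_counts(n_features, min_features_to_select=1, step=1):
    """Return the number of features behind each cv_results_ entry, ascending.

    Hypothetical helper, not a scikit-learn function. Assumes an integer
    `step` >= 1 and that elimination starts from all features, removing
    `step` features at a time without going below `min_features_to_select`.
    """
    counts = [n_features]
    remaining = n_features
    while remaining > min_features_to_select:
        # The final step may remove fewer than `step` features.
        remaining -= min(step, remaining - min_features_to_select)
        counts.append(remaining)
    return counts[::-1]  # ascending, matching the order of cv_results_


# With step=1 this reduces to the range(...) expression quoted above.
print(feature_counts(20, min_features_to_select=1, step=1) == list(range(1, 21)))

# With step=2 the counts are no longer a plain range.
print(feature_counts(20, min_features_to_select=1, step=2))
```

The returned list could be added as an extra column next to mean_test_score and std_test_score before plotting.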

@miguelcsilva would you be interested in working on a separate PR to implement this?

@miguelcsilva
Contributor

@miguelcsilva would you be interested in working on a separate PR to implement this?

Yes, thanks for that. I'll get started working on a separate PR for this.

@miguelcsilva
Contributor

@ogrisel while working on this I noticed the following unexpected behavior. If I initialize the RFECV with a min_features_to_select larger than the number of features that I pass to the fit method, I do not get an error (as I expected), but instead a result is returned. See the minimal example below:

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from pandas import DataFrame

X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0, n_classes=2, random_state=0)

rfecv = RFECV(
    estimator=LogisticRegression(random_state=0),
    min_features_to_select=21,
    cv=2,
    step=2,
    scoring="precision"
)
rfecv.fit(X=X, y=y)

print(DataFrame(rfecv.cv_results_))

Although I do not know the code base too well, my suspicion is that we start from an array with the value for min_features_to_select already present and append the remaining steps until we reach the maximum number of features. This however means that we get the unexpected behavior described above.

One option would be to raise an InvalidParameterError at the start of the fit method like this:

        if self.min_features_to_select > n_features:
            raise InvalidParameterError(
                "Minimum number of features to select cannot exceed maximum number of available features."
            )
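The proposed check can be tried outside of scikit-learn with a stand-alone function. This is only a sketch: check_min_features is a hypothetical name, and it raises a plain ValueError rather than InvalidParameterError so that it does not depend on scikit-learn internals.

```python
def check_min_features(min_features_to_select, n_features):
    """Hypothetical stand-alone version of the check proposed above.

    Raises ValueError for simplicity; the snippet in the comment uses
    scikit-learn's InvalidParameterError instead.
    """
    if min_features_to_select > n_features:
        raise ValueError(
            f"min_features_to_select={min_features_to_select} cannot exceed "
            f"the number of available features ({n_features})."
        )


check_min_features(5, 20)  # valid combination, passes silently

try:
    check_min_features(21, 20)  # the case from the minimal example
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Including both values in the message makes it easier to see which parameter to change.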

Let me know if you think it is worth raising an error here, or if you have any other ideas. Also, do you think this is better handled in the PR I'm currently working on, or in a new one?

@ogrisel
Member

ogrisel commented Mar 19, 2024

Let's open a dedicated PR for this!
