Fix OvOClassifier.n_features_in_ and other unexpected warning at prediction time when checking feature_names_in_ #20853
Conversation
@thomasjpfan @lorentzenchr feel free to push directly into this draft PR to fix the remaining problems. I probably won't have time to work on it in the coming hours / days.
@thomasjpfan @lorentzenchr this PR is now ready and is the last one needed to get a complete and clean run of test_pandas_column_name_consistency for all the scikit-learn modules.
WDYT about the proposed changes below?
# we need row slicing support for sparse matrices, but the costly finiteness check
# can be delegated to the base estimator.
X, y = self._validate_data(
    X, y, accept_sparse=["csr", "csc", "lil", "dok"], force_all_finite=False
Same comment here for force_all_finite=False.
There are probably other meta-estimators that could benefit from a similar treatment.
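To make the suggestion concrete, here is a minimal sketch of that validation split using the public check_X_y helper instead of the private _validate_data method (the variable names and toy data are made up for this illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

# Toy sparse input in LIL format, one of the formats that support row slicing.
X = sparse.random(6, 4, density=0.5, format="lil", random_state=0)
y = np.array([0, 1, 0, 1, 0, 1])

# Meta-estimator level: accept the sparse formats that support row slicing and
# skip the costly finiteness scan (force_all_finite=False); the base estimator
# will run its own full validation on each sub-problem anyway.
X_checked, y_checked = check_X_y(
    X, y, accept_sparse=["csr", "csc", "lil", "dok"], force_all_finite=False
)
```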
-        return self.estimators_[0]._validate_X_predict(X, check_input=True)
+        X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
+        if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
+            raise ValueError("No support for np.int64 index based sparse matrices")
This check is equivalent to self.estimators_[0]._validate_X_predict(X, check_input=True), but it avoids calling self.estimators_[0]._validate_data(X, reset=False) with X being a dataframe only at predict time, which would cause the "was fitted without feature names" warning to be raised.
Is raising ValueError tested somewhere? I couldn't find it. But this is unrelated to this PR, indeed.
I agree this would be out of the scope of this PR, which is already more complex than originally intended.
It seems to not be covered by the test.
Indeed we need proper test coverage for this.
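To make that concrete, a test along these lines could cover the error path. This is only a sketch, assuming the snippet above is the forest predict-time validation; RandomForestClassifier and the exact assertions are illustrative, not taken from this PR:

```python
import numpy as np
import pytest
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Small sparse training set with the default 32-bit indices.
X = sparse.random(30, 5, density=0.5, format="csr", dtype=np.float32, random_state=0)
y = np.array([0, 1] * 15)
clf = RandomForestClassifier(n_estimators=2, random_state=0).fit(X, y)

# Force 64-bit sparse indices: the predict-time validation should reject them.
X_int64 = X.copy()
X_int64.indices = X_int64.indices.astype(np.int64)
X_int64.indptr = X_int64.indptr.astype(np.int64)

with pytest.raises(ValueError, match="No support for np.int64 index based sparse matrices"):
    clf.predict(X_int64)
```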
…have a different number of 'features' than meta-estimator
@@ -740,9 +739,6 @@ def fit(self, X, y):

        self.estimators_ = estimators_indices[0]

-        if hasattr(self.estimators_[0], "n_features_in_"):
-            self.n_features_in_ = self.estimators_[0].n_features_in_
This was actually causing a bug in OvOClassifier when X is a pre-computed kernel: the number of features for the OvO meta-estimator is larger than the number of features of the base estimators, because OvO drops the samples of the classes that are not part of the current pair, and thereby also drops "features", since the columns of a precomputed kernel matrix X are actually samples, not features.
This bug was previously silent but started to break once I fixed the predict-time validation to silence the warning. I will try to write a dedicated test and probably document the fix in a changelog entry.
+1 for a dedicated test (I see, it's already done). No need for a changelog entry, see comment there.
No need for a changelog entry, but we should have this PR in 1.0 then.
score_precomputed = cross_val_score(
    multiclass_clf_precomputed, linear_kernel, y, error_score="raise"
)
assert_array_equal(score_precomputed, score_not_precomputed)
Here the test is fundamentally unchanged: it just fails more informatively with error_score="raise", uses more explicit (correct) variable names, and leverages pytest parametrization.
LGTM, only a few comments. Thanks @ogrisel for all the effort of finalizing n_features_in_!
sklearn/tests/test_multiclass.py (outdated)
# This becomes really interesting with OvO and precomputed kernel together:
# internally, OvO will drop the samples of the classes not part of the pair
# of classes under consideration for a given binary classifier. Since we
# use a precomputed kernel, it will also drop the matching columns of the
# kernel matrix, and therefore we have fewer "features" as a result. Since
# each class has 50 samples, a single OvO binary classifier works with a
# subkernel matrix of shape (100, 100).
ovo_precomputed = OneVsOneClassifier(clf_precomputed).fit(K, y)
assert ovo_precomputed.n_features_in_ == 150
for est in ovo_precomputed.estimators_:
    assert est.n_features_in_ == 100
I'm not so familiar with OneVsRestClassifier. A second reviewer might have a closer look here.
You mean OvO? For 3 classes 0, 1 and 2, OvO internally fits 3 binary classifiers:
- class 0 vs class 1
- class 0 vs class 2
- class 1 vs class 2
As each class has 50 samples on the iris dataset, each sub-training set has 50 + 50 == 100 samples and therefore 100 columns in its input kernel matrix.
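For readers less familiar with this corner case, here is a self-contained sketch mirroring the assertions of the new test (the attribute values assume a scikit-learn version that includes this fix, where the meta-estimator's n_features_in_ is taken from the full input X):

```python
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import linear_kernel
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes with 50 samples each
K = linear_kernel(X)               # shape (150, 150): kernel columns are samples

ovo = OneVsOneClassifier(SVC(kernel="precomputed")).fit(K, y)

# The meta-estimator sees the full kernel matrix...
assert ovo.n_features_in_ == 150
# ...while each binary sub-problem keeps only the 100 rows/columns of its pair.
for est in ovo.estimators_:
    assert est.n_features_in_ == 100
```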
Co-authored-by: Christian Lorentzen <[email protected]>
@thomasjpfan @glemaitre I guess this is the last one for…
sklearn/tests/test_multiclass.py (outdated)
# use a precomputed kernel, it will also drop the matching columns of the
# kernel matrix, and therefore we have fewer "features" as a result. Since
# each class has 50 samples, a single OvO binary classifier works with a
# subkernel matrix of shape (100, 100).
What happens if the number of samples is not the same in each class? I assume that we report the number of features of the first binary classifier. I am not really sure if this is the right thing to report.
No, the meta-estimator's n_features_in_ is X.shape[1]; we don't report the n_features_in_ of the first clf any more. It's the n_features_in_ of each underlying estimator that will not be the same (if I understood correctly :)).
Right, we missed this. It could be nice to not have the same number of samples in each class and check the consistency of n_features_in_ in the underlying binary classifiers.
LGTM! (minus @glemaitre's comment)
I updated the new test to make this more explicit.
LGTM
Merged! Thanks for the reviews!
## August 31st, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links into sub PRs
  - remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to Intel to write a technical roadmap for a possible collaboration

### Julien

- Was on holidays
- Planned week @ Nexedi, Lille, from September 13th to 17th
- Reviewed PRs:
  - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
  - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
- Other PRs prior to holidays:
  - [`#20254`](scikit-learn/scikit-learn#20254)
    - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
    - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
    - Next: comparing against scipy's
- Next: Having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
- Next: I need to block time to study Cython code.

### Mathis

- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on GitHub Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
- Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
  - started addressing easy-to-fix questions, resulting in GitLab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
  - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie

- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks (asv should display the versions of all build and runtime dependencies for each run)

### Guillaume

- back from holidays
- Next:
  - release with Adrin
  - check the PR and issue trackers

### TODO / Next

- Expand Loïc's notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
…iction time when checking feature_names_in_ (scikit-learn#20853) Co-authored-by: Christian Lorentzen <[email protected]>
Reference Issues/PRs
Follow up to #18010
What does this implement/fix? Explain your changes.
Add a new test to make sure that we do not raise an unexpected warning when checking the input data at prediction time, notably in meta-estimators: if the data is validated by the meta-estimator at fit time, it should also be validated by the meta-estimator at predict time. The base estimator should therefore not receive a dataframe with column names at predict time.
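As a hedged illustration of the warning being guarded against (not code from this PR; it assumes the feature-name checks introduced around scikit-learn 1.0), an estimator fitted on a plain ndarray warns when it later sees a DataFrame with column names at predict time:

```python
import warnings
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X_df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]})
y = np.array([0, 0, 1, 1])

# Fitted on a plain ndarray, i.e. without feature names...
clf = LogisticRegression().fit(X_df.to_numpy(), y)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # ...so a DataFrame at predict time triggers a UserWarning along the lines
    # of "X has feature names, but LogisticRegression was fitted without
    # feature names".
    clf.predict(X_df)

print([str(w.message) for w in caught])
```

This is exactly the situation a meta-estimator creates for its base estimators when it strips column names from X at fit time but forwards the raw DataFrame at predict time.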