Fix OvOClassifier.n_features_in_ and other unexpected warning at prediction time when checking feature_names_in_ #20853
Conversation
@thomasjpfan @lorentzenchr feel free to push directly into this draft PR to fix the remaining problems. I probably won't have time to work on it in the coming hours / days.
@thomasjpfan @lorentzenchr this PR is now ready and is the last one needed to get a complete and clean run of test_pandas_column_name_consistency for all the scikit-learn modules.
WDYT about the proposed changes below?
# we need row slicing support for sparse matrices, but the costly finiteness check
# can be delegated to the base estimator.
X, y = self._validate_data(
    X, y, accept_sparse=["csr", "csc", "lil", "dok"], force_all_finite=False
Same comment here for force_all_finite=False.
There are probably other meta-estimators that could benefit from a similar treatment.
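To make the suggestion concrete, here is a minimal sketch of that validation split using the public check_X_y helper instead of the private _validate_data method (the variable names and toy data are made up for this illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

# Toy sparse input in LIL format, one of the formats that support row slicing.
X = sparse.random(6, 4, density=0.5, format="lil", random_state=0)
y = np.array([0, 1, 0, 1, 0, 1])

# Meta-estimator level: accept the sparse formats that support row slicing and
# skip the costly finiteness scan (force_all_finite=False); the base estimator
# will run its own full validation on each sub-problem anyway.
X_checked, y_checked = check_X_y(
    X, y, accept_sparse=["csr", "csc", "lil", "dok"], force_all_finite=False
)
```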
-        return self.estimators_[0]._validate_X_predict(X, check_input=True)
+        X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
+        if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
+            raise ValueError("No support for np.int64 index based sparse matrices")
This check is equivalent to self.estimators_[0]._validate_X_predict(X, check_input=True), but it avoids calling self.estimators_[0]._validate_data(X, reset=False) with X being a dataframe only at predict time, which would cause the "was fitted without feature names" warning to be raised.
Is raising ValueError tested somewhere? I couldn't find it. But this is unrelated to this PR, indeed.
I agree this would be out of the scope of this PR, which is already more complex than originally intended.
It seems to not be covered by the test.
Indeed we need proper test coverage for this.
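To make that concrete, a test along these lines could cover the error path. This is only a sketch, assuming the snippet above is the forest predict-time validation; RandomForestClassifier and the exact assertions are illustrative, not taken from this PR:

```python
import numpy as np
import pytest
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Small sparse training set with the default 32-bit indices.
X = sparse.random(30, 5, density=0.5, format="csr", dtype=np.float32, random_state=0)
y = np.array([0, 1] * 15)
clf = RandomForestClassifier(n_estimators=2, random_state=0).fit(X, y)

# Force 64-bit sparse indices: the predict-time validation should reject them.
X_int64 = X.copy()
X_int64.indices = X_int64.indices.astype(np.int64)
X_int64.indptr = X_int64.indptr.astype(np.int64)

with pytest.raises(ValueError, match="No support for np.int64 index based sparse matrices"):
    clf.predict(X_int64)
```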
…have a different number of 'features' than meta-estimator
@@ -740,9 +739,6 @@ def fit(self, X, y):

        self.estimators_ = estimators_indices[0]

-        if hasattr(self.estimators_[0], "n_features_in_"):
-            self.n_features_in_ = self.estimators_[0].n_features_in_
This was actually causing a bug in OvOClassifier when X is a pre-computed kernel: the number of features for the OvO meta-estimator is larger than the number of features of the base estimators, because OvO drops the samples of the classes that are not part of the current pair, and thereby also drops "features", since the columns of a precomputed kernel matrix X are actually samples, not features.
This bug was previously silent but started to break once I fixed the predict-time validation to silence the warning. I will try to write a dedicated test and probably document the fix in a changelog entry.
+1 for a dedicated test (I see, it's already done). No need for a changelog entry, see comment there.
No need for a changelog entry, but we should have this PR in 1.0 then.
score_precomputed = cross_val_score(
    multiclass_clf_precomputed, linear_kernel, y, error_score="raise"
)
assert_array_equal(score_precomputed, score_not_precomputed)
Here the test is fundamentally unchanged: it just fails more informatively with error_score="raise", uses more explicit (correct) variable names, and leverages pytest parametrization.
LGTM, only a few comments. Thanks @ogrisel for all the effort of finalizing n_features_in_!
sklearn/tests/test_multiclass.py (outdated)
# This becomes really interesting with OvO and precomputed kernel together:
# internally, OvO will drop the samples of the classes not part of the pair
# of classes under consideration for a given binary classifier. Since we
# use a precomputed kernel, it will also drop the matching columns of the
# kernel matrix, and therefore we have fewer "features" as a result. Since
# each class has 50 samples, a single OvO binary classifier works with a
# subkernel matrix of shape (100, 100).
ovo_precomputed = OneVsOneClassifier(clf_precomputed).fit(K, y)
assert ovo_precomputed.n_features_in_ == 150
for est in ovo_precomputed.estimators_:
    assert est.n_features_in_ == 100
I'm not so familiar with OneVsRestClassifier. A second reviewer might have a closer look here.
You mean OvO? For 3 classes 0, 1 and 2, OvO internally fits 3 binary classifiers:
- class 0 vs class 1
- class 0 vs class 2
- class 1 vs class 2
As each class has 50 samples on the iris dataset, each sub-training set has 50 + 50 == 100 samples and therefore 100 columns in its input kernel matrix.
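For readers less familiar with this corner case, here is a self-contained sketch mirroring the assertions of the new test (the attribute values assume a scikit-learn version that includes this fix, where the meta-estimator's n_features_in_ is taken from the full input X):

```python
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import linear_kernel
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes with 50 samples each
K = linear_kernel(X)               # shape (150, 150): kernel columns are samples

ovo = OneVsOneClassifier(SVC(kernel="precomputed")).fit(K, y)

# The meta-estimator sees the full kernel matrix...
assert ovo.n_features_in_ == 150
# ...while each binary sub-problem keeps only the 100 rows/columns of its pair.
for est in ovo.estimators_:
    assert est.n_features_in_ == 100
```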
Co-authored-by: Christian Lorentzen <[email protected]>
@thomasjpfan @glemaitre I guess this is the last one for…
sklearn/tests/test_multiclass.py (outdated)
# use a precomputed kernel, it will also drop the matching columns of the
# kernel matrix, and therefore we have fewer "features" as a result. Since
# each class has 50 samples, a single OvO binary classifier works with a
# subkernel matrix of shape (100, 100).
What happens if the number of samples is not the same in each class? I assume that we report the number of features of the first binary classifier. I am not really sure if this is the right thing to report.
No, the meta-estimator's n_features_in_ is X.shape[1]; we don't report the n_features_in_ of the first clf any more. It's the n_features_in_ of each underlying estimator that will not be the same (if I understood correctly :)).
Right, we missed this. It could be nice to not have the same number of samples in each class and check the consistency of n_features_in_ in the underlying binary classifiers.
LGTM! (minus @glemaitre's comment)
I updated the new test to make this more explicit.
LGTM
Merged! Thanks for the reviews!
## August 31st, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links into sub PRs
  - remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to Intel to write a technical roadmap for a possible collaboration

### Julien

- Was on holidays
- Planned week @ Nexedi, Lille, from September 13th to 17th
- Reviewed PRs:
  - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
  - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
- Other PRs prior to holidays:
  - [`#20254`](scikit-learn/scikit-learn#20254)
    - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
    - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
    - Next: comparing against scipy's
- Next: Having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
- Next: I need to block time to study Cython code.

### Mathis

- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on GitHub Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
- Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
  - started addressing easy-to-fix questions, resulting in GitLab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
  - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie

- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks (asv should display the versions of all build and runtime dependencies for each run)

### Guillaume

- back from holidays
- Next:
  - release with Adrin
  - check the PR and issue trackers

### TODO / Next

- Expand Loïc's notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
…iction time when checking feature_names_in_ (scikit-learn#20853) Co-authored-by: Christian Lorentzen <[email protected]>
Reference Issues/PRs
Follow up to #18010
What does this implement/fix? Explain your changes.
Add a new test to make sure that we do not raise an unexpected warning when checking the input data at prediction time, notably in meta-estimators: if the data is validated by the meta-estimator at fit time, it should also be validated by the meta-estimator at predict time. The base estimator should therefore not receive a dataframe with column names at predict time.
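As a hedged illustration of the warning being guarded against (not code from this PR; it assumes the feature-name checks introduced around scikit-learn 1.0), an estimator fitted on a plain ndarray warns when it later sees a DataFrame with column names at predict time:

```python
import warnings
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X_df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]})
y = np.array([0, 0, 1, 1])

# Fitted on a plain ndarray, i.e. without feature names...
clf = LogisticRegression().fit(X_df.to_numpy(), y)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # ...so a DataFrame at predict time triggers a UserWarning along the lines
    # of "X has feature names, but LogisticRegression was fitted without
    # feature names".
    clf.predict(X_df)

print([str(w.message) for w in caught])
```

This is exactly the situation a meta-estimator creates for its base estimators when it strips column names from X at fit time but forwards the raw DataFrame at predict time.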