
Fix OvOClassifier.n_features_in_ and other unexpected warning at prediction time when checking feature_names_in_ #20853


Merged
21 commits merged on Sep 1, 2021

Conversation

@ogrisel (Member) commented Aug 26, 2021

Reference Issues/PRs

Follow up to #18010

What does this implement/fix? Explain your changes.

Add a new test to make sure that we do not raise an unexpected warning when checking the input data at prediction time, notably in meta-estimators: if the data is validated by the meta-estimator at fit time, it should also be validated by the meta-estimator at predict time. The base estimator should therefore not receive a dataframe with column names only at predict time.
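As a minimal sketch of the invariant being tested (an illustration of mine, not the exact test added in this PR):

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

# Fit the meta-estimator on a DataFrame so that feature names are recorded.
X, y = load_iris(return_X_y=True, as_frame=True)
clf = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Predicting on the same DataFrame must not make the *base* estimators warn
# that they were "fitted without feature names": the meta-estimator validates
# (and strips the names from) X before passing it down.
with warnings.catch_warnings():
    warnings.filterwarnings("error", message=".*feature names.*")
    clf.predict(X)
```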

@ogrisel (Member Author) commented Aug 26, 2021

@thomasjpfan @lorentzenchr feel free to push directly into this draft PR to fix the remaining problems. I probably won't have time to work on it in the coming hours/days.

@ogrisel changed the title from "Add test to make avoid raising unexpected warning at prediction time" to "Add test to make avoid raising unexpected warning at prediction time when checking feature_names_in_" on Aug 27, 2021
@ogrisel ogrisel marked this pull request as ready for review August 27, 2021 13:31
@ogrisel (Member Author) left a comment

@thomasjpfan @lorentzenchr this PR is now ready, and it is the last one needed to get test_pandas_column_name_consistency running completely and cleanly for all the scikit-learn modules.

WDYT about the proposed changes below?

# we need row slicing support for sparse matrices, but the costly finiteness
# check can be delegated to the base estimator.
X, y = self._validate_data(
    X, y, accept_sparse=["csr", "csc", "lil", "dok"], force_all_finite=False
)
@ogrisel (Member Author):

Same comment here for force_all_finite=False.

There are probably other meta-estimators that could benefit from a similar treatment.
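As a hedged illustration of that division of labor (a toy example of mine, not code from this PR):

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, np.nan], [3.0, 4.0]])

# The meta-estimator's validation deliberately skips the O(n_samples *
# n_features) scan for NaN/inf, which would otherwise run twice:
check_array(X, force_all_finite=False)  # passes

# The base estimator's own validation still catches non-finite values:
try:
    check_array(X)  # force_all_finite=True by default
except ValueError as exc:
    print(exc)  # "Input contains NaN, infinity or a value too large..."
```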

-    return self.estimators_[0]._validate_X_predict(X, check_input=True)
+    X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
+    if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
+        raise ValueError("No support for np.int64 index based sparse matrices")
@ogrisel (Member Author):

This check is equivalent to self.estimators_[0]._validate_X_predict(X, check_input=True), but it avoids calling self.estimators_[0]._validate_data(X, reset=False) with X being a dataframe only at predict time, which would cause the "was fitted without feature names" warning to be raised.
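For illustration, the failure mode being avoided looks like this (a toy example of mine, not code from the PR):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = (X[:, 0] > 0.5).astype(int)

# Fitted on a plain ndarray, so no feature names are recorded:
est = LogisticRegression().fit(X, y)

# Validating a DataFrame only at predict time triggers the warning:
df = pd.DataFrame(X, columns=["a", "b", "c"])
est.predict(df)  # UserWarning: X has feature names, but LogisticRegression
                 # was fitted without feature names
```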

Member:

Is raising ValueError tested somewhere? I couldn't find it. But this is unrelated to this PR, indeed.

@ogrisel (Member Author):

I agree this would be out of the scope of this PR, which is already more complex than originally intended.

Member:

It seems to not be covered by the test.

@ogrisel (Member Author):

Indeed we need proper test coverage for this.

@@ -740,9 +739,6 @@ def fit(self, X, y):

         self.estimators_ = estimators_indices[0]

-        if hasattr(self.estimators_[0], "n_features_in_"):
-            self.n_features_in_ = self.estimators_[0].n_features_in_

@ogrisel (Member Author) commented Aug 27, 2021:

This was actually causing a bug in OvOClassifier when X is a precomputed kernel:

The number of features for the OvO meta-estimator is larger than the number of features of the base estimators: when we do OvO, we remove the samples of the classes we are not interested in, and therefore also reduce the number of "features", since the columns of a precomputed kernel matrix X are actually samples, not features.

This bug was previously silent but started to break once I fixed the predict-time validation to silence the warning. I will try to write a dedicated test and probably document the fix in a changelog entry.

Member:

+1 for a dedicated test (I see, it's already done). No need for a changelog entry, see comment there.

Member:

No need for a changelog, but we should have this PR in 1.0 then.

score_precomputed = cross_val_score(
multiclass_clf_precomputed, linear_kernel, y, error_score="raise"
)
assert_array_equal(score_precomputed, score_not_precomputed)
@ogrisel (Member Author):

Here the test is fundamentally unchanged: it just fails more informatively with error_score="raise", uses more explicit (correct) variable names, and leverages pytest parametrization.
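For context, a hedged reconstruction of the shape of such a test (variable names match the excerpt above; the exact fixture layout is an assumption of mine):

```python
import pytest
from numpy.testing import assert_array_equal
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC


@pytest.mark.parametrize(
    "MultiClassClassifier", [OneVsRestClassifier, OneVsOneClassifier]
)
def test_pairwise_cross_val_score(MultiClassClassifier):
    X, y = load_iris(return_X_y=True)
    linear_kernel = X @ X.T  # precomputed linear kernel

    multiclass_clf_not_precomputed = MultiClassClassifier(SVC(kernel="linear"))
    multiclass_clf_precomputed = MultiClassClassifier(SVC(kernel="precomputed"))

    # error_score="raise" surfaces fit/predict errors directly instead of
    # silently recording a NaN score for the failing fold.
    score_not_precomputed = cross_val_score(
        multiclass_clf_not_precomputed, X, y, error_score="raise"
    )
    score_precomputed = cross_val_score(
        multiclass_clf_precomputed, linear_kernel, y, error_score="raise"
    )
    assert_array_equal(score_precomputed, score_not_precomputed)
```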

@ogrisel changed the title from "Add test to make avoid raising unexpected warning at prediction time when checking feature_names_in_" to "Fix OvOClassifier.n_features_in_ and other unexpected warning at prediction time when checking feature_names_in_" on Aug 27, 2021
@lorentzenchr (Member) left a comment

LGTM, only a few comments. Thanks @ogrisel for all the effort of finalizing n_features_in_!


Comment on lines 857 to 867
# This becomes really interesting with OvO and precomputed kernel together:
# internally, OvO will drop the samples of the classes not part of the pair
# of classes under consideration for a given binary classifier. Since we
# use a precomputed kernel, it will also drop the matching columns of the
# kernel matrix, and therefore we have fewer "features" as a result. Since
# each class has 50 samples, a single OvO binary classifier works with a
# subkernel matrix of shape (100, 100).
ovo_precomputed = OneVsOneClassifier(clf_precomputed).fit(K, y)
assert ovo_precomputed.n_features_in_ == 150
for est in ovo_precomputed.estimators_:
assert est.n_features_in_ == 100
Member:

I'm not so familiar with OneVsRestClassifier. A second reviewer might have a closer look here.

@ogrisel (Member Author) commented Aug 28, 2021:

You mean OvO? For 3 classes 0, 1 and 2, OvO internally fits 3 binary classifiers:

  • class 0 vs class 1
  • class 0 vs class 2
  • class 1 vs class 2

As each class has 50 samples in the iris dataset, each sub-training set has 50 + 50 == 100 samples and therefore 100 columns in its input kernel matrix.
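A small sketch of that arithmetic (my own illustration):

```python
from itertools import combinations

n_samples_per_class = {0: 50, 1: 50, 2: 50}  # iris
for i, j in combinations(n_samples_per_class, 2):
    n = n_samples_per_class[i] + n_samples_per_class[j]
    # With a precomputed kernel, rows *and* columns are sliced, so each
    # binary classifier sees an (n, n) sub-kernel.
    print(f"classes {i} vs {j}: sub-kernel of shape ({n}, {n})")
```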

@ogrisel ogrisel added this to the 1.0 milestone Aug 30, 2021
@lorentzenchr (Member) commented:

@thomasjpfan @glemaitre I guess this is the last one for n_features_in_. We're soooooo close 🏃😀

Member:

What happens if the number of samples is not the same in each class? I assume that we report the number of features of the first binary classifier. I am not really sure that this is the right thing to report.

Member:

No, the meta-estimator's n_features_in_ is X.shape[1]. We don't report the n_features_in_ of the first clf any more. It's the n_features_in_ of each underlying estimator that will not be the same (if I understood correctly :)).

Member:

Right, we missed this. It would be nice to not have the same number of samples in each class and to check the consistency of the underlying binary classifiers.
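A hedged sketch of what such a check could look like (the data construction and expected values below are my own assumptions, not code from this PR):

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(90, 5)
y = np.repeat([0, 1, 2], [20, 30, 40])  # deliberately unequal class sizes

K = X @ X.T  # precomputed linear kernel, shape (90, 90)
ovo = OneVsOneClassifier(SVC(kernel="precomputed")).fit(K, y)

# The meta-estimator reports the width of the full kernel matrix...
assert ovo.n_features_in_ == 90
# ...while each binary classifier sees only the sub-kernel of its class
# pair: pairs (0, 1), (0, 2) and (1, 2) keep 50, 60 and 70 columns.
assert [est.n_features_in_ for est in ovo.estimators_] == [50, 60, 70]
```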

@jeremiedbb (Member) left a comment

LGTM! (minus @glemaitre's comment)

@ogrisel (Member Author) commented Sep 1, 2021

I updated the new test to make this more explicit.

@glemaitre (Member) left a comment

LGTM

@ogrisel ogrisel merged commit 9fab358 into scikit-learn:main Sep 1, 2021
@ogrisel ogrisel deleted the feature_names_in_-warnings branch September 1, 2021 09:36
@ogrisel (Member Author) commented Sep 1, 2021

Merged! Thanks for the reviews!

ogrisel added a commit to scikit-learn-inria-fondation/follow-up that referenced this pull request Sep 7, 2021
## August 31st, 2021

### Gael

* TODO: Jeremy's renewal, Chiara's replacement, Mathis's consulting gig

### Olivier

- input feature names: main PR [#18010](scikit-learn/scikit-learn#18010) that links to the sub-PRs
  - remaining (need review): [#20853](scikit-learn/scikit-learn#20853) (found a bug in `OvOClassifier.n_features_in_`)
- reviewing `get_feature_names_out`: [#18444](scikit-learn/scikit-learn#18444)
- next: give feedback to Chiara on ARM wheel building [#20711](scikit-learn/scikit-learn#20711) (needed for the release)
- next: assist Adrin for the release process
- next: investigate regression in loky that blocks the cloudpickle release [#432](cloudpipe/cloudpickle#432)
- next: come back to Intel to write a technical roadmap for a possible collaboration

### Julien

 - Was on holidays
 - Planned week @ Nexedi, Lille, from September 13th to 17th
 - Reviewed PRs
     - [`#20567`](scikit-learn/scikit-learn#20567) Common Private Loss module
     - [`#18310`](scikit-learn/scikit-learn#18310) ENH Add option to centered ICE plots (cICE)
     - Other PRs prior to holidays
 - [`#20254`](scikit-learn/scikit-learn#20254)
     - Adapted benchmarks on `pdist_aggregation` to test #20254 against sklearnex
     - Adapting PR for `fast_euclidean` and `fast_sqeuclidean` on user-facing APIs
     - Next: comparing against scipy's 
     - Next: Having feedback on [#20254](scikit-learn/scikit-learn#20254) would also help
- Next: I need to block time to study Cython code.

### Mathis
- `sklearn_benchmarks`
  - Adapting benchmark script to run on Margaret
  - Fix issue with profiling files too big to be deployed on GitHub Pages
  - Ensure deterministic benchmark results
  - Working on declarative pipeline specification
  - Next: run long HPO benchmarks on Margaret

### Arturo

- Finished MOOC!
- Finished filling [Loïc's notes](https://notes.inria.fr/rgSzYtubR6uSOQIfY9Fpvw#) to find questions with score under 60% (Issue [#432](INRIA/scikit-learn-mooc#432))
    - started addressing easy-to-fix questions, resulting in gitlab MRs [#21](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/21) and [#22](https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/merge_requests/22)
    - currently working on expanding the notes up to 70%
- Continued cross-linking forum posts with issues in GitHub, resulting in [#444](INRIA/scikit-learn-mooc#444), [#445](INRIA/scikit-learn-mooc#445), [#446](INRIA/scikit-learn-mooc#446), [#447](INRIA/scikit-learn-mooc#447) and [#448](INRIA/scikit-learn-mooc#448)

### Jérémie
- back from holidays, catching up
- Mathis' benchmarks
- trying to find what's going on with ASV benchmarks
  (asv should display the versions of all build and runtime dependencies for each run)

### Guillaume

- back from holidays
- Next:
    - release with Adrin
    - check the PR and issue trackers

### TODO / Next

- Expand Loïc’s notes up to 70% (Arturo)
- Create presentation to discuss my experience doing the MOOC (Arturo)
- Help with the scikit-learn release (Olivier, Guillaume)
- HR: Jeremy's renewal, Chiara's replacement (Gael)
- Mathis's consulting gig (Olivier, Gael, Mathis)
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Fix OvOClassifier.n_features_in_ and other unexpected warning at prediction time when checking feature_names_in_ (scikit-learn#20853)

Co-authored-by: Christian Lorentzen <[email protected]>