
FIX Sets feature_names_in_ for estimators in Bagging* #21811


Draft: wants to merge 11 commits into main

Conversation

thomasjpfan (Member)

Reference Issues/PRs

Fixes #21599

What does this implement/fix? Explain your changes.

This PR sets feature_names_in_ for all inner estimators in Bagging*. I placed this in 1.0.2 for now, but I'm open to moving it to 1.1.

Any other comments?

Should we set feature_names_in_ for all meta-estimators that need to validate X first?

@thomasjpfan thomasjpfan added this to the 1.0.2 milestone Nov 28, 2021
@thomasjpfan (Member Author) commented Nov 29, 2021

Turns out this is not so simple:

Setting feature_names_in_ breaks our tests for not raising warnings in non-fit methods. When Bagging* manually sets feature_names_in_ on its child estimators, they will expect pandas DataFrames. On the other hand, Bagging*.predict_proba first validates X and converts it into a NumPy array before calling predict_proba on the child estimators, which triggers an unnecessary warning.
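
A minimal standalone sketch of that warning (not this PR's code; DecisionTreeClassifier here just stands in for any inner estimator that carries feature_names_in_):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({"a": [0, 1, 2, 3], "b": [1, 0, 1, 0]})
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier().fit(df, y)  # records feature_names_in_ = ["a", "b"]
clf.predict(df.to_numpy())                 # warns: fitted with feature names,
                                           # but X is a plain ndarray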

@thomasjpfan thomasjpfan modified the milestones: 1.0.2, 1.1 Dec 21, 2021
@thomasjpfan (Member Author)

Moving to 1.1 since setting feature_names_in_ in estimators_ is a behavior change and does not look like a bug fix.

@jeremiedbb jeremiedbb added this to the 1.1 milestone Mar 15, 2022
@thomasjpfan thomasjpfan removed this from the 1.1 milestone Mar 16, 2022
@thomasjpfan (Member Author)

I still have the same concerns with this PR from #21811 (comment)

I think the "proper solution" is to adapt BaseBagging to use _safe_split, slice the dataframe, and pass the dataframe around. From my understanding of joblib, if we do not cast to an ndarray during BaseBagging.fit, then we lose the benefit of memmapping the ndarray when dispatching the parallel calls.
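
(For context: slicing features out of a DataFrame or ndarray can be done with the private helper _safe_indexing, which this PR ends up using; being private, its import path and signature may differ between scikit-learn versions.)

import numpy as np
import pandas as pd
from sklearn.utils import _safe_indexing

X = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
_safe_indexing(X, [0, 2], axis=1)             # DataFrame with columns "a" and "c"
_safe_indexing(X.to_numpy(), [0, 2], axis=1)  # ndarray with the same two columns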

@jeremiedbb (Member)

According to @ogrisel, memmapping should work with X being a dataframe. It would memmap the underlying numpy array. So I think it should be possible to not call _validate_data but only _check_feature_names in the bagging estimator and pass X as-is in the call to parallel.

@jeremiedbb (Member)

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

X = np.random.RandomState(0).random_sample((100, 100))
df = pd.DataFrame(X)

def f(df):
    # DataFrame argument: inspect the array backing the DataFrame inside the worker
    print(type(df.values.base.base))
    print(df.values.flags)

def g(x):
    # ndarray argument: inspect the array received inside the worker
    print(type(x))
    print(x.flags)

_ = Parallel(n_jobs=2, max_nbytes=1)(delayed(f)(df) for i in range(2))

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

_ = Parallel(n_jobs=2, max_nbytes=1)(delayed(g)(X) for i in range(2))

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

@thomasjpfan (Member Author)

Updated PR to use _safe_indexing everywhere. I also have the bagging estimator check n_features_in_ because it requires the same number of features to slice correctly.

Commit 87b63ee adds axis=1 support for lists of lists, because we end up slicing a list of lists with axis=1, for example in this test:

def test_set_oob_score_label_encoding():
    # Make sure the oob_score doesn't change when the labels change
    # See: https://github.com/scikit-learn/scikit-learn/issues/8933
    random_state = 5
    X = [[-1], [0], [1]] * 5
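
For reference, axis=1 slicing of a list of lists like this X has no ndarray fast path; a plain-Python equivalent of what the new code path has to do is (illustrative only, not the actual _safe_indexing change):

X = [[-1], [0], [1]] * 5
column_indices = [0]
sliced = [[row[i] for i in column_indices] for row in X]  # still a list of lists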

Fundamentally, 87b63ee could be its own PR if we want to reduce the scope of this PR.

@jeremiedbb (Member)

Fundamentally, 87b63ee could be its own PR if we want to reduce the scope of this PR.

I think it can go into this PR since it's the only use case for now.

@adrinjalali (Member) left a comment

LGTM

# Bagging may slice `X` along the features axis before passing `X` to the base
# estimators. This means that each estimator may get a different subset of
# features. For bagging to slice correctly during non-fit methods, bagging
# needs to set the feature names in `fit` and validate them in non-fit methods.

Suggested change
# needs to set the feature names in `fit` and validate them in non-fit methods.
# needs to set the feature names in `fit` and validate them in non-fit methods.
# This is usually done in _validate_data, but we don't want to validate data in
# this meta-estimator. The sub-estimators will do that.

    return list(container)
else:
    return np.asarray(container, dtype=dtype).tolist()

Suggested change
    return np.asarray(container, dtype=dtype).tolist()
    # calling .tolist() on an ndarray returns a list of lists
    return np.asarray(container, dtype=dtype).tolist()

@@ -444,6 +444,9 @@ Changelog
for instance using cgroups quota in a docker container. :pr:`22566` by
:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| :class:`ensemble.BaggingRegressor` and :class:`ensemble.BaggingClassifier`
sets `feature_names_in_` in `estimators_`. :pr:`21811` by `Thomas Fan`_.

Suggested change
sets `feature_names_in_` in `estimators_`. :pr:`21811` by `Thomas Fan`_.
set ``feature_names_in_`` in ``estimators_``. :pr:`21811` by `Thomas Fan`_.

@jeremiedbb (Member)

This is more complicated :(
Common tests expect specific error messages in a specific order. It should be possible to do it, but it requires a bit more work and makes us duplicate a lot of code from _validate_data. Ideally, _validate_data should be able to let X pass through (i.e. not convert it to an ndarray) but still perform some checks, like checking that y is not None.
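
A rough sketch of that direction, assuming a scikit-learn contemporary with this PR where BaseEstimator still provides the private _validate_data/_check_feature_names helpers and _validate_data accepts the "no_validation" sentinel (the toy estimator below is illustrative, not BaseBagging):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator

class PassThroughMeta(BaseEstimator):
    def fit(self, X, y):
        # validate y only; X is passed through without conversion to ndarray
        y = self._validate_data(X="no_validation", y=y)
        # still record feature names and n_features_in_ for later consistency checks
        self._check_feature_names(X, reset=True)
        self.n_features_in_ = X.shape[1]
        return self

df = pd.DataFrame(np.random.rand(6, 2), columns=["a", "b"])
PassThroughMeta().fit(df, np.zeros(6)).feature_names_in_  # array(['a', 'b'], dtype=object)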

@thomasjpfan thomasjpfan marked this pull request as draft April 7, 2022 16:18
@thomasjpfan (Member Author)

Yeah, this is more complex: there are a lot of common test errors that arise when Bagging* does not call _validate_data.

I am marking this PR as draft for now.

Linked issue: _check_feature_names raises UserWarning when accessing bagged estimators (#21599)