FIX Sets feature_names_in_ for estimators in Bagging* #21811
Turns out this is not so simple: Setting
Moving to 1.1 since setting
I still have the same concerns with this PR from #21811 (comment). I think the "proper solution" is to adapt
According to @ogrisel, memmapping should work with X being a dataframe. It would memmap the underlying numpy array. So I think it should be possible to not call
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

X = np.random.RandomState(0).random_sample((100, 100))
df = pd.DataFrame(X)

def f(df):
    # Inspect the array underlying the DataFrame inside the worker.
    print(type(df.values.base.base))
    print(df.values.flags)

def g(X):
    # Inspect the array received by the worker, not the global one.
    print(type(X))
    print(X.flags)

_ = Parallel(n_jobs=2, max_nbytes=1)(delayed(f)(df) for i in range(2))
<class 'numpy.memmap'>
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
<class 'numpy.memmap'>
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
_ = Parallel(n_jobs=2, max_nbytes=1)(delayed(g)(X) for i in range(2))
<class 'numpy.memmap'>
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
<class 'numpy.memmap'>
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Updated PR to use This commit: 87b63ee adds scikit-learn/sklearn/ensemble/tests/test_bagging.py Lines 817 to 821 in f9321be
Fundamentally, 87b63ee can be its own PR if we want to reduce the scope of this PR.
I think it can go into this PR since it's the only use case for now.
LGTM
# Bagging may slice `X` along the features axis before passing `X` to the base
# estimators. This means that each estimator may get a different subset of
# features. For bagging to slice correctly during non-fit methods, bagging
# needs to set the feature names in `fit` and validate them in non-fit methods.
- # needs to set the feature names in `fit` and validate them in non-fit methods.
+ # needs to set the feature names in `fit` and validate them in non-fit methods.
+ # This is usually done in _validate_data, but we don't want to validate data in
+ # this meta-estimator. The sub-estimators will do that.
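The idea in the comment above can be illustrated with a small, self-contained sketch. This is not scikit-learn's implementation: `ToyFeatureBagger` and its `slices` method are hypothetical names invented for illustration, and plain Python lists stand in for a dataframe. Only the attribute names `feature_names_in_` and `estimators_features_` are borrowed from scikit-learn. The sketch records the feature names in `fit`, checks them in a non-fit method, and only then slices along the feature axis for each sub-estimator.

```python
import random


class ToyFeatureBagger:
    """Hypothetical stand-in for a bagging meta-estimator (not scikit-learn code)."""

    def __init__(self, n_estimators=3, max_features=2, seed=0):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.seed = seed

    def fit(self, columns, rows):
        # Record the input feature names once, in `fit`.
        # `rows` is accepted for symmetry but not used by this toy.
        self.feature_names_in_ = list(columns)
        rng = random.Random(self.seed)
        # Each sub-estimator draws its own subset of column indices.
        self.estimators_features_ = [
            sorted(rng.sample(range(len(columns)), self.max_features))
            for _ in range(self.n_estimators)
        ]
        return self

    def _check_feature_names(self, columns):
        # Non-fit methods compare incoming names against those seen in `fit`.
        if list(columns) != self.feature_names_in_:
            raise ValueError(
                f"Feature names {list(columns)} do not match those seen "
                f"in fit: {self.feature_names_in_}"
            )

    def slices(self, columns, rows):
        # Validate first, then slice along the feature axis per sub-estimator.
        self._check_feature_names(columns)
        return [
            ([columns[i] for i in idx], [[row[i] for i in idx] for row in rows])
            for idx in self.estimators_features_
        ]


columns = ["a", "b", "c"]
rows = [[1, 2, 3], [4, 5, 6]]
bagger = ToyFeatureBagger().fit(columns, rows)
for names, sliced in bagger.slices(columns, rows):
    print(names, sliced)
```

Calling `bagger.slices(["c", "b", "a"], rows)` raises a `ValueError` in this sketch, because the column order differs from the one seen in `fit` and index-based slicing would otherwise pick the wrong features silently.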
return list(container)
else:
return np.asarray(container, dtype=dtype).tolist()
return np.asarray(container, dtype=dtype).tolist()
- return np.asarray(container, dtype=dtype).tolist()
+ # calling .tolist() on an ndarray returns a list of lists
+ return np.asarray(container, dtype=dtype).tolist()
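To make the distinction in the suggested comment concrete, here is a minimal sketch comparing `ndarray.tolist()` with the built-in `list()` on a small 2-D array. The array contents and the variable names are made up for illustration and are not taken from the PR.

```python
import numpy as np

arr = np.asarray([[1, 2], [3, 4]], dtype=np.int64)

as_nested = arr.tolist()  # nested Python lists holding Python ints
as_rows = list(arr)       # a list whose items are still 1-D ndarrays

print(type(as_nested[0]), type(as_nested[0][0]))
print(type(as_rows[0]))
```

For a 2-D input, `tolist()` converts every level to plain Python objects, whereas `list()` only unpacks the first axis.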
@@ -444,6 +444,9 @@ Changelog
  for instance using cgroups quota in a docker container. :pr:`22566` by
  :user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| :class:`ensemble.BaggingRegressor` and :class:`ensemble.BaggingClassifier`
  sets `feature_names_in_` in `estimators_`. :pr:`21811` by `Thomas Fan`_.
-   sets `feature_names_in_` in `estimators_`. :pr:`21811` by `Thomas Fan`_.
+   set ``feature_names_in_`` in ``estimators_``. :pr:`21811` by `Thomas Fan`_.
This is more complicated :(
Yea this is more complex, there are a lot of common test errors that arise when
I am marking this PR as draft for now.
Reference Issues/PRs
Fixes #21599
What does this implement/fix? Explain your changes.
This PR sets `feature_names_in_` for all inner estimators in `Bagging*`. I placed this in 1.0.2 for now, but am open to moving it to 1.1.
Any other comments?
Should we set `feature_names_in_` for all meta-estimators that need to validate first?