
FIX Sets feature_names_in_ for estimators in Bagging* #21811


Draft: wants to merge 11 commits into main

Conversation

thomasjpfan (Member)

Reference Issues/PRs

Fixes #21599

What does this implement/fix? Explain your changes.

This PR sets feature_names_in_ for all inner estimators in Bagging*. I placed this in 1.0.2 for now, but I'm open to moving it to 1.1.

Any other comments?

Should we set feature_names_in_ for all meta-estimators that need to validate X first?

@thomasjpfan thomasjpfan added this to the 1.0.2 milestone Nov 28, 2021
@thomasjpfan (Member Author) commented Nov 29, 2021

Turns out this is not so simple:

Setting feature_names_in_ breaks our tests for not raising warnings in non-fit methods. When Bagging* manually sets feature_names_in_ on its child estimators, they will expect pandas DataFrames. On the other hand, Bagging*.predict_proba first validates X and converts it into a NumPy array before calling predict_proba on the child estimators, which triggers an unnecessary warning.
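
A minimal standalone sketch of that warning (not this PR's code; DecisionTreeClassifier here just stands in for any inner estimator that carries feature_names_in_):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({"a": [0, 1, 2, 3], "b": [1, 0, 1, 0]})
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier().fit(df, y)  # records feature_names_in_ = ["a", "b"]
clf.predict(df.to_numpy())                 # warns: fitted with feature names,
                                           # but X is a plain ndarray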

@thomasjpfan thomasjpfan modified the milestones: 1.0.2, 1.1 Dec 21, 2021
@thomasjpfan (Member Author)

Moving to 1.1 since setting feature_names_in_ in estimators_ is a behavior change and does not look like a bug fix.

@jeremiedbb jeremiedbb added this to the 1.1 milestone Mar 15, 2022
@thomasjpfan thomasjpfan removed this from the 1.1 milestone Mar 16, 2022
@thomasjpfan (Member Author)

I still have the same concerns with this PR from #21811 (comment)

I think the "proper solution" is to adapt BaseBagging to use _safe_split, slice the dataframe, and pass the dataframe around. From my understanding of joblib, if we do not cast to an ndarray during BaseBagging.fit, then we lose the benefit of memmapping the ndarray when dispatching the parallel calls.
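
(For context: slicing features out of a DataFrame or ndarray can be done with the private helper _safe_indexing, which this PR ends up using; being private, its import path and signature may differ between scikit-learn versions.)

import numpy as np
import pandas as pd
from sklearn.utils import _safe_indexing

X = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
_safe_indexing(X, [0, 2], axis=1)             # DataFrame with columns "a" and "c"
_safe_indexing(X.to_numpy(), [0, 2], axis=1)  # ndarray with the same two columns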

@jeremiedbb (Member)

According to @ogrisel, memmapping should work with X being a dataframe. It would memmap the underlying numpy array. So I think it should be possible to not call _validate_data but only _check_feature_names in the bagging estimator and pass X as-is in the call to parallel.

@jeremiedbb (Member)

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

X = np.random.RandomState(0).random_sample((100, 100))
df = pd.DataFrame(X)

def f(df):
    # DataFrame argument: inspect the array backing the DataFrame inside the worker
    print(type(df.values.base.base))
    print(df.values.flags)

def g(x):
    # ndarray argument: inspect the array received inside the worker
    print(type(x))
    print(x.flags)

_ = Parallel(n_jobs=2, max_nbytes=1)(delayed(f)(df) for i in range(2))

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

_ = Parallel(n_jobs=2, max_nbytes=1)(delayed(g)(X) for i in range(2))

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

<class 'numpy.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

@thomasjpfan (Member Author)

Updated PR to use _safe_indexing everywhere. I also have the bagging estimator check n_features_in_ because it requires the same number of features to slice correctly.

Commit 87b63ee adds axis=1 support for lists of lists, because we end up slicing a list of lists with axis=1, for example in this test:

def test_set_oob_score_label_encoding():
    # Make sure the oob_score doesn't change when the labels change
    # See: https://github.com/scikit-learn/scikit-learn/issues/8933
    random_state = 5
    X = [[-1], [0], [1]] * 5
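
For reference, axis=1 slicing of a list of lists like this X has no ndarray fast path; a plain-Python equivalent of what the new code path has to do is (illustrative only, not the actual _safe_indexing change):

X = [[-1], [0], [1]] * 5
column_indices = [0]
sliced = [[row[i] for i in column_indices] for row in X]  # still a list of lists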

Fundamentally, 87b63ee could be its own PR if we want to reduce the scope of this PR.

@jeremiedbb (Member)

Fundamentally, 87b63ee could be its own PR if we want to reduce the scope of this PR.

I think it can go into this PR since it's the only use case for now.

@adrinjalali (Member) left a comment

LGTM

# Bagging may slice `X` along the features axis before passing `X` to the base
# estimators. This means that each estimator may get a different subset of
# features. For bagging to slice correctly during non-fit methods, bagging
# needs to set the feature names in `fit` and validate them in non-fit methods.

Suggested change
# needs to set the feature names in `fit` and validate them in non-fit methods.
# needs to set the feature names in `fit` and validate them in non-fit methods.
# This is usually done in _validate_data, but we don't want to validate data in
# this meta-estimator. The sub-estimators will do that.

    return list(container)
else:
    return np.asarray(container, dtype=dtype).tolist()

Suggested change
    return np.asarray(container, dtype=dtype).tolist()
    # calling .tolist() on an ndarray returns a list of lists
    return np.asarray(container, dtype=dtype).tolist()

@@ -444,6 +444,9 @@ Changelog
for instance using cgroups quota in a docker container. :pr:`22566` by
:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| :class:`ensemble.BaggingRegressor` and :class:`ensemble.BaggingClassifier`
sets `feature_names_in_` in `estimators_`. :pr:`21811` by `Thomas Fan`_.

Suggested change
sets `feature_names_in_` in `estimators_`. :pr:`21811` by `Thomas Fan`_.
set ``feature_names_in_`` in ``estimators_``. :pr:`21811` by `Thomas Fan`_.

@jeremiedbb (Member)

This is more complicated :(
Common tests expect specific error messages in a specific order. It should be possible to do it, but it requires a bit more work and makes us duplicate a lot of code from _validate_data. Ideally, _validate_data should be able to let X pass through (i.e. not convert it to an ndarray) but still perform some checks, like checking that y is not None.
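
A rough sketch of that direction, assuming a scikit-learn contemporary with this PR where BaseEstimator still provides the private _validate_data/_check_feature_names helpers and _validate_data accepts the "no_validation" sentinel (the toy estimator below is illustrative, not BaseBagging):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator

class PassThroughMeta(BaseEstimator):
    def fit(self, X, y):
        # validate y only; X is passed through without conversion to ndarray
        y = self._validate_data(X="no_validation", y=y)
        # still record feature names and n_features_in_ for later consistency checks
        self._check_feature_names(X, reset=True)
        self.n_features_in_ = X.shape[1]
        return self

df = pd.DataFrame(np.random.rand(6, 2), columns=["a", "b"])
PassThroughMeta().fit(df, np.zeros(6)).feature_names_in_  # array(['a', 'b'], dtype=object)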

@thomasjpfan thomasjpfan marked this pull request as draft April 7, 2022 16:18
@thomasjpfan (Member Author)

Yeah, this is more complex: there are a lot of common test errors that arise when Bagging* does not call _validate_data.

I am marking this PR as draft for now.

Linked issue: _check_feature_names raises UserWarning when accessing bagged estimators (#21599)