Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

MaxwellLZH
Copy link
Contributor

Reference Issues/PRs

Part of issue #21308 . This is the same PR as #21459 (I got into some git issue with the last PR )

What does this implement/fix? Explain your changes.

Implementing get_feature_names_out for RandomForestEmbedding, VotingClassifier, VotingRegressor, StackingClassifier and StackingRegressor, with corresponding test cases.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @MaxwellLZH ! I recommend breaking this PR into 3 PRs:

  1. Keep this one for RandomTreesEmbedding.
  2. Another for Voting*
  3. Another for Stacking* (Do not start this one until Voting* is complete since they are related and will have the same discussions)

This is because I think there is an argument for generating more descriptive names for each of the cases above.

Comment on lines 1807 to 1809
assert_array_equal(
[f"randomtreesembedding{i}" for i in range(hasher._n_features_out)], names
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to explicitly test the public API. We can transform and get the number of features out:

    n_features_out = hasher.transform(X).shape[1]
    assert_array_equal(
        [f"randomtreesembedding{i}" for i in range(n_features_out)], names
    )

There is an argument for using something like randomtreesembedding_3_10, where 3 is represents the tree that used to generate the leaf, and 10 is the leaf index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall I leave the naming as it is for now? if we decided to go for randomtreesembedding_{i}_{j} then we can change the test cases accordingly later?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make a decision in this PR. I like using randomtreesembedding_{i}_{j}, can we update this PR to use this formatting and see what other reviewers think?

This means a custom get_feature_names_out for tree embedding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a custom get_feature_names_out for tree embedding, where i is tree index starting from 1 and j is leaf index as suggested.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Enhacement|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I am happy with the naming for the features here. Let's see what other reviewers think.

@thomasjpfan thomasjpfan changed the title ENH Add get_feature_names_out for ensemble module ENH Add get_feature_names_out for RandomTreesEmbedding module Feb 15, 2022
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the suggestions below as taken into account. I found the feature names surprising so I think it's was necessary to make the docstring more explicit and add an inline comment in the test.

@thomasjpfan
Copy link
Member

I found the feature names surprising so I think it's was necessary to make the docstring more explicit and add an inline comment in the test.

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

I am okay with that option as well.

@ogrisel
Copy link
Member

ogrisel commented Mar 7, 2022

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

The current implementation is more informative but potentially a bit confusing. Using contiguous, leaf-only indexing would make the feature names less dependent on the internal tree data-structure but in a way this internal detail is already part of the public API because those are the indices returned by the apply public method.

So +0 for keeping the current indexing / naming scheme.

@jeremiedbb
Copy link
Member

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

I don't have a strong preference. I'm fine with reflecting the tree structure with the additional comments from Olivier. It might make it easier to debug if needed as well.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @MaxwellLZH

@jeremiedbb jeremiedbb merged commit 26f5b26 into scikit-learn:main Mar 8, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…-learn#21762)

Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants