Conversation

@thomasjpfan
Member

Reference Issues/PRs

Fixes #13199

What does this implement/fix? Explain your changes.

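When input_type="string", FeatureHasher expects an iterable of iterables of strings (for example, a list of lists of strings). This PR raises an error if the input is just a flat list of strings, which was previously hashed character by character.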
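For illustration, the kind of check described here can be sketched in plain Python. This is a minimal sketch, not the code from the PR: the helper name `check_string_samples` and the error message wording are assumptions made for the example.

```python
from itertools import chain


def check_string_samples(raw_X):
    """Hypothetical helper: reject a flat iterable of strings.

    Only the first sample is inspected. It is then chained back in front
    of the remaining samples so nothing is lost when raw_X is a generator.
    """
    raw_X = iter(raw_X)
    try:
        first = next(raw_X)
    except StopIteration:
        # Empty input: nothing to validate.
        return iter(())
    if isinstance(first, str):
        # Illustrative message, not the library's actual wording.
        raise ValueError(
            "Each sample must be an iterable of strings, not a single string."
        )
    return chain([first], raw_X)


# A list of lists of strings passes through unchanged.
valid = list(check_string_samples([["cat", "dog"], ["fish"]]))

# A flat list of strings is rejected.
try:
    check_string_samples(["cat", "dog"])
    flat_rejected = False
except ValueError:
    flat_rejected = True
```

Peeking at one element keeps the check cheap and still works when the input is a one-shot iterator.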

Any other comments?

I am open to deprecating as suggested in #13199 (comment), but from reading the original issue I think the current behavior is a bug.

@Micky774
Contributor

Micky774 commented Dec 1, 2022

Since we check only the first element, a list such as [["my_string"], "another_string"] is considered valid, and "another_string" is interpreted as a list of 1-character strings. Is this intentional, or should we validate each element? If it is intentional, could we make this explicit in the documentation?
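The edge case raised here can be reproduced without any hashing at all, since it follows from how Python iterates over strings. The snippet below is only a sketch of that behavior; the variable names are illustrative.

```python
samples = [["my_string"], "another_string"]

# A first-element check only looks at samples[0], which is a list,
# so the mixed input is not flagged.
first_is_str = isinstance(samples[0], str)

# Iterating over each sample to collect its tokens treats the bare
# string as a sequence of 1-character strings.
tokens_per_sample = [list(sample) for sample in samples]
```

Here `tokens_per_sample[0]` holds the single token "my_string", while `tokens_per_sample[1]` holds the individual characters of "another_string".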

@betatim
Member

betatim commented Dec 2, 2022

I think [["my_string"], "another_string"] is invalid. For me the question is if the validation should be exhaustive or "best effort" (this PR). It seems a likely cause of a user mistake is using a sequence of strings, instead of a sequence of sequences of strings. This means that checking the first item is enough to catch this common mistake.

I think I'd be ok with just checking the first instance. We can see if we get more bug reports and then act.
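For comparison, the exhaustive alternative mentioned above could look roughly like the following. This is a hypothetical sketch (the generator name `validate_each_sample` and its message are assumptions), not something proposed in the PR.

```python
def validate_each_sample(raw_X):
    """Hypothetical exhaustive variant: check every sample as it is consumed."""
    for i, sample in enumerate(raw_X):
        if isinstance(sample, str):
            # Illustrative message, not the library's actual wording.
            raise ValueError(
                f"Sample {i} is a single string; expected an iterable of strings."
            )
        yield sample


checked = validate_each_sample([["my_string"], "another_string"])
first_sample = next(checked)  # the first sample is fine

try:
    next(checked)  # the second sample is a bare string
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Because the generator is lazy, the error surfaces only when the offending sample is reached, and every sample pays for one extra `isinstance` call. That is the cost being weighed against the "best effort" first-element check.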


@jnothman jnothman left a comment

I think this is a useful improvement.

@jeremiedbb
Member

> I am open to deprecating as suggested in #13199 (comment), but from reading the original issue I think the current behavior is a bug.

I'm okay with considering it a bugfix.

@jeremiedbb
Member

> I think I'd be ok with just checking the first instance.

That's what we usually do in such cases.


@jeremiedbb jeremiedbb left a comment

LGTM

@jeremiedbb jeremiedbb enabled auto-merge (squash) January 3, 2023 17:20
@jeremiedbb jeremiedbb merged commit 2931760 into scikit-learn:main Jan 3, 2023
jpangas pushed a commit to jpangas/scikit-learn that referenced this pull request Jan 4, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023

Development

Successfully merging this pull request may close these issues.

Using ColumnTransformer with FeatureHasher(string) hashes characters instead of strings

5 participants