jeremiedbb
Member

This reverts commit 7f26920.

murmurhash3 is used by other projects like skrub, and deprecating it has a huge cost for them:

  • either add a new dependency (like mmh3)
  • or implement it themselves and compile C++ code, so the package is no longer pure Python

Given that we're keeping the Cython implementation and that #32103 was just about removing the public name, the small benefit for sklearn doesn't seem worth the downside for other projects.

From a longer-term perspective, we could discuss using a C++ implementation from the stdlib, as explained here: #27593 (comment). Then the benefit for sklearn would be significant enough to reconsider deprecating murmurhash3.
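
To give a rough idea of what the "implement it" option means for a downstream project, here is a minimal pure-Python sketch of 32-bit MurmurHash3. The name `murmurhash3_32_py` is hypothetical and is not part of scikit-learn; the sketch assumes the same conventions as the public helper (strings encoded as UTF-8, signed result unless `positive=True`). A loop like this is expected to be much slower than the Cython version, which is why a real reimplementation would probably end up as compiled code.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit unsigned integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmurhash3_32_py(key, seed=0, positive=False):
    """Hypothetical pure-Python MurmurHash3 (x86, 32-bit) sketch."""
    if isinstance(key, str):
        key = key.encode("utf-8")  # assumption: strings are hashed as UTF-8

    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(key)
    nblocks = n // 4

    # Body: mix the key four bytes at a time (little-endian blocks).
    for i in range(nblocks):
        k = int.from_bytes(key[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = key[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16

    if positive:
        return h
    # Reinterpret the unsigned value as a signed 32-bit integer.
    return h - (1 << 32) if h >= (1 << 31) else h


print(murmurhash3_32_py("foo"))
print(murmurhash3_32_py("foo", positive=True))
```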

github-actions bot commented Sep 8, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bbcd695.

Member

@virchan virchan left a comment

LGTM! Thanks, @jeremiedbb!

Cc'ing @ogrisel and @lorentzenchr in advance.

@virchan virchan added the "Waiting for Second Reviewer" label (First reviewer is done, need a second one!) Sep 8, 2025
@adam2392
Member

> From a longer-term perspective, we could discuss using a C++ implementation from the stdlib, as explained here: #27593 (comment). Then the benefit for sklearn would be significant enough to reconsider deprecating murmurhash3.

If something exists in STL for Python, or C++, I suppose, why not use it? I'm not too aware of the issues/nuance here, so if you want to educate me, I'll be happy to learn.

If there is no such reason though, we should just go for standard implementations to avoid reinventing the wheel. In the tree/ module, we also removed a lot of code after Cython allowed cimports of a wider range of the C++ STL.

@adam2392
Member

But yeah this also LGTM based on just the merits of the PR itself

@ogrisel
Member

ogrisel commented Sep 15, 2025

> If something exists in STL for Python, or C++, I suppose, why not use it? I'm not too aware of the issues/nuance here, so if you want to educate me, I'll be happy to learn.

I think the C++ lib provides an implementation of murmurhash 2 rather than the murmurhash 3 we currently use in scikit-learn. We would have to check the performance impact of switching back to 2. Maybe it does not matter. Changing the hash function used in the hashing vectorizers will introduce a behavior change (features will be mapped to different columns), but I think that's OK because the column assignments are arbitrary anyway.
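
To illustrate the behavior change, here is a small sketch of the hashing trick in which the column index is the hash value modulo the number of columns. It is only an assumption-laden illustration: `column_index` is a hypothetical helper, and `zlib.crc32` and `zlib.adler32` are stand-ins for "two different hash functions", not the murmurhash 2 and 3 discussed here, nor the exact logic of scikit-learn's vectorizers.

```python
import zlib


def column_index(token, n_features, hash_fn):
    """Map a token to a column with the hashing trick (hypothetical helper)."""
    return hash_fn(token.encode("utf-8")) % n_features


tokens = ["cat", "dog", "fish"]
n_features = 16

# Stand-in hash functions: swapping one for the other changes which column
# each token lands in, but each mapping stays deterministic and in range.
cols_a = {t: column_index(t, n_features, zlib.crc32) for t in tokens}
cols_b = {t: column_index(t, n_features, zlib.adler32) for t in tokens}

print("crc32:  ", cols_a)
print("adler32:", cols_b)
```

Either mapping is equally valid for a fresh model; the catch is that a model fitted on columns produced by one hash function cannot be reused with features produced by the other.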

@ogrisel ogrisel merged commit 396edeb into scikit-learn:main Sep 15, 2025
46 of 47 checks passed
cakedev0 pushed a commit to cakedev0/scikit-learn that referenced this pull request Sep 15, 2025