Error on the scikit-learn algorithm cheat-sheet? #30076

Closed · pstingley opened this issue Oct 15, 2024 · 7 comments

@pstingley

Describe the bug

In Clustering, if there are <10K samples, shouldn't "yes" go to "tough luck" (because there aren't enough samples), and "no" go to MeanShift/VBGMM (because there are)?

Steps/Code to Reproduce

N/A

Expected Results

N/A

Actual Results

N/A

Versions

N/A
pstingley added the Bug and Needs Triage labels on Oct 15, 2024
@adrinjalali (Member)

I think you're right (although I personally think 10k is a very large number, and with a low number of features, clustering a couple hundred samples also makes sense). Feel free to submit a pull request with the fix.
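
For what it's worth, here's a minimal sketch of that small-data case (the generated dataset and sizes below are illustrative assumptions, not anything from the cheat sheet): clustering a couple hundred low-dimensional samples with MeanShift is cheap.

```python
# Illustrative only: MeanShift on a couple hundred 2-feature samples.
# At this scale the fit is effectively instantaneous.
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
labels = MeanShift().fit_predict(X)
print("clusters found:", len(set(labels)))
```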

adrinjalali added the Documentation label and removed the Bug and Needs Triage labels on Oct 16, 2024
@adrinjalali (Member)

cc @Charlie-XIAO maybe.

@Charlie-XIAO (Contributor)

Indeed, that's my bad. Fixing that SVG is a bit complicated; maybe I'll do it directly.

@Charlie-XIAO (Contributor) commented Oct 16, 2024

Well, I double-checked and saw that the previous version's "no" also points to "tough luck" (https://scikit-learn.org/1.4/tutorial/machine_learning_map/index.html), so it's actually not my typo. I'm thinking maybe it means that MeanShift and variational BGM models are not suitable for a large number of samples? (I'm not so familiar with those algorithms, though.) @adrinjalali

On second thought, I actually don't think <10K samples would be "tough luck"... Looking at the starting point of the graph, there is actually an arrow pointing to "get more data" when we have fewer than 50 samples.
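
One hedged way to probe the "not suitable for a large number of samples" hypothesis would be to time MeanShift fits as the sample count grows (the sizes, bandwidth, and blob data below are illustrative assumptions):

```python
# Rough timing sketch: MeanShift's fit cost grows quickly with n
# because of the repeated neighbor searches it performs.
import time

from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

for n_samples in (500, 2_000, 8_000):
    X, _ = make_blobs(n_samples=n_samples, centers=4, random_state=0)
    start = time.perf_counter()
    MeanShift(bandwidth=2.0).fit(X)
    print(f"n={n_samples}: {time.perf_counter() - start:.2f}s")
```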

@adrinjalali (Member)

Oh, that might be true. We'd need to check the implementation to confirm. Also, with the hardware we have now, the sample-count threshold might be quite a bit higher.

@virchan (Member) commented Oct 17, 2024

TL;DR: I believe the cheat sheet is correct.

Suppose we are working with an unlabelled dataset with an unknown number of clusters.

For sample_size >= 10K, most scikit-learn clustering algorithms become computationally expensive, so the "tough luck" conclusion seems reasonable. I believe this is what the original author intended.

In the case of 50 < sample_size < 10K, I don't see any issues with using scikit-learn’s MeanShift or VBGMM. Both algorithms are designed for such scenarios. At worst, they can serve as brute-force solutions if no better options are available.
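
As a concrete (and hedged) illustration of that mid-size regime, both estimators can be run without knowing the number of clusters; the dataset, component cap, and weight threshold below are illustrative assumptions. BayesianGaussianMixture is scikit-learn's variational Bayesian GMM, which shrinks the weights of components it does not need:

```python
# Hedged sketch: cluster a mid-sized unlabelled dataset (50 < n < 10K)
# without telling either model the true number of clusters.
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=2_000, centers=5, random_state=42)

# MeanShift infers the number of clusters from the data; the bandwidth
# is estimated internally when not given.
ms = MeanShift().fit(X)
print("MeanShift clusters:", len(ms.cluster_centers_))

# The variational Bayesian GMM ("VBGMM") starts from an upper bound on
# the number of components and down-weights the unused ones.
vbgmm = BayesianGaussianMixture(n_components=10, random_state=42).fit(X)
print("VBGMM active components:", int(np.sum(vbgmm.weights_ > 1e-2)))
```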

@adrinjalali (Member)

Thanks everyone for the discussion; closing.
