Error on the scikit-learn algorithm cheat-sheet? #30076

Closed · pstingley opened this issue Oct 15, 2024 · 7 comments

@pstingley

Describe the bug

In Clustering, if there are <10K samples, shouldn't "yes" go to "tough luck" (because there aren't enough samples), and "no" go to MeanShift/VBGMM (because there are)?

Steps/Code to Reproduce

N/A

Expected Results

N/A

Actual Results

N/A

Versions

N/A
pstingley added the Bug and Needs Triage labels on Oct 15, 2024
@adrinjalali (Member)

I think you're right (although I personally think 10k is a very large number, and with a low number of features, clustering a couple hundred samples also makes sense). Feel free to submit a pull request with the fix.
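
For what it's worth, here's a minimal sketch of that small-data case (the generated dataset and sizes below are illustrative assumptions, not anything from the cheat sheet): clustering a couple hundred low-dimensional samples with MeanShift is cheap.

```python
# Illustrative only: MeanShift on a couple hundred 2-feature samples.
# At this scale the fit is effectively instantaneous.
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
labels = MeanShift().fit_predict(X)
print("clusters found:", len(set(labels)))
```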

adrinjalali added the Documentation label and removed the Bug and Needs Triage labels on Oct 16, 2024
@adrinjalali (Member)

cc @Charlie-XIAO maybe.

@Charlie-XIAO (Contributor)

Indeed, that's my bad. Fixing that SVG is a bit complicated; maybe I'll do it directly.

@Charlie-XIAO (Contributor) commented Oct 16, 2024

Well, I double-checked and saw that the previous version's "no" also points to "tough luck" (https://scikit-learn.org/1.4/tutorial/machine_learning_map/index.html), so it's actually not my typo. I'm thinking maybe it means that MeanShift and variational BGM models are not suitable for a large number of samples? (I'm not so familiar with those algorithms, though.) @adrinjalali

On second thought, I actually don't think <10K samples would be "tough luck"... Looking at the starting point of the graph, there is actually an arrow pointing to "get more data" when we have fewer than 50 samples.
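
One hedged way to probe the "not suitable for a large number of samples" hypothesis would be to time MeanShift fits as the sample count grows (the sizes, bandwidth, and blob data below are illustrative assumptions):

```python
# Rough timing sketch: MeanShift's fit cost grows quickly with n
# because of the repeated neighbor searches it performs.
import time

from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

for n_samples in (500, 2_000, 8_000):
    X, _ = make_blobs(n_samples=n_samples, centers=4, random_state=0)
    start = time.perf_counter()
    MeanShift(bandwidth=2.0).fit(X)
    print(f"n={n_samples}: {time.perf_counter() - start:.2f}s")
```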

@adrinjalali (Member)

Oh, that might be true. We'd need to check the implementation to confirm. Also, with the hardware we have now, the sample-count threshold might be quite a bit higher.

@virchan (Member) commented Oct 17, 2024

TL;DR: I believe the cheat sheet is correct.

Suppose we are working with an unlabelled dataset with an unknown number of clusters.

For sample_size >= 10K, most scikit-learn clustering algorithms become computationally expensive, so the "tough luck" conclusion seems reasonable. I believe this is what the original author intended.

In the case of 50 < sample_size < 10K, I don't see any issues with using scikit-learn’s MeanShift or VBGMM. Both algorithms are designed for such scenarios. At worst, they can serve as brute-force solutions if no better options are available.
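
As a concrete (and hedged) illustration of that mid-size regime, both estimators can be run without knowing the number of clusters; the dataset, component cap, and weight threshold below are illustrative assumptions. BayesianGaussianMixture is scikit-learn's variational Bayesian GMM, which shrinks the weights of components it does not need:

```python
# Hedged sketch: cluster a mid-sized unlabelled dataset (50 < n < 10K)
# without telling either model the true number of clusters.
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=2_000, centers=5, random_state=42)

# MeanShift infers the number of clusters from the data; the bandwidth
# is estimated internally when not given.
ms = MeanShift().fit(X)
print("MeanShift clusters:", len(ms.cluster_centers_))

# The variational Bayesian GMM ("VBGMM") starts from an upper bound on
# the number of components and down-weights the unused ones.
vbgmm = BayesianGaussianMixture(n_components=10, random_state=42).fit(X)
print("VBGMM active components:", int(np.sum(vbgmm.weights_ > 1e-2)))
```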

@adrinjalali (Member)

Thanks everyone for the discussion; closing.
