Description
Describe the bug
Sometimes the labels returned by fit_predict and .fit.predict are not the same (they are identical up to a permutation). cc @jeremiedbb. It would be nice if someone could try to reproduce this on their machine.
Rerunning the tests multiple times, it seems like the test only fails with csr_matrix and float32. This is the number of failures per parametrization over 20 runs:
6 _________ test_k_means_fit_predict[4-300-0.1-csr_matrix-float32-full] __________
6 ________ test_k_means_fit_predict[3-300-1e-07-csr_matrix-float32-elkan] ________
5 _________ test_k_means_fit_predict[4-300-0.1-csr_matrix-float32-elkan] _________
5 ________ test_k_means_fit_predict[3-300-1e-07-csr_matrix-float32-full] _________
5 _________ test_k_means_fit_predict[0-2-1e-07-csr_matrix-float32-full] __________
3 _________ test_k_means_fit_predict[0-2-1e-07-csr_matrix-float32-elkan] _________
2 __________ test_k_means_fit_predict[1-2-0.1-csr_matrix-float32-elkan] __________
This is probably a side effect of #21052, which removed v_measure_score. I am not sure whether we want to put v_measure_score back.
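To make "identical up to a permutation" concrete, here is a minimal sketch of a check that accepts relabeled clusterings where an element-wise comparison would fail. The helper name `same_up_to_permutation` is hypothetical and is not part of scikit-learn; the two short lists are the first ten labels of `x` and `y` from the assertion output in this report.

```python
def same_up_to_permutation(labels_a, labels_b):
    """Return True if the two labelings describe the same partition.

    Hypothetical helper for illustration: it checks that there is a
    one-to-one mapping between the label values of the two sequences.
    """
    if len(labels_a) != len(labels_b):
        return False
    forward = {}   # label in labels_a -> label in labels_b
    backward = {}  # label in labels_b -> label in labels_a
    for a, b in zip(labels_a, labels_b):
        # setdefault records the pairing on first sight and returns the
        # previously recorded partner otherwise.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# First ten labels of x and y from the failing assertion in this report.
x_head = [9, 0, 9, 2, 2, 9, 2, 4, 1, 7]
y_head = [9, 2, 9, 0, 0, 9, 0, 1, 8, 7]

print(x_head == y_head)                        # element-wise equality fails
print(same_up_to_permutation(x_head, y_head))  # the partitions agree
```

A permutation-invariant comparison like this (or a score such as v_measure_score equal to 1.0) would hide the relabeling, which is presumably why the failure only became visible once the test switched to a strict array comparison.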
IIRC, when I looked at #20199 my feeling was that the CI, because it uses pytest-xdist, has no (or a very low level of) OpenMP parallelism, which makes this kind of test brittleness invisible in the CI. Edit: I double-checked, and in Azure there are 2 CPUs (see this log with "2 CPUs" at the beginning), so when pytest-xdist is installed we use pytest -n 2 and we don't have any OpenMP multi-threading (we use threadpoolctl to limit OpenMP multi-threading during the tests here).
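One way to probe the OpenMP hypothesis locally is to run the body of the test once with OpenMP limited to a single thread and once with the default thread count. The following is only a sketch: `count_label_mismatches` is a hypothetical helper, `n_init=10` is set explicitly as an assumption, the `algorithm` parameter is left at its default, and no particular outcome is claimed since the failure is intermittent.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits


def count_label_mismatches(seed=4, max_iter=300, tol=0.1):
    """Hypothetical helper mirroring the csr_matrix/float32 test case."""
    rng = np.random.RandomState(seed)
    X = make_blobs(n_samples=1000, n_features=10, centers=10, random_state=rng)[0]
    X = sp.csr_matrix(X.astype(np.float32, copy=False))
    kmeans = KMeans(
        n_clusters=10, n_init=10, random_state=seed, tol=tol, max_iter=max_iter
    )
    labels_1 = kmeans.fit(X).predict(X)
    labels_2 = kmeans.fit_predict(X)
    # Number of samples whose label differs between the two code paths.
    return int(np.sum(labels_1 != labels_2))


# Limit OpenMP to one thread, similar to what the test suite does in CI.
with threadpool_limits(limits=1, user_api="openmp"):
    single_threaded = count_label_mismatches()

# Default OpenMP thread count, as in a local run without pytest-xdist.
default_threads = count_label_mismatches()

print("mismatches with 1 OpenMP thread:", single_threaded)
print("mismatches with default threads:", default_threads)
```

Since the failure does not occur on every run, a single comparison is not conclusive; repeating the call several times gives a better picture.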
Steps/Code to Reproduce
pytest sklearn/cluster/tests/test_k_means.py -k test_k_means_fit_predict
Expected Results
no test error
Actual Results
error:
=========================================================== test session starts ============================================================
platform linux -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/lesteve/dev/scikit-learn, configfile: setup.cfg
plugins: asyncio-0.15.1, cov-2.12.1
collected 235 items / 203 deselected / 32 selected
sklearn/cluster/tests/test_k_means.py .............................F.. [100%]
================================================================= FAILURES =================================================================
_______________________________________ test_k_means_fit_predict[4-300-0.1-csr_matrix-float32-elkan] _______________________________________
algo = 'elkan', dtype = <class 'numpy.float32'>, constructor = <class 'scipy.sparse.csr.csr_matrix'>, seed = 4, max_iter = 300, tol = 0.1
    @pytest.mark.parametrize("algo", ["full", "elkan"])
    @pytest.mark.parametrize("dtype", [np.float32, np.float64])
    @pytest.mark.parametrize("constructor", [np.asarray, sp.csr_matrix])
    @pytest.mark.parametrize(
        "seed, max_iter, tol",
        [
            (0, 2, 1e-7),  # strict non-convergence
            (1, 2, 1e-1),  # loose non-convergence
            (3, 300, 1e-7),  # strict convergence
            (4, 300, 1e-1),  # loose convergence
        ],
    )
    def test_k_means_fit_predict(algo, dtype, constructor, seed, max_iter, tol):
        # check that fit.predict gives same result as fit_predict
        rng = np.random.RandomState(seed)
        X = make_blobs(n_samples=1000, n_features=10, centers=10, random_state=rng)[
            0
        ].astype(dtype, copy=False)
        X = constructor(X)
        kmeans = KMeans(
            algorithm=algo, n_clusters=10, random_state=seed, tol=tol, max_iter=max_iter
        )
        labels_1 = kmeans.fit(X).predict(X)
        labels_2 = kmeans.fit_predict(X)
>       assert_array_equal(labels_1, labels_2)
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 700 / 1000 (70%)
E Max absolute difference: 7
E Max relative difference: 3.
E x: array([9, 0, 9, 2, 2, 9, 2, 4, 1, 7, 3, 2, 4, 3, 4, 6, 8, 9, 0, 7, 4, 0,
E 3, 7, 6, 6, 3, 7, 3, 8, 5, 9, 0, 5, 0, 3, 5, 3, 0, 5, 5, 4, 3, 8,
E 1, 0, 1, 6, 8, 5, 4, 8, 6, 7, 4, 3, 3, 4, 4, 2, 9, 8, 3, 1, 9, 0,...
E y: array([9, 2, 9, 0, 0, 9, 0, 1, 8, 7, 4, 0, 1, 4, 1, 6, 5, 9, 2, 7, 1, 2,
E 4, 7, 6, 6, 4, 7, 4, 5, 3, 9, 2, 3, 2, 4, 3, 4, 2, 3, 3, 1, 4, 5,
E 8, 2, 8, 6, 5, 3, 1, 5, 6, 7, 1, 4, 4, 1, 1, 0, 9, 5, 4, 8, 9, 2,...
sklearn/cluster/tests/test_k_means.py:355: AssertionError
============================================== 1 failed, 31 passed, 203 deselected in 42.37s ===============================================
Versions
System:
python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32) [GCC 9.3.0]
executable: /home/lesteve/miniconda3/bin/python
machine: Linux-5.4.0-77-generic-x86_64-with-glibc2.31
Python dependencies:
pip: 21.1.3
setuptools: 49.6.0.post20210108
sklearn: 1.1.dev0
numpy: 1.19.5
scipy: 1.7.0
Cython: 0.29.23
pandas: 1.3.0
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True