Description
A fairly common pattern in the scikit-learn code is to create intermediary arrays with dtype=X.dtype. That works well as long as X = check_array(X) was run with a float dtype only, or in other words when X.dtype is not int. When X.dtype is int, check_array(X) with default parameters will pass it through, and all intermediary objects will then be of dtype int. For instance,
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 0], [4, 4],
              [4, 5], [0, 1], [2, 2],
              [3, 2], [5, 5], [1, -1]])

# manually fit on batches
kmeans = MiniBatchKMeans(n_clusters=2,
                         random_state=0,
                         batch_size=6)
kmeans.partial_fit(X[0:6, :])
(taken from the MiniBatchKMeans docstring) will happily run in the integer space, where both sample_weight and cluster_centers_ will be int. Discovered as part of #14307.
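The check_array behavior described above can be seen directly; a minimal sketch, assuming only the public sklearn.utils.check_array API:

import numpy as np
from sklearn.utils import check_array

X_int = np.array([[1, 2], [3, 4]])

# With default parameters (dtype="numeric") an integer array is passed through
# unchanged, so anything later created with dtype=X.dtype stays int as well.
print(check_array(X_int).dtype)                                  # int64
# Explicitly requesting a float dtype forces the cast.
print(check_array(X_int, dtype=[np.float64, np.float32]).dtype)  # float64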
Another point: in linear models, for example, check_array(..., dtype=[np.float64, np.float32]) will only be run if check_input=True (the default). This means that with check_input=False it might try to create a linear model whose coefficients are an array of integers, and unless something fails due to a dtype mismatch the user will never know. I think we should always check that X.dtype is float, even when check_input=False. I believe the point of that flag was to avoid expensive checks, and a dtype check doesn't cost anything.
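A rough sketch of what such an always-on guard could look like (the helper name _check_float_dtype is hypothetical and not existing scikit-learn code; an estimator's fit could call it unconditionally and only skip the full check_array validation when check_input=False):

def _check_float_dtype(X):
    # Only inspects the dtype attribute, so unlike a full check_array pass
    # this costs essentially nothing even for very large inputs.
    if X.dtype.kind != "f":
        raise ValueError(
            "Expected an array of floats, got dtype %r instead. "
            "Cast X to a floating point dtype before fitting." % X.dtype
        )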
In general, any code that uses X.dtype to create intermediary arrays should be evaluated to check whether we are sure the dtype is float (or whether ints would be acceptable).
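To make that concrete, a hedged illustration of the fragile pattern and a safer variant (variable names are made up, not taken from any particular estimator):

import numpy as np

X = np.arange(12).reshape(4, 3)  # int input that slipped past validation

# Fragile: propagating X.dtype directly silently produces an integer buffer.
centers = np.zeros((2, X.shape[1]), dtype=X.dtype)

# Safer: only propagate X.dtype when it is already a float dtype.
work_dtype = X.dtype if X.dtype.kind == "f" else np.float64
centers = np.zeros((2, X.shape[1]), dtype=work_dtype)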
Might be a good sprint issue, not sure.