Use of X.dtype when it is not float

A fairly common pattern in the scikit-learn code is to create intermediary arrays with `dtype=X.dtype`. That works well as long as `X = check_array(X)` was run with a float-only dtype, or in other words, as long as `X.dtype` is not int.

When `X.dtype` is int, `check_array(X)` with default parameters will pass it through unchanged, and then all intermediary objects will be of dtype int.
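A minimal sketch of the problem, assuming a recent scikit-learn and NumPy. The `centers` buffer is a made-up stand-in for an estimator's intermediary array, not actual scikit-learn internals:

```py
import numpy as np
from sklearn.utils import check_array

X = np.array([[1, 2], [4, 5]])       # integer dtype
X_checked = check_array(X)           # default dtype="numeric" keeps ints as ints

# Hypothetical intermediary array allocated with the input dtype
centers = np.zeros(X.shape[1], dtype=X_checked.dtype)
centers[:] = X_checked.mean(axis=0)  # the mean is [2.5, 3.5], silently truncated on assignment

# Asking for a float dtype explicitly converts the integer input
X_float = check_array(X, dtype=[np.float64, np.float32])
```

Here `centers` ends up as an integer array and loses the fractional part of the mean, while `X_float` is converted to `float64`, the first dtype in the list.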

For instance,
```py
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 0], [4, 4],
              [4, 5], [0, 1], [2, 2],
              [3, 2], [5, 5], [1, -1]])
# manually fit on batches
kmeans = MiniBatchKMeans(n_clusters=2,
                         random_state=0,
                         batch_size=6)
kmeans.partial_fit(X[0:6, :])  # X has an integer dtype here
```
(taken from the [MiniBatchKMeans docstring](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html))
will happily run in integer space, where both `sample_weight` and `cluster_centers_` have an integer dtype. This was discovered as part of https://github.com/scikit-learn/scikit-learn/pull/14307

Another point: in linear models, for example, `check_array(..., dtype=[np.float64, np.float32])` is only run if `check_input=True` (the default). This means that with `check_input=False`, the code might try to fit a linear model whose coefficients are an array of integers, and unless something fails due to a dtype mismatch the user will never know. I think we should always check that `X.dtype` is float, even when `check_input=False`. I believe the point of that flag was to avoid expensive checks, and a dtype check costs essentially nothing.
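As a rough illustration of how cheap such a check could be, here is a sketch of a helper that only inspects the dtype, without scanning or copying the data. The name `assert_float_dtype` is hypothetical and not part of scikit-learn:

```py
import numpy as np


def assert_float_dtype(X):
    """Hypothetical helper: reject non-float input without touching the data."""
    # Comparing dtypes does not depend on the array size.
    if X.dtype not in (np.float32, np.float64):
        raise ValueError(
            f"Expected float32 or float64 input, got {X.dtype}."
        )
    return X
```

Whether to raise or to convert to float in that situation is a design choice this sketch does not settle. Raising keeps `check_input=False` free of hidden copies.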

In general any code that uses `X.dtype` to create intermediary arrays should be evaluated as to whether we are sure the dtype is float (or that ints would be acceptable).
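One possible shape for such a fix, as a sketch only: pick the dtype for intermediary arrays through a small helper instead of using `X.dtype` directly. The helper name `intermediary_dtype` and the `float64` fallback are assumptions for illustration:

```py
import numpy as np


def intermediary_dtype(X):
    """Hypothetical helper: dtype to use for arrays derived from X."""
    # Keep float32/float64 inputs as they are; fall back to float64 otherwise.
    if X.dtype in (np.float32, np.float64):
        return X.dtype
    return np.dtype(np.float64)


X_int = np.array([[1, 2], [4, 5]])
buf = np.zeros(X_int.shape[1], dtype=intermediary_dtype(X_int))
buf[:] = X_int.mean(axis=0)  # stored as floats, nothing is truncated
```

This preserves `float32` inputs, so it would not undo any memory savings from working in single precision.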

Might be a good sprint issue, not sure.
