Use of X.dtype when it is not float #14314

Open
@rth

A fairly common pattern in the scikit-learn code is to create intermediary arrays with dtype=X.dtype. That works well as long as X = check_array(X) was run with a float dtype only, in other words as long as X.dtype cannot be int.

When X.dtype is int, check_array(X) with default parameters will pass it through unchanged, and all intermediary objects will then be of dtype int.

For instance,

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 0], [4, 4],
              [4, 5], [0, 1], [2, 2],
              [3, 2], [5, 5], [1, -1]])
# manually fit on batches
kmeans = MiniBatchKMeans(n_clusters=2,
                         random_state=0,
                         batch_size=6)
kmeans.partial_fit(X[0:6,:])

(taken from the MiniBatchKMeans docstring) will happily run in the integer space, where both sample_weight and cluster_centers_ will be int. Discovered as part of #14307.
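To illustrate why integer intermediary arrays are a problem, here is a minimal NumPy-only sketch. It is not the MiniBatchKMeans code path; the variable names are made up for illustration. It only shows that a float result written into an array allocated with an integer dtype is truncated without any warning.

```python
import numpy as np

X = np.array([[1, 2], [2, 5]])  # integer dtype, like the docstring example

# Intermediary array allocated with the input's dtype (the pattern in question).
centers_int = np.zeros(X.shape[1], dtype=X.dtype)
centers_int[:] = X.mean(axis=0)  # float means are silently truncated on assignment

# Same computation with an explicit float dtype.
centers_float = np.zeros(X.shape[1], dtype=np.float64)
centers_float[:] = X.mean(axis=0)

print(centers_int)    # [1 3]
print(centers_float)  # [1.5 3.5]
```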

Another point is that, e.g. in linear models, check_array(..., dtype=[np.float64, np.float32]) will only be run if check_input=True (the default). This means that with check_input=False, it might try to create a linear model whose coefficients are an array of integers, and unless something fails due to a dtype mismatch the user will never know. I think we should always check that X.dtype is float, even when check_input=False. I believe the point of that flag was to avoid expensive checks, and this check doesn't cost anything.
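A rough sketch of what such an always-on check could look like. The helper name ensure_float_dtype is hypothetical and not part of scikit-learn. The point is that it only inspects X.dtype, so it does not scan or copy the data.

```python
import numpy as np

# Accepted dtypes, mirroring dtype=[np.float64, np.float32] from above.
FLOAT_DTYPES = (np.dtype(np.float64), np.dtype(np.float32))


def ensure_float_dtype(X):
    """Hypothetical helper: constant-time dtype check, no data validation."""
    if X.dtype not in FLOAT_DTYPES:
        raise ValueError(
            f"Expected a float32 or float64 array, got dtype={X.dtype}."
        )
    return X


ensure_float_dtype(np.zeros((2, 2), dtype=np.float32))  # passes
try:
    ensure_float_dtype(np.array([[1, 2], [3, 4]]))  # integer input
except ValueError as exc:
    print(exc)
```

Whether to raise or to cast in this situation is a design choice; raising keeps check_input=False free of hidden copies.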

In general, any code that uses X.dtype to create intermediary arrays should be reviewed to confirm that the dtype is guaranteed to be float (or that ints would be acceptable).
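One possible way to make such code robust, sketched under the assumption that falling back to float64 is acceptable. The helper intermediate_dtype is hypothetical: it keeps float32 and float64 inputs as they are and uses float64 for everything else.

```python
import numpy as np


def intermediate_dtype(X):
    """Hypothetical helper: dtype to use for arrays derived from X."""
    if X.dtype in (np.dtype(np.float32), np.dtype(np.float64)):
        return X.dtype  # preserve the caller's float precision
    return np.dtype(np.float64)  # ints, bools, etc. fall back to float64


X_int = np.array([[1, 2], [3, 4]])
X_f32 = X_int.astype(np.float32)

buf_int = np.zeros(X_int.shape[1], dtype=intermediate_dtype(X_int))
buf_f32 = np.zeros(X_f32.shape[1], dtype=intermediate_dtype(X_f32))
print(buf_int.dtype, buf_f32.dtype)  # float64 float32
```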

Might be a good sprint issue, not sure.
