
Use of X.dtype when it is not float #14314


Open

rth opened this issue Jul 12, 2019 · 4 comments

@rth
Member

rth commented Jul 12, 2019

A fairly common pattern in the scikit-learn code is to create intermediary arrays with dtype=X.dtype. That works well as long as X = check_array(X) was run with a float dtype only, or in other words as long as X.dtype is not int.

When X.dtype is int, check_array(X) with default parameters will pass it through, and then all intermediary objects will be of dtype int.

For instance,

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 0], [4, 4],
              [4, 5], [0, 1], [2, 2],
              [3, 2], [5, 5], [1, -1]])
# manually fit on batches
kmeans = MiniBatchKMeans(n_clusters=2,
                         random_state=0,
                         batch_size=6)
kmeans.partial_fit(X[0:6, :])
```
(taken from the MiniBatchKMeans docstring)
will happily run in the integer space, where both sample_weight and cluster_centers_ will be int. Discovered as part of #14307
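The truncation is easy to see with plain numpy; here is a minimal sketch of the dtype=X.dtype pattern (my own illustration, not MiniBatchKMeans's actual centroid update):

```python
import numpy as np

# Integer input that check_array with default parameters would pass through
X = np.array([[1, 2], [4, 5], [2, 2]])

# The pattern under discussion: an intermediary buffer with dtype=X.dtype
centroid = np.zeros(X.shape[1], dtype=X.dtype)   # int, not float64
centroid += X.sum(axis=0)                        # accumulation stays integer
centroid //= len(X)                              # truncates: the true mean of
                                                 # the first column is 7/3 ≈ 2.33
```

After this, centroid is [2, 3] with an integer dtype, whereas a float computation would give [2.33…, 3.0].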

Another point: in e.g. linear models, check_array(..., dtype=[np.float64, np.float32]) will only be run if check_input=True (the default). Meaning that with check_input=False it might try to create a linear model whose coefficients are an array of integers, and unless something fails due to a dtype mismatch the user will never know. I think we should always check that X.dtype is float, even when check_input=False. I believe the point of that flag was to avoid expensive checks, and this one doesn't cost anything.
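A sketch of how that can surface (hypothetical update code, not the actual linear-model internals): when validation is skipped, the coefficient buffer inherits the integer dtype and a float in-place update fails with an opaque cast error rather than a clear validation message.

```python
import numpy as np

# Hypothetical sketch: with check_input=False, integer X is never validated
X = np.array([[1, 2], [3, 4]])
coef = np.zeros(X.shape[1], dtype=X.dtype)   # int instead of float64

grad = np.array([0.5, -0.25])                # a float64 gradient step
failed = False
try:
    coef -= 0.1 * grad                        # in-place float-into-int update
except TypeError:                             # numpy rejects the unsafe cast
    failed = True
```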

In general any code that uses X.dtype to create intermediary arrays should be evaluated as to whether we are sure the dtype is float (or that ints would be acceptable).

Might be a good sprint issue, not sure.

@rth rth added the Bug label Jul 12, 2019
@rth rth changed the title Use of X.dtype when X is not float Use of X.dtype when it is not float Jul 12, 2019
@adrinjalali
Member

So should we do float32 if it's int32, and float64 when it's int64?

@rth
Member Author

rth commented Jul 13, 2019

I think it should be the default dtype (i.e. float64 or float) if X is not float. There is no difference in precision between int32 and int64.

@amueller
Member

> In general any code that uses X.dtype to create intermediary arrays should be evaluated as to whether we are sure the dtype is float (or that ints would be acceptable).

Is that really that common? This was an issue in very early versions in StandardScaler and KMeans and we fixed it (by casting to float).

Btw, right now int32 is cast to float64 if you pass FLOAT_DTYPES, as discussed in #15093 and I think we should fix that.

Why are you saying there's no difference in precision between int32 and int64?

@jnothman
Member

jnothman commented Oct 1, 2019

> Why are you saying there's no difference in precision between int32 and int64?

I presume because the mantissa is 32 bits
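For reference, float64 has a 53-bit significand, so every int32 value converts exactly while large int64 values do not (a small illustration, my own addition):

```python
import numpy as np

i32 = np.int32(2**31 - 1)                     # largest int32
exact32 = int(np.float64(i32)) == int(i32)    # fits in 53 bits, so exact

i64 = np.int64(2**53 + 1)                     # first integer float64 cannot hold
exact64 = int(np.float64(i64)) == int(i64)    # rounds to 2**53, so not exact
```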
