Use of X.dtype when it is not float #14314
Comments
So should we do
I think it should be the default dtype (i.e. float64 or float) if
Is that really that common? This was an issue in very early versions in StandardScaler and KMeans and we fixed it (by casting to float). Btw, right now int32 is cast to float64 if you pass FLOAT_DTYPES, as discussed in #15093 and I think we should fix that. Why are you saying there's no difference in precision between int32 and int64?
I presume because the mantissa is 32 bits
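For reference on the precision question raised in the comments, here is a small NumPy-only sketch of what survives an integer-to-float cast. It is not taken from the thread; the variable names are illustrative, and it only relies on the IEEE 754 significand widths (24 bits for float32, 53 bits for float64).

```python
import numpy as np

# Largest int32 value, and an int64 value just above 2**53.
i32_max = np.iinfo(np.int32).max      # 2**31 - 1
big_i64 = np.int64(2**53 + 1)

# float64 has a 53-bit significand, so every int32 value round-trips exactly.
int32_exact_in_float64 = int(np.float64(i32_max)) == int(i32_max)

# float32 has a 24-bit significand, so large int32 values get rounded.
int32_exact_in_float32 = int(np.float32(i32_max)) == int(i32_max)

# int64 values above 2**53 can be rounded even when cast to float64.
int64_exact_in_float64 = int(np.float64(big_i64)) == int(big_i64)
```

Under these assumptions, casting int32 to float64 is lossless, while int32 to float32 and int64 to float64 can both lose low-order bits.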
A fairly common pattern in the scikit-learn code is to create intermediary arrays with `dtype=X.dtype`. That works well as long as `X = check_array(X)` was run with a float dtype only, or in other words when `X.dtype` is not int.

When `X.dtype` is int, `check_array(X)` with default parameters will pass it through, and then all intermediary objects will be of dtype int.

For instance, the example taken from the MiniBatchKMeans docstring will happily run in the integer space, where both `sample_weight` and `cluster_centroid_` will be `int`. Discovered as part of #14307.

Another point: in linear models, `check_array(..., dtype=[np.float64, np.float32])` will only be run if `check_input=True` (the default). This means that when `check_input=False`, it might try to create a linear model whose coefficients are an array of integers, and unless something fails due to a dtype mismatch the user will never know. I think we should always check that `X.dtype` is float, even when `check_input=False`. I believe the point of that flag was to avoid expensive checks, and this doesn't cost anything.

In general, any code that uses `X.dtype` to create intermediary arrays should be evaluated as to whether we are sure the dtype is float (or that ints would be acceptable).

Might be a good sprint issue, not sure.
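The failure mode described above can be illustrated with a minimal NumPy-only sketch. Both `weighted_mean_like_x` and `ensure_float` are hypothetical helpers written for this illustration, not scikit-learn functions; the first mimics an estimator that allocates its output with `dtype=X.dtype`, the second is one possible cheap guard that only inspects the dtype.

```python
import numpy as np


def weighted_mean_like_x(X, weights):
    """Hypothetical helper that allocates its result with dtype=X.dtype."""
    out = np.zeros(X.shape[1], dtype=X.dtype)
    # Writing float values into an integer array silently truncates them.
    out[:] = np.average(X, axis=0, weights=weights)
    return out


def ensure_float(X):
    """Hypothetical cheap guard: look at the dtype only, cast if not float."""
    X = np.asarray(X)
    if X.dtype.kind != "f":
        X = X.astype(np.float64)
    return X


X_int = np.array([[1, 2], [1, 4], [2, 2]])  # integer dtype by default
w = np.ones(3)

# Integer input: the column means 4/3 and 8/3 are truncated to 1 and 2.
int_result = weighted_mean_like_x(X_int, w)

# Same computation after the dtype guard: the means are kept as floats.
float_result = weighted_mean_like_x(ensure_float(X_int), w)
```

The guard reads `X.dtype` without scanning the data, which is in line with the argument that such a check would not add meaningful cost even when `check_input=False`; the cast itself is only paid for non-float input.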