Describe the bug
I use Pandas to load data from CSV and transform it.
Pandas often parses integer columns as float, so I usually call df = df.convert_dtypes() to bring those columns back to int.
This appears to convert all integer columns to the nullable Int64 dtype.
When I try to train Scikit-Learn models such as LogisticRegression on such data, I get ValueError: Unknown label type: 'unknown'.
There seems to have been some effort to prevent this issue already; see https://github.com/scikit-learn/scikit-learn/blame/bb080aa690364d84d11232c73dc8db2f0dde3578/sklearn/utils/validation.py#L796
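To make the dtype change concrete, here is a minimal sketch that prints the column dtypes before and after convert_dtypes(). The variable name converted is only for illustration, and the exact dtypes printed may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Plain NumPy-backed dtypes before the conversion
print(df.dtypes)

# convert_dtypes() switches columns to pandas extension dtypes
# (nullable Int64 for the integer column with recent pandas versions)
converted = df.convert_dtypes()
print(converted.dtypes)
```

The capital "I" in Int64 marks the pandas extension dtype, which is distinct from NumPy's int64.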
Steps/Code to Reproduce
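import pandas
# Import the submodule explicitly; a bare "import sklearn" may not load it
import sklearn.linear_model

df = pandas.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
# convert_dtypes() switches the columns to pandas extension dtypes (e.g. Int64)
df = df.convert_dtypes()
model = sklearn.linear_model.LogisticRegression()
# With sklearn 1.1.3 and pandas 1.4.3 this raises ValueError: Unknown label type: 'unknown'
model.fit(
    X=df.drop(columns="class"),
    y=df["class"],
)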
Expected Results
I expect the model to be trained.
Actual Results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [40], line 10
8 df2.info()
9 model = sklearn.linear_model.LogisticRegression()
---> 10 model.fit(
11 X=df2.drop(columns="class"),
12 y=df2["class"],
13 )
File /opt/conda/envs/python3.9/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1146, in LogisticRegression.fit(self, X, y, sample_weight)
1136 _dtype = [np.float64, np.float32]
1138 X, y = self._validate_data(
1139 X,
1140 y,
(...)
1144 accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
1145 )
-> 1146 check_classification_targets(y)
1147 self.classes_ = np.unique(y)
1149 multi_class = _check_multi_class(self.multi_class, solver, len(self.classes_))
File /opt/conda/envs/python3.9/lib/python3.9/site-packages/sklearn/utils/multiclass.py:200, in check_classification_targets(y)
192 y_type = type_of_target(y, input_name="y")
193 if y_type not in [
194 "binary",
195 "multiclass",
(...)
198 "multilabel-sequences",
199 ]:
--> 200 raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'
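A possible workaround, sketched under the assumption that the data contains no missing values: cast the extension-dtype columns back to NumPy dtypes before calling fit. This is not a fix in scikit-learn itself, and the names X and y are only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
df = df.convert_dtypes()

# Cast back to NumPy dtypes; astype("int64") would fail if the column held pd.NA
X = df.drop(columns="class").astype("float64")
y = df["class"].astype("int64")

model = LogisticRegression()
model.fit(X, y)
print(model.classes_)
```

If the target column can contain missing values, those rows would need to be dropped or imputed before the cast.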
Versions
System:
    python: 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39) [GCC 9.3.0]
    executable: /opt/conda/envs/python3.9/bin/python3.9
    machine: Linux-4.19.0-10-cloud-amd64-x86_64-with-glibc2.28

Python dependencies:
    sklearn: 1.1.3
    pip: 22.3
    setuptools: 65.5.0
    numpy: 1.21.5
    scipy: 1.7.3
    Cython: 0.29.32
    pandas: 1.4.3
    matplotlib: 3.6.1
    joblib: 1.2.0
    threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /opt/conda/envs/python3.9/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
    version: 0.3.17
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 4

    user_api: openmp
    internal_api: openmp
    prefix: libgomp
    filepath: /opt/conda/envs/python3.9/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
    version: None
    num_threads: 4

    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /opt/conda/envs/python3.9/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
    version: 0.3.17
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 4