ValueError: "Unknown label type: 'unknown'" when class column has Pandas type like Int64 #25073

Closed
@Ark-kun

Describe the bug

I use Pandas to load data from CSV and transform it.
Pandas often parses integer columns as float, so I usually call df = df.convert_dtypes() to bring those columns back to int.
It looks like this makes Pandas convert all integer columns to the nullable Int64 dtype.
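To illustrate the dtype change described above, here is a minimal sketch. The CSV content and the column names `count` and `name` are made up for the example and are not from my real data; the assumption is that a missing value in an integer column is what makes Pandas fall back to float.

```python
import io

import pandas as pd

# Hypothetical CSV: the "count" column holds integers but has one missing value.
csv_text = "count,name\n1,a\n,b\n3,c\n"

raw = pd.read_csv(io.StringIO(csv_text))
# The missing value forces the column to float64 (NaN cannot live in int64).
print(raw.dtypes["count"])

converted = raw.convert_dtypes()
# convert_dtypes() switches the column to the nullable extension dtype Int64,
# keeping the missing entry as pd.NA.
print(converted.dtypes["count"])
```

Note that Int64 (capital I) is a Pandas extension dtype, not the NumPy int64 dtype.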

When I try to train some Scikit-Learn models like LogisticRegression on such data, I get the error ValueError: Unknown label type: 'unknown'.

I can see that there was already some effort to prevent this issue: https://github.com/scikit-learn/scikit-learn/blame/bb080aa690364d84d11232c73dc8db2f0dde3578/sklearn/utils/validation.py#L796

Steps/Code to Reproduce

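import pandas
# Import the submodule explicitly; a bare "import sklearn" may not expose it.
import sklearn.linear_model

# Plain NumPy-backed columns: int64 for "class", float64 for "feature_1".
df = pandas.DataFrame({"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]})
# convert_dtypes() switches them to the nullable extension dtypes (Int64, Float64).
df = df.convert_dtypes()
model = sklearn.linear_model.LogisticRegression()
# With the versions listed below, this call raises the ValueError.
model.fit(
    X=df.drop(columns="class"),
    y=df["class"],
)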

Expected Results

I expect the model to be trained.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [40], line 10
      8 df2.info()
      9 model = sklearn.linear_model.LogisticRegression()
---> 10 model.fit(
     11     X=df2.drop(columns="class"),
     12     y=df2["class"],
     13 )

File /opt/conda/envs/python3.9/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:1146, in LogisticRegression.fit(self, X, y, sample_weight)
   1136     _dtype = [np.float64, np.float32]
   1138 X, y = self._validate_data(
   1139     X,
   1140     y,
   (...)
   1144     accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
   1145 )
-> 1146 check_classification_targets(y)
   1147 self.classes_ = np.unique(y)
   1149 multi_class = _check_multi_class(self.multi_class, solver, len(self.classes_))

File /opt/conda/envs/python3.9/lib/python3.9/site-packages/sklearn/utils/multiclass.py:200, in check_classification_targets(y)
    192 y_type = type_of_target(y, input_name="y")
    193 if y_type not in [
    194     "binary",
    195     "multiclass",
   (...)
    198     "multilabel-sequences",
    199 ]:
--> 200     raise ValueError("Unknown label type: %r" % y_type)

ValueError: Unknown label type: 'unknown'
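A possible workaround, sketched here under the assumption that the target column has no missing values, is to cast the nullable columns back to plain NumPy dtypes before calling fit. Casting X as well is only a precaution and may not be needed. This is not a fix for the underlying problem, just a way to get the model trained.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame(
    {"class": [0, 1, 0, 1, 1], "feature_1": [0.1, 0.2, 0.3, 0.4, 0.5]}
).convert_dtypes()

# Cast the nullable extension dtypes back to NumPy dtypes.
# astype("int64") raises if the column contains pd.NA, so this only
# works when the target has no missing values (an assumption here).
X = df.drop(columns="class").astype("float64")
y = df["class"].astype("int64")

model = LogisticRegression()
model.fit(X, y)
print(model.classes_)
```

If the target can contain missing values, those rows would have to be dropped or imputed first, since a classifier cannot train on an undefined label anyway.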

Versions

System:
    python: 3.9.0 | packaged by conda-forge | (default, Nov 26 2020, 07:57:39)  [GCC 9.3.0]
executable: /opt/conda/envs/python3.9/bin/python3.9
   machine: Linux-4.19.0-10-cloud-amd64-x86_64-with-glibc2.28

Python dependencies:
      sklearn: 1.1.3
          pip: 22.3
   setuptools: 65.5.0
        numpy: 1.21.5
        scipy: 1.7.3
       Cython: 0.29.32
       pandas: 1.4.3
   matplotlib: 3.6.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /opt/conda/envs/python3.9/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 4

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /opt/conda/envs/python3.9/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 4

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /opt/conda/envs/python3.9/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 4
