type_of_target misclassifies count/ordinal regression targets as multiclass #31752

Open
@MenaWANG

Describe the bug

Description:

type_of_target in scikit-learn currently classifies any 1D array of integer-like values with more than two unique values as 'multiclass', even when the target is really a count or ordinal quantity better suited to regression (e.g., number of claims or count of participants). This can mislead users into applying classification models to regression problems.

Suggestion:

  • Add logic (or at least a warning) to detect when the values are numeric and ordered and the number of unique values is very high (e.g., more than 100, or a large fraction of the sample size). In such cases, suggest that the target may call for regression or count regression rather than multiclass classification.
  • Provide more informative warnings or guidance in the docstring and output, so that users are less likely to misuse classifiers on regression targets.

This would help prevent confusion and guide users toward the correct modelling approach.
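
To make the suggestion concrete, here is a minimal sketch of the kind of check I have in mind. It is not scikit-learn code: the function name looks_like_regression_target and both thresholds (max_unique, max_unique_fraction) are hypothetical and only illustrate the "more than 100 unique values or a large fraction of the sample size" idea above.

```python
import warnings

import numpy as np


def looks_like_regression_target(y, max_unique=100, max_unique_fraction=0.5):
    """Hypothetical heuristic, NOT part of scikit-learn.

    Returns True (and emits a UserWarning) when a 1D numeric target holds
    integer-like values with so many distinct values that a multiclass
    reading is unlikely. Both thresholds are illustrative assumptions.
    """
    y = np.asarray(y)
    is_int = np.issubdtype(y.dtype, np.integer)
    is_float = np.issubdtype(y.dtype, np.floating)

    # Only non-empty 1D integer or float arrays are considered.
    if y.ndim != 1 or y.size == 0 or not (is_int or is_float):
        return False

    # Floats with a fractional part (or NaN) are not integer-like.
    if is_float and not np.all(y == np.floor(y)):
        return False

    n_unique = np.unique(y).size
    # Note: the fraction test can fire on very small arrays.
    if n_unique > max_unique or n_unique > max_unique_fraction * y.size:
        warnings.warn(
            f"The target has integer-like values with {n_unique} unique "
            "values. This may be a count or ordinal regression problem "
            "rather than a multiclass classification problem.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

A check like this could run before (or inside) type_of_target without changing its return value, so existing code that relies on 'multiclass' would keep working and only gain a warning.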

Note:
I am willing to work on this improvement and can help with a PR if the maintainers agree this is a useful direction.

Steps/Code to Reproduce

To reproduce the issue:

import numpy as np
from sklearn.utils.multiclass import type_of_target

# 1. Generate integer data (simulating counts or ordinal data)
y_int_like = np.random.randint(low=0, high=5000, size=10000)

# 2. Cast the data to float64
y = y_int_like.astype(np.float64)

# 3. Print verification details 
print("Data type:", y.dtype)
print("Unique values count:", len(np.unique(y)))
print("Are they integer-like?", np.all(y == y.astype(int)))
print("Min value:", np.min(y))
print("Max value:", np.max(y))

# 4. Run type_of_target to demonstrate the issue
print("\nResult of type_of_target:")
target_type = type_of_target(y)
print(target_type)

Expected Results

UserWarning: The input data contains integer-like values with 4331 unique values. This may be a count or ordinal regression problem rather than a multiclass classification problem. Consider explicitly converting your target variable to a continuous float array if it represents a continuous quantity.

Actual Results

Output from the code above

Data type: float64
Unique values count: 4331
Are they integer-like? True
Min value: 0.0
Max value: 4999.0

Result of type_of_target:
multiclass

Versions

sklearn: 1.6.1
