Describe the bug
Description:
The current implementation of type_of_target in scikit-learn classifies any 1D array of integer-like values with more than two unique values as 'multiclass', even when the target is actually a count or ordinal regression target (e.g., number of claims, count of participants, etc.). This can mislead users into using classification models for regression problems.
Suggestion:
- Add logic (or at least a warning) to detect when the number of unique values is very high (e.g., >100 or a large fraction of the sample size) and the values are numeric and integer-like. In such cases, suggest that the user may be dealing with a regression or count regression problem rather than multiclass classification.
- Provide more informative warnings or guidance in the docstring and output, so users are less likely to misuse classifiers for regression targets.
This would help prevent confusion and guide users toward the correct modelling approach.
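To make the first suggestion concrete, here is a minimal sketch of the kind of check I have in mind. The helper name `looks_like_count_target` and the thresholds `max_unique=100` and `max_fraction=0.5` are hypothetical placeholders for discussion; none of this exists in scikit-learn today, and the real check would presumably live inside type_of_target.

```python
import warnings

import numpy as np


def looks_like_count_target(y, max_unique=100, max_fraction=0.5):
    """Hypothetical helper (not part of scikit-learn).

    Returns True and emits a UserWarning when a 1D integer-like target has
    so many unique values that it may be a regression target.
    """
    y = np.asarray(y)
    # Only consider non-empty 1D arrays of signed/unsigned ints or floats.
    if y.ndim != 1 or y.size == 0 or y.dtype.kind not in "iuf":
        return False
    # Floats must be finite and hold whole numbers to count as integer-like.
    if y.dtype.kind == "f":
        if not (np.all(np.isfinite(y)) and np.all(y == np.floor(y))):
            return False
    n_unique = int(np.unique(y).size)
    # Thresholds are assumptions taken from the suggestion above.
    if n_unique > max_unique or n_unique > max_fraction * y.size:
        warnings.warn(
            f"The input data represents integer-like values with {n_unique} "
            "unique values. This may be a count or ordinal regression "
            "problem, and not a multiclass classification.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

The fraction test would also fire on very small arrays (for example three samples with three distinct labels), so the actual thresholds would need to be agreed on with the maintainers.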
Note:
I am willing to work on this improvement and can help with a PR if the maintainers agree this is a useful direction.
Steps/Code to Reproduce
To reproduce the issue:
import numpy as np
from sklearn.utils.multiclass import type_of_target
# 1. Generate integer data (simulating counts or ordinal data)
y_int_like = np.random.randint(low=0, high=5000, size=10000)
# 2. Cast the data to float64
y = y_int_like.astype(np.float64)
# 3. Print verification details
print("Data type:", y.dtype)
print("Unique values count:", len(np.unique(y)))
print("Are they integer-like?", np.all(y == y.astype(int)))
print("Min value:", np.min(y))
print("Max value:", np.max(y))
# 4. Run type_of_target to demonstrate the issue
print("\nResult of type_of_target:")
target_type = type_of_target(y)
print(target_type)
Expected Results
UserWarning: The input data represents integer-like values with 4331 unique values. This may be a count or ordinal regression problem, and not a multiclass classification. Consider explicitly converting your target variable to a continuous float array if it represents a continuous quantity.
Actual Results
Output from the code above
Data type: float64
Unique values count: 4331
Are they integer-like? True
Min value: 0.0
Max value: 4999.0
Result of type_of_target:
multiclass
Versions
sklearn: 1.6.1