Exception in LogisticRegressionCV #28178

Open
sedol1339 opened this issue Jan 18, 2024 · 6 comments

sedol1339 commented Jan 18, 2024

Describe the bug

The code below raises a ValueError. I suspect the problem is that minority classes may be absent from the train or validation set of some folds during the internal cross-validation, even with a stratified split. This produces errors with some metrics other than the default (accuracy).

One solution might be to set the log-probability to -inf for classes not present in the train set, together with passing the labels argument to the metric. What is the simplest way to fix this?

Steps/Code to Reproduce

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
X = np.zeros((10, 1))
y = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3]
logreg = LogisticRegressionCV(cv=5, scoring='neg_log_loss')
logreg.fit(X, y)

Expected Results

No exception thrown

Actual Results

ValueError: y_true and y_pred contain different number of classes 2, 3. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1]

Versions

System:
    python: 3.10.8 (main, Nov  2 2023, 15:57:09) [GCC 9.4.0]
executable: /data/osedukhin/tabular-models/venv/bin/python
   machine: Linux-5.4.0-123-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.0
          pip: 22.2.2
   setuptools: 63.2.0
        numpy: 1.26.2
        scipy: 1.11.3
       Cython: None
       pandas: 2.1.3
   matplotlib: 3.8.1
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 32
         prefix: libopenblas
       filepath: /data/osedukhin/tabular-models/venv/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: openmp
   internal_api: openmp
    num_threads: 32
         prefix: libgomp
       filepath: /data/osedukhin/tabular-models/venv/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 32
         prefix: libopenblas
       filepath: /data/osedukhin/tabular-models/venv/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
@sedol1339 added the Bug and Needs Triage labels Jan 18, 2024
@sedol1339 (Author)

I found the same issue reported some years ago: #15389


sedol1339 commented Jan 18, 2024

@glemaitre mentioned that this can be fixed with:

from sklearn.metrics import log_loss, make_scorer

# labels: the full list of class labels present in y
scoring = make_scorer(
    log_loss, greater_is_better=False, labels=labels, needs_proba=True
)

Does the order of labels matter? For example, if y = [2, 2, 1, 1, 0, 0], will both labels=[0, 1, 2] and labels=[2, 1, 0] work correctly?

@sedol1339 (Author)

log_loss documentation says that "The labels in y_pred are assumed to be ordered alphabetically, as done by LabelBinarizer". Does this mean lexicographical sorting, even for integer labels? For example, is [0, 10, 9] a correct sorting?
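If I understand correctly, LabelBinarizer sorts labels with np.unique, so integer labels are ordered numerically ([0, 9, 10], not the lexicographic [0, 10, 9]), and the order in which labels is passed should not matter. A quick check (a sketch, assuming this internal sorting behavior):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [2, 2, 1, 1, 0, 0]
# Columns are assumed to correspond to the *sorted* class labels [0, 1, 2]
y_pred = np.array(
    [[0.1, 0.2, 0.7]] * 2 + [[0.2, 0.7, 0.1]] * 2 + [[0.7, 0.2, 0.1]] * 2
)

# labels is sorted internally, so the order it is passed in is irrelevant
a = log_loss(y_true, y_pred, labels=[0, 1, 2])
b = log_loss(y_true, y_pred, labels=[2, 1, 0])
assert np.isclose(a, b)
```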

@glemaitre (Member)

The issue is not the log_loss but rather the StratifiedKFold. Basically, with your provided y, we get the following splits:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((10, 1))
y = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3])
cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    print(y[train_idx])
    print(y[test_idx])
/Users/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 2]
[1 3]

The first 4 splits contain all the classes at fit time, so those LogisticRegression models know that the classes are [1, 2, 3]. In the last split, the only sample of class 3 is kept for the test set. That model is therefore trained on only two classes, and its predict_proba output is an (n_samples_test, 2) matrix instead of (n_samples_test, 3).
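The shape mismatch can be reproduced directly by fitting on y with and without the rare class (a minimal sketch; the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.zeros((8, 1))
y_full = [1, 1, 1, 2, 2, 2, 3, 3]     # all three classes present
y_missing = [1, 1, 1, 1, 2, 2, 2, 2]  # class 3 absent, as in the last fold

proba_full = LogisticRegression().fit(X, y_full).predict_proba(X[:1])
proba_missing = LogisticRegression().fit(X, y_missing).predict_proba(X[:1])
print(proba_full.shape)     # (1, 3)
print(proba_missing.shape)  # (1, 2)
```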

So I am not sure that we can do much here because this is a really ill-posed problem.

@glemaitre removed the Needs Triage label Jan 18, 2024
@sedol1339 (Author)

@glemaitre is it possible to append zero probabilities to the output for classes not present in the train set?

@glemaitre (Member)

I don't see an easy way to do so without it being a hack.
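For the record, one such hack could pad the probability matrix with zero columns for unseen classes (an illustrative sketch; padded_predict_proba is a made-up helper, not a scikit-learn API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def padded_predict_proba(est, X, all_classes):
    """Return probabilities with one column per label in all_classes,
    filling zeros for classes the estimator never saw at fit time."""
    all_classes = np.sort(np.asarray(all_classes))
    proba = np.zeros((X.shape[0], len(all_classes)))
    # Map each fitted class to its column index in the full, sorted class set
    cols = np.searchsorted(all_classes, est.classes_)
    proba[:, cols] = est.predict_proba(X)
    return proba

# Fit without class 3, then pad so the output matches (n_samples, 3)
est = LogisticRegression().fit(np.zeros((4, 1)), [1, 1, 2, 2])
proba = padded_predict_proba(est, np.zeros((2, 1)), all_classes=[1, 2, 3])
print(proba.shape)  # (2, 3); the column for class 3 is all zeros
```

Note that zero probabilities would still make log_loss rely on its internal clipping, so this only sidesteps the shape mismatch, not the ill-posedness @glemaitre mentions.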
