Exception in LogisticRegressionCV #28178

Open
sedol1339 opened this issue Jan 18, 2024 · 6 comments

sedol1339 commented Jan 18, 2024

Describe the bug

The code below raises a ValueError. I suspect the problem is that minority classes may be absent from the train or validation set of some folds during the internal cross-validation, even with a stratified split. This produces errors with some metrics other than the default (accuracy).

One solution might be to set the log-probability to -inf for classes not present in the train set, together with passing the labels argument to the metric. What is the simplest way to fix this?

Steps/Code to Reproduce

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
X = np.zeros((10, 1))
y = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3]
logreg = LogisticRegressionCV(cv=5, scoring='neg_log_loss')
logreg.fit(X, y)

Expected Results

No exception thrown

Actual Results

ValueError: y_true and y_pred contain different number of classes 2, 3. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1]

Versions

System:
    python: 3.10.8 (main, Nov  2 2023, 15:57:09) [GCC 9.4.0]
executable: /data/osedukhin/tabular-models/venv/bin/python
   machine: Linux-5.4.0-123-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.0
          pip: 22.2.2
   setuptools: 63.2.0
        numpy: 1.26.2
        scipy: 1.11.3
       Cython: None
       pandas: 2.1.3
   matplotlib: 3.8.1
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 32
         prefix: libopenblas
       filepath: /data/osedukhin/tabular-models/venv/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: openmp
   internal_api: openmp
    num_threads: 32
         prefix: libgomp
       filepath: /data/osedukhin/tabular-models/venv/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 32
         prefix: libopenblas
       filepath: /data/osedukhin/tabular-models/venv/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
@sedol1339 added the Bug and Needs Triage labels Jan 18, 2024
@sedol1339 (Author)

I found the same issue reported some years ago: #15389


sedol1339 commented Jan 18, 2024

@glemaitre mentioned that this can be fixed with:

from sklearn.metrics import log_loss, make_scorer

# labels: the full list of class labels present in y
scoring = make_scorer(
    log_loss, greater_is_better=False, labels=labels, needs_proba=True
)

Does the order of labels matter? For example, if y = [2, 2, 1, 1, 0, 0], will both labels=[0, 1, 2] and labels=[2, 1, 0] work correctly?

@sedol1339 (Author)

log_loss documentation says that "The labels in y_pred are assumed to be ordered alphabetically, as done by LabelBinarizer". Does this mean lexicographical sorting, even for integer labels? For example, is [0, 10, 9] a correct sorting?
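If I understand correctly, LabelBinarizer sorts labels with np.unique, so integer labels are ordered numerically ([0, 9, 10], not the lexicographic [0, 10, 9]), and the order in which labels is passed should not matter. A quick check (a sketch, assuming this internal sorting behavior):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [2, 2, 1, 1, 0, 0]
# Columns are assumed to correspond to the *sorted* class labels [0, 1, 2]
y_pred = np.array(
    [[0.1, 0.2, 0.7]] * 2 + [[0.2, 0.7, 0.1]] * 2 + [[0.7, 0.2, 0.1]] * 2
)

# labels is sorted internally, so the order it is passed in is irrelevant
a = log_loss(y_true, y_pred, labels=[0, 1, 2])
b = log_loss(y_true, y_pred, labels=[2, 1, 0])
assert np.isclose(a, b)
```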

@glemaitre (Member)

The issue is not the log_loss but rather the StratifiedKFold. Basically, with your provided y, we get the following splits:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((10, 1))
y = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3])
cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    print(y[train_idx])
    print(y[test_idx])
/Users/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 3]
[1 2]

[1 1 1 1 2 2 2 2]
[1 3]

The first 4 splits contain all the classes at fit time, so those LogisticRegression models know that the classes are [1, 2, 3]. In the last split, the only sample of class 3 is kept for the test set. That model is therefore trained on only two classes, and its predict_proba output is an (n_samples_test, 2) matrix instead of (n_samples_test, 3).
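The shape mismatch can be reproduced directly by fitting on y with and without the rare class (a minimal sketch; the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.zeros((8, 1))
y_full = [1, 1, 1, 2, 2, 2, 3, 3]     # all three classes present
y_missing = [1, 1, 1, 1, 2, 2, 2, 2]  # class 3 absent, as in the last fold

proba_full = LogisticRegression().fit(X, y_full).predict_proba(X[:1])
proba_missing = LogisticRegression().fit(X, y_missing).predict_proba(X[:1])
print(proba_full.shape)     # (1, 3)
print(proba_missing.shape)  # (1, 2)
```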

So I am not sure that we can do much here because this is a really ill-posed problem.

@glemaitre removed the Needs Triage label Jan 18, 2024
@sedol1339 (Author)

@glemaitre is it possible to append zero probabilities to the output for classes not present in the train set?

@glemaitre (Member)

I don't see an easy way to do so without it being a hack.
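For the record, one such hack could pad the probability matrix with zero columns for unseen classes (an illustrative sketch; padded_predict_proba is a made-up helper, not a scikit-learn API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def padded_predict_proba(est, X, all_classes):
    """Return probabilities with one column per label in all_classes,
    filling zeros for classes the estimator never saw at fit time."""
    all_classes = np.sort(np.asarray(all_classes))
    proba = np.zeros((X.shape[0], len(all_classes)))
    # Map each fitted class to its column index in the full, sorted class set
    cols = np.searchsorted(all_classes, est.classes_)
    proba[:, cols] = est.predict_proba(X)
    return proba

# Fit without class 3, then pad so the output matches (n_samples, 3)
est = LogisticRegression().fit(np.zeros((4, 1)), [1, 1, 2, 2])
proba = padded_predict_proba(est, np.zeros((2, 1)), all_classes=[1, 2, 3])
print(proba.shape)  # (2, 3); the column for class 3 is all zeros
```

Note that zero probabilities would still make log_loss rely on its internal clipping, so this only sidesteps the shape mismatch, not the ill-posedness @glemaitre mentions.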
