
CalibratedClassifierCV with mode = 'isotonic' has predict_proba return infinite probabilities #10903


Closed
LotusZephyr opened this issue Apr 2, 2018 · 11 comments · Fixed by #18639
Labels: Bug · Easy (Well-defined and straightforward way to resolve) · Sprint

LotusZephyr commented Apr 2, 2018

Description

I am using scikit-learn's CalibratedClassifierCV with GaussianNB() to run binary classification on some data. When I run .predict_proba(X_test), the probabilities returned for some of the samples are -inf or inf.

This came to light when I tried running brier_score_loss on the resulting predictions, and it threw a ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
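
As a tiny self-contained illustration of that error (with made-up values rather than my actual data):

import numpy as np
from sklearn.metrics import brier_score_loss

# A single non-finite probability is enough to trip the metric's input validation.
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, np.inf, 0.2])
brier_score_loss(y_true, y_prob)
# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').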

I have posted this issue as a question on StackOverflow (see this link). According to an answer there, the problem stems from the linear interpolation used internally by the 'isotonic' method, which cannot handle extreme values.

Code to Reproduce

I have added some data to this Google drive link. It's larger than what I wanted but I couldn't get consistent reproduction with smaller datasets.

The reproduction code is below. There is some randomness in it, so if no infinite values are found, try running it again; in my experiments it finds them on the first try.

from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

loaded = np.load('data.npz')
X = loaded['X']
y = loaded['y']

num = 2 * 10**4
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
cal_classifier = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=sss)

classifier_fit = cal_classifier.fit(X[:num], y[:num])
predicted_probabilities = classifier_fit.predict_proba(X[num:num + num // 4])[:, 1]

# Show any non-finite (inf / -inf / NaN) calibrated probabilities
print(predicted_probabilities[~np.isfinite(predicted_probabilities)])

Expected Results

Calculated probabilities lie within [0, 1].

Actual Results

Some entries have infinite probabilities.

Versions

Linux-4.9.81-35.56.amzn1.x86_64-x86_64-with-glibc2.9
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1

jnothman (Member) commented Apr 2, 2018

Thanks for the report

@aishgrt1 (Contributor):

@amueller Shall I take this?

amueller added the Easy (Well-defined and straightforward way to resolve) label on Aug 5, 2019
@jayybhatt (Contributor):

I can take this up. NYC WiMLDS

jfrank94 commented Aug 24, 2019

I'm going to take a stab at this and will edit this comment for updates. -NYC WiMLDS

Update #1: I'm currently scouring through both the calibration.py and isotonic.py files, and I can't yet figure out what makes the code produce infinite values. For example, y_min and y_max default to None, and the if block in isotonic.py that handles the None case sets them to -inf/+inf. I am also looking through the predict_proba code below, since it might be the piece that sets it off. I'll do more testing on predict_proba() to see if any infinite values come up. @jay-z007, any thoughts from your end?

def predict_proba(self, X):
    """Posterior probabilities of classification

    This function returns posterior probabilities of classification
    according to each class on an array of test vectors X.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The samples.

    Returns
    -------
    C : array, shape (n_samples, n_classes)
        The predicted probas. Can be exact zeros.
    """
    n_classes = len(self.classes_)
    proba = np.zeros((X.shape[0], n_classes))

    df, idx_pos_class = self._preproc(X)

    for k, this_df, calibrator in \
            zip(idx_pos_class, df.T, self.calibrators_):
        if n_classes == 2:
            k += 1
        proba[:, k] = calibrator.predict(this_df)

    # Normalize the probabilities
    if n_classes == 2:
        proba[:, 0] = 1. - proba[:, 1]
    else:
        proba /= np.sum(proba, axis=1)[:, np.newaxis]

    # XXX : for some reason all probas can be 0
    proba[np.isnan(proba)] = 1. / n_classes

    # Deal with cases where the predicted probability minimally exceeds 1.0
    proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0

    return proba
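
To poke at the y_min/y_max part in isolation, here is a small sketch using only the public IsotonicRegression API (the toy values are made up):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# With y_min/y_max left at their defaults (None), the internal bounds are
# treated as infinite, so finite predictions depend entirely on the
# interp1d-based interpolation function behaving well.
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(np.array([0.0, 0.1, 0.2, 1.0]), np.array([0.0, 0.0, 1.0, 1.0]))
print(iso.predict(np.array([-1.0, 0.15, 2.0])))  # edges clipped to the training range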

@amueller (Member):

@jay-z007 @jfrank94 please make sure you coordinate.

@NurzatRakhman:

I will take this one


NurzatRakhman commented Jan 25, 2020

@ImenRajhi and I have so far found that it is caused by the scipy.interpolate.interp1d() function used in isotonic.py at line 249:

self.f_ = interpolate.interp1d(X, y, kind='linear',
                               bounds_error=bounds_error)

It is triggered when values in the X input are extremely close together. If you add fill_value="extrapolate" to that line, you instead get: .../scipy/interpolate/interpolate.py:609: RuntimeWarning: overflow encountered in true_divide

  slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

Since (x_hi - x_lo) is extremely small, dividing by it causes the overflow.
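
A minimal standalone sketch of that overflow, without scikit-learn (the x values are artificial, chosen only to force a tiny x_hi - x_lo gap):

import numpy as np
from scipy import interpolate

# Two x values separated by a subnormal gap: the slope
# (y_hi - y_lo) / (x_hi - x_lo) = 0.5 / 1e-320 overflows float64 to inf.
x = np.array([0.0, 1e-320, 1.0])
y = np.array([0.0, 0.5, 1.0])

f = interpolate.interp1d(x, y, kind='linear', fill_value="extrapolate")
print(f(5e-321))  # inf, alongside the RuntimeWarning about the overflow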

@thomasjpfan (Member):

@NurzatRakhman what versions of NumPy, SciPy, and scikit-learn are you running?


NurzatRakhman commented Jan 30, 2020

@thomasjpfan

Libraries (the latest as of writing):
scipy 1.4.1
scikit-learn 0.23.dev0
numpy 1.16.5

When you run the code with the given data, set the random seed of the split method to 22:
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=22)

Then you will see that the predicted probabilities at indices [1111, 2057] are inf, and you can observe the behaviour I described above.
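
Until this is fixed, a possible stop-gap (a sketch only, not the real fix) is to clip the calibrated probabilities before scoring, since np.clip maps inf to 1.0 and -inf to 0.0:

import numpy as np

# Stop-gap only: this masks the bug rather than fixing it. Uses the
# predicted_probabilities array from the reproduction script above.
safe_proba = np.clip(predicted_probabilities, 0.0, 1.0)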

@yashika51 (Contributor):

I'm working on this issue.

@yashika51 (Contributor):

Take
