Description
I am using scikit-learn's `CalibratedClassifierCV` with `GaussianNB()` to run binary classification on some data. When I run `.predict_proba(X_test)`, the probabilities returned for some of the samples are `-inf` or `inf`.

This came to light when I tried running `brier_score_loss` on the resulting predictions, and it threw a `ValueError: Input contains NaN, infinity or a value too large for dtype('float64')`.
I have posted this issue as a question on StackOverflow (see this link). According to an answer there, the problem stems from the isotonic regression used by the `'isotonic'` calibration method being unable to handle values outside the range it was fitted on.
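That behaviour can be illustrated directly with `sklearn.isotonic.IsotonicRegression`. This is a minimal sketch of the out-of-range issue, not the exact code path inside `CalibratedClassifierCV`: with the default `out_of_bounds='nan'`, inputs outside the range seen during fitting map to non-finite predictions, while `out_of_bounds='clip'` keeps them inside the fitted range.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

X_fit = [1.0, 2.0, 3.0, 4.0]
y_fit = [0.0, 0.0, 1.0, 1.0]

# Default out_of_bounds='nan': inputs outside [1, 4] give non-finite output
iso_nan = IsotonicRegression(out_of_bounds='nan').fit(X_fit, y_fit)
print(iso_nan.predict([0.5, 5.0]))   # contains NaN

# out_of_bounds='clip': out-of-range inputs are clipped to the fitted range
iso_clip = IsotonicRegression(out_of_bounds='clip').fit(X_fit, y_fit)
print(iso_clip.predict([0.5, 5.0]))  # finite values in [0, 1]
```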
Code to Reproduce
I have added some data to this Google drive link. It is larger than I would have liked, but I could not reproduce the problem consistently with smaller datasets.

The reproduction code is below. The shuffle split is random, so if no infinities are found, try running it again; in my experiments they appeared on the first try.
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

loaded = np.load('data.npz')
X = loaded['X']
y = loaded['y']

num = 2 * 10**4  # fit on the first 20,000 samples

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
cal_classifier = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=sss)
classifier_fit = cal_classifier.fit(X[:num], y[:num])

# Probability of the positive class for the next 5,000 samples
predicted_probabilities = classifier_fit.predict_proba(X[num:num + num // 4])[:, 1]

# Show the entries that are not finite (inf, -inf, or NaN)
predicted_probabilities[np.argwhere(~np.isfinite(predicted_probabilities))]
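Until the underlying issue is fixed, a defensive workaround (my own sketch, not part of scikit-learn) is to sanitize the predicted probabilities before passing them to `brier_score_loss`: replace NaN with a neutral value and rely on clipping to map the infinities back into [0, 1].

```python
import numpy as np

p = np.array([0.1, np.inf, 0.7, -np.inf, np.nan])

# Replace NaN first (np.clip would leave NaN untouched),
# then clipping maps +inf to 1.0 and -inf to 0.0.
p_clean = np.clip(np.where(np.isnan(p), 0.5, p), 0.0, 1.0)
print(p_clean)  # [0.1 1.  0.7 0.  0.5]
```

Note that this only hides the symptom for scoring purposes; the calibrated probabilities for those samples are still meaningless.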
Expected Results
Probabilities calculated all lie within [0, 1].
Actual Results
Some entries have infinite probabilities.
Versions
Linux-4.9.81-35.56.amzn1.x86_64-x86_64-with-glibc2.9
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1