Many redundant prediction probabilities for test instances with sparse SVM #3266
I am guessing you're hitting the issue that scipy sparse matrix indices are int32, so you have an overflow... I am not sure of the status of scipy.sparse with int64 indices. I would use dimensionality reduction with univariate feature selection to avoid it.
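For reference, a minimal sketch of that kind of univariate selection on a sparse matrix (the toy data, the `k` value, and the scoring function here are placeholders, not taken from this thread):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import SelectKBest, chi2

# Toy sparse data: 200 samples, 10,000 mostly-zero, non-negative features.
rng = np.random.RandomState(0)
X = sparse_random(200, 10000, density=0.01, format='csr', random_state=rng)
y = rng.randint(0, 2, size=200)

# Keep only the 1,000 features most associated with the target;
# chi2 accepts sparse, non-negative input directly.
X_reduced = SelectKBest(chi2, k=1000).fit_transform(X, y)
print(X_reduced.shape)  # (200, 1000)
```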
I don't think that's the issue, as I have 20k training instances, and almost all of them have < 5k features that are populated, with the majority having only hundreds populated. There are 4 million total features, so I could understand if the vectors weren't so sparse, but since a majority of that 4 million is zeroes, I don't see that approaching those limits.
I don't believe that #3268 is a fix for the particular issue I'm seeing, as the issue lies not in the roc_curve function, but rather in the predict_proba function for sparse SVM. The ROC calculation is correct; it is the predictions that seem to have an issue.
To clarify this point: do you average the ROC curves of the 5 CV folds, or do you concatenate the test predictions of the 5 folds before computing the ROC curve?
That plot was produced by concatenating the probability predictions for all instances, as I was first alerted to the problem after averaging the ROC curves for each fold.
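For readers following the thread, the two options differ roughly as in this sketch (illustrative only; the per-fold labels and scores here are random stand-ins, not the reporter's data):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.RandomState(0)
# Stand-ins for per-fold (true labels, predicted scores) from 5-fold CV.
folds = [(rng.randint(0, 2, 50), rng.rand(50)) for _ in range(5)]

# Option 1: concatenate the out-of-fold predictions, then compute one ROC curve.
y_all = np.concatenate([y for y, _ in folds])
p_all = np.concatenate([p for _, p in folds])
fpr_pooled, tpr_pooled, _ = roc_curve(y_all, p_all)

# Option 2: compute one ROC curve per fold, then average the TPRs on a common FPR grid.
mean_fpr = np.linspace(0, 1, 100)
tprs = []
for y_fold, p_fold in folds:
    fpr, tpr, _ = roc_curve(y_fold, p_fold)
    tprs.append(np.interp(mean_fpr, fpr, tpr))
mean_tpr = np.mean(tprs, axis=0)
```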
@cbmeyer can you isolate the problem by computing the probabilities with libsvm directly?
I am not sure whether we can fix the randomness of the internal CV used by libsvm to perform the Platt scaling, though.
What is your feeling about this, @hamsal?
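For what it's worth, in more recent scikit-learn versions the random_state parameter of SVC seeds the data shuffling behind the internal Platt-scaling cross-validation, so the probability estimates can at least be made reproducible across runs; a minimal sketch (the dataset and parameter values are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
y = (y == 0)

# probability=True triggers libsvm's internal CV for Platt scaling;
# random_state makes the shuffling behind it deterministic across runs.
clf = SVC(kernel='rbf', C=100, gamma=1e-6, probability=True, random_state=0)
proba = clf.fit(X, y).predict_proba(X)[:, 1]
print(proba[:5])
```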
There is probably a bug in the sparse predict_proba path.
Hum, that is not good :-/ I'll try to have a look later today.
I cannot reproduce the step-function behavior. Can you get that behavior with a dataset you can share? Or can you share your input matrices?
I cannot reproduce either. Here is the script I used:

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.metrics import roc_curve
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import scale
from sklearn import svm

digits = load_digits()
X_scaled = scale(digits.data)
y = digits.target == 0
y[np.random.rand(y.shape[0]) > 0.9] = 1

C_val = 100
gamma_val = 1e-6

skf = StratifiedKFold(y, n_folds=5)

predictions = "predict_proba"

for fold, (train_index, test_index) in enumerate(skf):
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # train on the subset for this fold
    print('Training on fold ' + str(fold))
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
                         probability=True)
    if predictions == 'predict_proba':
        y_pred = classifier.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    elif predictions == 'decision_function':
        y_pred = classifier.fit(X_train, y_train).decision_function(X_test)
    else:
        raise ValueError(predictions)

    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    plt.plot(fpr, tpr, label='ROC curve of fold {0}'.format(fold))

plt.title("ROC curves computed with " + predictions)
plt.legend(loc='best')
plt.show()
```

Here is the outcome with predict_proba: [plot] and with decision_function: [plot]
I retried with sparse input as follows:

```python
import numpy as np
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.metrics import roc_curve
from sklearn.cross_validation import StratifiedKFold
# from sklearn.preprocessing import scale
from sklearn import svm

digits = load_digits()
X_scaled = csr_matrix(digits.data)
y = digits.target == 0
y[np.random.rand(y.shape[0]) > 0.9] = 1

C_val = 100
gamma_val = 1e-6

skf = StratifiedKFold(y, n_folds=5)

predictions = "predict_proba"

for fold, (train_index, test_index) in enumerate(skf):
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # train on the subset for this fold
    print('Training on fold ' + str(fold))
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
                         probability=True)
    if predictions == 'predict_proba':
        y_pred = classifier.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    elif predictions == 'decision_function':
        y_pred = classifier.fit(X_train, y_train).decision_function(X_test)
    else:
        raise ValueError(predictions)

    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    plt.plot(fpr, tpr, label='ROC curve of fold {0}'.format(fold))

plt.title("ROC curves computed with " + predictions)
plt.legend(loc='best')
plt.show()
```

The output looks very similar.
Sorry, I did not mean to click on the close button...
Here is a link to the feature vectors, in LIBSVM format: https://www.dropbox.com/s/ovqg5t3v0o0mnn8/sampled_round2.data

I should also note that I'm scaling the data like so:
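The scaling snippet itself did not survive here; given the later comment that StandardScaler turned out to be the culprit, it was presumably something along these lines (a guessed reconstruction, not the original code):

```python
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import StandardScaler

# Load the LIBSVM-format feature vectors linked above.
X, y = load_svmlight_file("sampled_round2.data")

# with_mean=False is required for sparse input (centering would densify the matrix),
# so this scales each feature to unit variance without shifting it.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
```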
Thanks :)
@cbmeyer what were your gamma and C? (sorry for the very slow reply)
Also, reproducing with a smaller dataset would be helpful; this one is huge (which is why I didn't do a grid search ;)
The C value is 94.1605007514 and the gamma value is 0.00456579534043. I'll see if I can put together a smaller dataset that reproduces the issue.
I used these parameters and could reproduce a lot of duplicate probability estimates.
Interesting, the plot I posted above on June 25 uses the -b flag in libsvm, and shows drastically different results. I'm not sure how you got the same result output, but mine is definitely much different (see above post). Just to reiterate from the first post, the reason I discovered this behavior was that the decision_function() method is not supported for sparse vectors, so I was forced to use this route to produce results. If I could use the distance from the boundary instead, it would be a far better solution than incurring the overhead of the Platt scaling.
Indeed. If you feel motivated, a PR to add support for this would be very well received, I think (with tests and all). Coming back to the original issue, let us know if you can find a tuple of (dataset, C, gamma) that yields very distinct behaviors between sklearn and libsvm. Otherwise I think we should close this issue.
The dataset, C, and gamma I've provided produced drastically different results for me between sklearn and libsvm, as outlined in my earlier posts.
@cbmeyer I am working on implementing the decision_function, but I can't promise anything. Without knowing how you split or scaled the data, I cannot reproduce your results.
I used libsvm-3.12-1 as bundled with Ubuntu (I think; I won't get to the laptop for a week now).
I'm using libsvm 3.18. For the scaling and cross-fold splits, I do something like this in a bash script:
Here's a more complete version of my sklearn Python code:
Thanks.
Hm, not sure.
Thanks for the input, I'll try loading the libsvm-scaled data in directly and see how that fares.
After feeding the libsvm-scaled data in directly, it looks like I get the prediction values I expect, so the StandardScaler looks like the culprit. Thanks for the input, this was very helpful!
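A plausible explanation for the difference, sketched below as an assumption rather than a verified diagnosis: libsvm's svm-scale maps each feature into a fixed range such as [0, 1], whereas StandardScaler with with_mean=False only divides by the per-feature standard deviation without centering, so the two preprocessings hand quite different feature magnitudes to the RBF kernel for the same C and gamma.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = csr_matrix(np.array([[0.0, 1.0, 200.0],
                         [0.0, 3.0, 800.0],
                         [5.0, 0.0, 0.0]]))

# Roughly what svm-scale does: map each feature into [0, 1] (dense input needed here).
X_minmax = MinMaxScaler().fit_transform(X.toarray())

# What StandardScaler(with_mean=False) does on sparse data: divide by the per-feature
# standard deviation, without centering.
X_std = StandardScaler(with_mean=False).fit_transform(X).toarray()

print(X_minmax)
print(X_std)  # note the different ranges, e.g. values well above 1 in the last column
```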
Hi, I am using version 0.15.2.

I took the data from the following Kaggle challenge: http://www.kaggle.com/c/datasciencebowl

Per my reading of the scikit-learn documentation and other references online, I used the below code (excerpted) to extract features, train/test, and predict probabilities for the provided test data set:

```python
# Coding starts
# Variables
total_train_list = []
# Imports
# Nudge data by convolving
# Extract train data
total_train_array = np.asarray(total_train_list, 'float32')
# Scale the data
# Instantiate the model pipeline
classifier = Pipeline(steps=[('rbm', rbm), ('svm', svc)])
# Training
rbm.learning_rate = 0.06
print "Model Building Started"
# Train the RBM-SVM pipeline
classifier.fit(nudged_scaled_train, nudged_target)
print "Model Building Completed"
joblib.dump(classifier, '/home/ubuntu/kaggle/nationaldatasciencebowl-feb2015/model/kaggle-plankton-cnn-svm-1.pkl')
# Extract test data
print "Feature extraction on Test data Started"
# Scale test data
# Load the model
# Predict the probabilities and class
dist_target = set(target)
planktons_output_class = []
df1 = pandas.DataFrame(planktons_output)
df1.to_csv("planktons_output.csv")
```

The output probabilities returned by predict_proba are marginally different, but they are broadly the same for the first 6-7 decimal places:

```
3980.jpg,0.029336030196937507,0.0004290310973272736,.......(other probs for other classes)
```

It would also be helpful to know the class name for each predicted probability. Is it possible to tag the class name associated with each probability predicted above? Please help me understand. Any help would be much appreciated.
Can you be more specific about what your problem is?
Thanks for the note.
I'm not sure what you mean by that. Well, as mentioned in many places, the predict_proba method uses Platt scaling, which is not great.
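On the class-name question above: the columns returned by predict_proba follow the order of the fitted estimator's classes_ attribute, so the two can simply be zipped together. A minimal illustration with a toy dataset (not the Kaggle pipeline from the question):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])[0]
# clf.classes_ gives the label that corresponds to each probability column.
for label, p in zip(clf.classes_, proba):
    print(label, p)
```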
I’m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross-validation, and when I plot an ROC curve for the folds, the results look very strange: there are a handful of clustered points on the graph. Here is my cross-validation code; I based it on the examples on the scikit-learn website:
I’m just trying to figure out if there’s something I’m obviously missing here, since I used this same training set and SVM parameters with libsvm and got much better results. When I used libsvm and printed out the distances from the hyperplane for the CV test instances and then plotted the ROC, it came out much more like I expected, with a much better AUC. Since the decision_function() method is not supported for sparse matrices, I cannot recreate this functionality in scikit-learn, and therefore have to rely on the prediction probabilities.
There are 20k instances total, 10k positive and 10k negative, and I'm using 5-fold cross-validation. In the cross-validation results, there are several prediction values for which there are 1k-2k samples that all have the same prediction value, and there are only 3600 distinct prediction values over all of the folds for cross-validation. The resulting ROC looks like five big stair steps, with some little bits of fuzziness around the inner corners.
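As an aside, the amount of duplication is easy to quantify once the out-of-fold probabilities are concatenated; a small check along these lines (the array here is a random stand-in, not the actual predictions):

```python
import numpy as np

# Stand-in for the concatenated out-of-fold probabilities; rounding creates ties,
# mimicking the duplicated prediction values described above.
y_pred_all = np.random.RandomState(0).rand(20000).round(3)

values, counts = np.unique(y_pred_all, return_counts=True)
print(len(values), "distinct prediction values out of", len(y_pred_all))
print("largest group of identical predictions:", counts.max())
```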
I have many sparse features, so I'm hashing those into index ranges for different types of feature subsets, so one feature subset will be in the index range 1 million to 2 million, the next will be in the range 2 million to 3 million, etc.
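Roughly this kind of scheme, sketched here with made-up subset names and block sizes rather than the reporter's actual hashing code:

```python
from scipy.sparse import csr_matrix

# Hypothetical feature subsets, each given its own 1M-wide block of column indices.
SUBSET_OFFSETS = {"tokens": 1000000, "bigrams": 2000000}
BLOCK_SIZE = 1000000

def feature_index(subset, feature_name):
    # Hash the feature name into the block reserved for its subset.
    # (Python's hash() of str is randomized per process unless PYTHONHASHSEED is set.)
    return SUBSET_OFFSETS[subset] + hash(feature_name) % BLOCK_SIZE

# One instance with a couple of features from each subset, in a 4M-column sparse matrix.
cols = [feature_index("tokens", "foo"), feature_index("bigrams", "foo bar")]
X = csr_matrix(([1.0, 1.0], ([0, 0], cols)), shape=(1, 4000000))
print(X.nnz, X.shape)
```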