Description
Say you have a tie at the highest predicted probability output by the classifier, with some false positives and some true positives, for example:
from sklearn.metrics import precision_recall_curve

y_true = [ 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [.9, .9, .9, .8, .8, .7, .7, .6, .5, .4, .3]
precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
Yields:
Precision: [0.400, 0.333, 0.375, 0.428, 0.400, 0.333, 1.000]
Recall: [1.000, 0.750, 0.750, 0.750, 0.500, 0.250, 0.000]
I understand that recall goes to zero in the limit, but precision might not be so clear-cut. On a really difficult problem where the top prediction is a false positive, I believe precision would go to zero in the limit. For plotting with this function, that case is probably not a big deal: a vertical line from 0 to 1 would be hidden by the y-axis.
But when the top predictions are tied, as seen above, the plot becomes misleading and makes the viewer think that at least their top prediction was a true positive. This may seem like a corner case, but I ran into it on a tough classification problem I was working on and was a little baffled by the output until I checked the code.
A quick fix that preserves the intention of a clean P-R plot might be to draw a horizontal line from the point at the highest threshold to the y-axis; this should not alter the output in most cases like this one. Trying to actually calculate where it goes in the limit might be a bit of overkill :-)
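For illustration, something like the following matplotlib sketch is roughly what I mean (it reuses the precision and recall arrays from the snippet above; the dashed segment is the proposed horizontal extension):

import matplotlib.pyplot as plt

# usual step plot of the P-R curve (continuing from the snippet above)
plt.step(recall, precision, where="post")

# proposed quick fix (sketch): instead of jumping to the (recall=0, precision=1)
# point, extend a horizontal line at the precision of the highest threshold
# back to the y-axis
plt.plot([recall[-2], 0.0], [precision[-2], precision[-2]], linestyle="--")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()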
I would also think that adding a 1 to the end of the thresholds vector would be helpful when plotting both precision and recall against the probabilities.
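Roughly like this (just a sketch, again reusing the variables from the snippet above; the thresholds_padded name is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

# pad thresholds with a trailing 1 so it has the same length as
# precision and recall, which are one element longer
thresholds_padded = np.append(thresholds, 1)

plt.plot(thresholds_padded, precision, label="precision")
plt.plot(thresholds_padded, recall, label="recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()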
I'm happy to open a PR if anyone thinks this is worth addressing.