Description
Say you have a tie at the highest predicted probability output by the classifier, with some false positives and some true positives, for example:
from sklearn.metrics import precision_recall_curve

y_true = [ 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [.9, .9, .9, .8, .8, .7, .7, .6, .5, .4, .3]
precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
Yields:
Precision: [0.400, 0.333, 0.375, 0.428, 0.400, 0.333, 1.000]
Recall: [1.000, 0.750, 0.750, 0.750, 0.500, 0.250, 0.000]
I understand that recall goes to zero in the limit, but precision might not be so clear-cut. On a really difficult problem where the top prediction is a false positive, I believe precision would go to zero in the limit. For plotting with this function, that case is probably not a big deal: a vertical line from 0 to 1 would be hidden by the y-axis.
But when the top predictions are tied, as seen above, the plot becomes misleading and makes the viewer think that at least their top prediction was a true positive. This may seem like a corner case, but I ran into it on a tough classification problem I was working on and was a little baffled by the output until I checked the code.
A quick fix that preserves the intention of a clean P-R plot might be to draw a horizontal line from the point at the highest threshold to the y-axis; this should not alter the output in most cases like this one. Trying to actually calculate where it goes in the limit might be a bit of overkill :-)
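For illustration, something like the following matplotlib sketch is roughly what I mean (it reuses the precision and recall arrays from the snippet above; the dashed segment is the proposed horizontal extension):

import matplotlib.pyplot as plt

# usual step plot of the P-R curve (continuing from the snippet above)
plt.step(recall, precision, where="post")

# proposed quick fix (sketch): instead of jumping to the (recall=0, precision=1)
# point, extend a horizontal line at the precision of the highest threshold
# back to the y-axis
plt.plot([recall[-2], 0.0], [precision[-2], precision[-2]], linestyle="--")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()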
I would also think that adding a 1 to the end of the thresholds vector would be helpful when plotting both precision and recall against the probabilities.
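Roughly like this (just a sketch, again reusing the variables from the snippet above; the thresholds_padded name is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

# pad thresholds with a trailing 1 so it has the same length as
# precision and recall, which are one element longer
thresholds_padded = np.append(thresholds, 1)

plt.plot(thresholds_padded, precision, label="precision")
plt.plot(thresholds_padded, recall, label="recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()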
I'm happy to open a PR if anyone thinks this is worth addressing.