Description
Yesterday, I had a dream in which scikit-learn had implemented the main metrics for prediction intervals.
A review of typical metrics for prediction intervals can be found in this publication (pp. 16-17): 'Review of Deterministic and Probabilistic Wind Power Forecasting: Models, Methods, and Future Research' (https://www.mdpi.com/2673-4826/2/1/2)
Typically, the prediction interval for observation $i$ is written $[\hat{L}_i, \hat{U}_i]$, with $\hat{L}_i$ the lower bound and $\hat{U}_i$ the upper bound.
- PICP (Prediction Interval Coverage Probability): $\mathrm{PICP} = \frac{1}{N}\sum_{i=1}^{N} c_i$, with $c_i = 1$ if $y_i \in [\hat{L}_i, \hat{U}_i]$ and $c_i = 0$ otherwise. PICP is maybe not so useful in itself, but it is needed to calculate the ACE below, which is really the critical metric.
- PINC (Prediction Interval Nominal Coverage): the nominal coverage level, e.g. 90% if the quantiles you predict are 5% and 95%. PINC is essentially a preliminary definition needed for computing the ACE below.
- ACE (Average Coverage Error): $\mathrm{ACE} = \mathrm{PICP} - \mathrm{PINC}$. This is the metric whose absence feels the most critical. It represents how much the interval can be trusted: for example, when we say that we compute a 90% confidence interval, we want close to 90% of the test points to actually lie in that interval. For this reason, this metric is usually said to measure the reliability of the interval (a minimal sketch of PICP and ACE is given after this list).
- PINAW (Prediction Interval Normalized Average Width): $\mathrm{PINAW} = \frac{1}{N R}\sum_{i=1}^{N} (\hat{U}_i - \hat{L}_i)$, with $R$ the range of the observed target. This is the second critical metric that would need to be implemented. It measures the sharpness of the interval, which is very complementary to the ACE: useful intervals are of course the ones that are reliable, but also the ones that are narrow. The ACE does not measure how narrow/sharp the interval is; this needs to be measured by the PINAW (or the interval score, see below and the sketch after this list).
- CWC (Coverage Width Criterion): $\mathrm{CWC} = \mathrm{PINAW}\left(1 + \gamma(\mathrm{PICP})\, e^{-\eta(\mathrm{PICP} - \mu)}\right)$, where $\mu$ usually equals the PINC, $\eta$ is a penalty coefficient, and $\gamma(\mathrm{PICP})$ is an indicator equal to 1 when $\mathrm{PICP} < \mu$ and 0 otherwise. I have put this metric here for reference because it is cited in the article. For the moment I personally have not used it, as it combines the ACE and PINAW in a way that still needs to be clarified (to me at least, before I would choose to use it).
- Interval score (this metric does not come from the same publication as the others; it comes from Wan, C., Xu, Z., Pinson, P., Dong, Z. Y., & Wong, K. P. (2014). Probabilistic forecasting of wind power generation using extreme learning machine. IEEE Transactions on Power Systems, 29(3). https://doi.org/10.1109/TPWRS.2013.2287871). For a PI with nominal coverage $100(1-\alpha)\%$:

  $$S_t = \begin{cases} -2\alpha\delta_t - 4(\hat{L}_t - y_t) & \text{if } y_t < \hat{L}_t \\ -2\alpha\delta_t & \text{if } y_t \in [\hat{L}_t, \hat{U}_t] \\ -2\alpha\delta_t - 4(y_t - \hat{U}_t) & \text{if } y_t > \hat{U}_t \end{cases}$$

  with $\delta_t = \hat{U}_t - \hat{L}_t$ defining the width of the PI. This is probably the third metric I would find very interesting to add. The reason is that the interval score adds to the ACE and PINAW the consideration of how far the observed value falls from the interval when it is not inside it. This makes it possible to discriminate between two models whose observations fall outside the interval, when the misses of one model lie closer to the interval's boundaries than those of the other (see the sketch after this list).
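For concreteness, here is a minimal sketch of how PICP and ACE could be computed. The function names `picp` and `ace` are hypothetical, not an existing scikit-learn API:

```python
import numpy as np

def picp(y_true, y_lower, y_upper):
    """Prediction Interval Coverage Probability: the fraction of
    observations that fall inside [y_lower, y_upper]."""
    y_true = np.asarray(y_true)
    covered = (np.asarray(y_lower) <= y_true) & (y_true <= np.asarray(y_upper))
    return covered.mean()

def ace(y_true, y_lower, y_upper, pinc):
    """Average Coverage Error: PICP minus the nominal coverage PINC
    (given as a fraction, e.g. 0.9 for the 5%/95% quantile pair).
    Values near 0 mean the interval is reliable; negative values mean
    it covers the observations less often than advertised."""
    return picp(y_true, y_lower, y_upper) - pinc
```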
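A similarly hypothetical sketch for PINAW, assuming the normalization constant $R$ is taken as the range of the observed test targets:

```python
import numpy as np

def pinaw(y_true, y_lower, y_upper):
    """Prediction Interval Normalized Average Width: the mean interval
    width divided by the range R of the observed target, so that the
    sharpness score is comparable across differently scaled datasets."""
    y_true = np.asarray(y_true)
    widths = np.asarray(y_upper) - np.asarray(y_lower)
    return widths.mean() / (y_true.max() - y_true.min())
```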
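And a sketch of the interval score in the negatively oriented formulation of Wan et al. (2014), where higher values (closer to zero) are better; again, the function name is only illustrative:

```python
import numpy as np

def interval_score(y_true, y_lower, y_upper, alpha):
    """Interval score for a 100*(1-alpha)% PI, averaged over samples.
    Inside the interval the score is -2*alpha*width (narrower intervals
    are penalized less); observations outside the interval receive an
    extra penalty proportional to their distance to the violated bound."""
    y_true = np.asarray(y_true)
    y_lower = np.asarray(y_lower)
    y_upper = np.asarray(y_upper)
    delta = y_upper - y_lower  # width of the PI
    score = -2.0 * alpha * delta
    score -= 4.0 * (y_lower - y_true) * (y_true < y_lower)  # below the PI
    score -= 4.0 * (y_true - y_upper) * (y_true > y_upper)  # above the PI
    return score.mean()
```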
Currently, scikit-learn makes it possible to compute such prediction intervals, typically with GradientBoostingRegressor and the quantile loss, but evaluating the validity of these intervals cannot be done through scikit-learn as smoothly as evaluating deterministic forecasts with MSE, MAE, etc. (see the end-to-end example below). Implementing several of these metrics would close that gap.
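To illustrate the current workflow, here is a self-contained example, with the metrics hand-rolled since nothing in `sklearn.metrics` covers them today, of how such an interval is typically produced and evaluated:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One model per bound: the 5% and 95% quantiles give a 90% PI (PINC = 0.9).
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X_train, y_train)
y_lower, y_upper = lower.predict(X_test), upper.predict(X_test)

# Today these metrics have to be computed by hand:
picp = np.mean((y_lower <= y_test) & (y_test <= y_upper))
ace = picp - 0.90
pinaw = np.mean(y_upper - y_lower) / (y_test.max() - y_test.min())
print(f"PICP={picp:.3f}  ACE={ace:+.3f}  PINAW={pinaw:.3f}")
```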