.to_numpy() for profiles and other binned statistics #511
Description
In scikit-hep/boost-histogram#413, @henryiii and I are currently discussing what the .to_numpy() method should return for histograms that have non-trivial storage and need to return an estimate of the variance per bin in addition to the value.
We currently return a view into our storage, e.g. for a boost-histogram with Weight() storage, that is a record array with value and variance fields.
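To make the shape of such a view concrete, here is a minimal sketch that emulates it with a plain numpy structured array instead of calling boost-histogram itself. The field names follow the description above; the dtype and the bin contents are assumptions made up for illustration.

```python
import numpy as np

# Emulated weighted-storage view: one record per bin.
# Field names follow the issue text; float64 fields are an assumption.
storage_dtype = np.dtype([("value", np.float64), ("variance", np.float64)])

# Three bins with illustrative contents (sum of weights, sum of squared weights).
view = np.zeros(3, dtype=storage_dtype)
view["value"] = [1.5, 4.0, 0.5]
view["variance"] = [1.25, 6.0, 0.25]

# Accessing a field gives an ordinary float array that shares memory
# with the record array, so no data is copied.
values = view["value"]
variances = view["variance"]

# A user who wants standard deviations can derive them from the variances.
sigmas = np.sqrt(variances)
```

Returning only the values would correspond to handing out just the first field and dropping the second.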
Henry suggests that we should simplify this and only return the values, effectively dropping the uncertainty estimates. I am against this, because for any practical use, the variances are as important as the values. I don't mind dropping meta-data in a .to_numpy() conversion, but not actual data, which consists equally of values and uncertainty estimates.
Henry further makes the valid point that we should be consistent with other libraries, and he explicitly mentions uproot4 here:
.to_numpy could gain an argument view=True, which would cause it to return the view instead of just the values (the default, to match other libraries like Uproot 4).
I think:
- we have some freedom to decide what .to_numpy() conversion means in this case, since we lack a template for this situation from numpy
- the conversion should not discard uncertainty estimates, because this would make .to_numpy() unusable in practice
- we should try to find a common format for what .to_numpy() should return for binned statistics and then consistently use that format in all relevant scikit-hep libraries
If we would normally return (values, edge0, edge1, ...), then we could return ((values, variances), edge0, edge1, ...) or ((values, sigmas), edge0, edge1, ...) for a binned statistic, where values and variances are ordinary numpy arrays.
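As a sketch of what the proposed ((values, variances), edge0, ...) layout could look like for a one-dimensional weighted histogram, the following uses only numpy.histogram. The function name to_numpy_with_variances and its parameters are hypothetical and not part of any existing library; the per-bin variance is estimated as the sum of squared weights.

```python
import numpy as np

def to_numpy_with_variances(x, weights, bins, bounds):
    """Hypothetical helper: return ((values, variances), edges) for 1D data."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Bin values are the sums of weights per bin.
    values, edges = np.histogram(x, bins=bins, range=bounds, weights=w)
    # The variance estimate per bin is the sum of squared weights.
    variances, _ = np.histogram(x, bins=bins, range=bounds, weights=w ** 2)
    return (values, variances), edges

(values, variances), edges = to_numpy_with_variances(
    x=[0.1, 0.4, 0.6, 0.9],
    weights=[1.0, 2.0, 0.5, 0.5],
    bins=2,
    bounds=(0.0, 1.0),
)
```

Code that only needs the values can still unpack the first tuple element and ignore the second, so the nested layout stays close to the plain numpy convention.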
Returning "sigmas" vs. "variances"
- returning sigmas is more intuitive for the casual user
- returning variance makes more sense from a statistics and computational point of view
- values and variances are additive, sigmas are not
- uncertainty estimators always compute a variance; returning "sigmas" means doing extra work by taking the square root, which a user who actually wants the variances then has to undo with yet more work by squaring the sigmas
Boost.Histogram in C++ therefore always returns variances and not sigmas. In C++, this choice is rather clearly dictated by the strong preference for solutions that do minimal work to achieve some goal ("zero overhead" principle).
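The additivity argument can be checked numerically. The sketch below merges two histograms with identical binning, assuming they were filled independently; the bin contents are made-up illustrative numbers.

```python
import numpy as np

# Two histograms with the same binning, filled independently (assumption).
values_a = np.array([3.0, 1.0])
variances_a = np.array([5.0, 0.5])
values_b = np.array([2.0, 4.0])
variances_b = np.array([2.0, 8.0])

# Values and variances of the merged histogram are plain sums.
values_sum = values_a + values_b
variances_sum = variances_a + variances_b

# With sigmas, the merge needs a square, a sum, and a square root.
sigmas_a = np.sqrt(variances_a)
sigmas_b = np.sqrt(variances_b)
sigmas_sum = np.sqrt(sigmas_a ** 2 + sigmas_b ** 2)

# Adding sigmas directly overestimates the uncertainty of the merged bins.
naive_sigmas_sum = sigmas_a + sigmas_b
```

Storing variances keeps the merge a single addition, which matches the minimal-work preference described above.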