.to_numpy() for profiles and other binned statistics #511

In https://github.com/scikit-hep/boost-histogram/issues/413, @henryiii and I are currently discussing what the `.to_numpy()` method should return for histograms that have a non-trivial storage and therefore need to return a per-bin estimate of the variance in addition to the value.

We currently return a view into our storage, e.g. for a boost-histogram with `Weight()` storage, that is a record array with `value` and `variance` fields.
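To make the current behaviour concrete, here is a minimal sketch of what such a record array looks like. It uses a plain numpy structured array as a stand-in for the storage view (the dtype and variable names are my own assumptions, not the actual boost-histogram internals); only the `value` and `variance` field names are taken from the description above.

```python
import numpy as np

# Stand-in for a Weight() storage view: one record per bin with two fields.
# This dtype is an illustrative assumption, not the library's real layout.
weight_dtype = np.dtype([("value", "f8"), ("variance", "f8")])
storage = np.zeros(4, dtype=weight_dtype)

# Fill bin 1 with weights 0.5 and 2.0:
# "value" accumulates w, "variance" accumulates w**2.
for w in (0.5, 2.0):
    storage["value"][1] += w
    storage["variance"][1] += w ** 2

# Field access returns views into the same memory, not copies.
values = storage["value"]
variances = storage["variance"]
```

The point is that both fields live in one array, so nothing is lost, but the caller has to know about the record layout to get at them.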

Henry suggests that we should simplify this and only return the values, effectively dropping the uncertainty estimates. I am against this, because for any practical use, the variances are as important as the values. I don't mind dropping meta-data in a `.to_numpy()` conversion, but not *actual data*, which consists equally of values *and* uncertainty estimates.

Henry further makes the valid point that we should be consistent with other libraries, and explicitly mentions uproot4:

> .to_numpy could gain an argument view=True, which would cause it to return the view instead of just the values (the default, to match other libraries like Uproot 4).

I think:
- we have some freedom to decide what `.to_numpy()` conversion means in this case, since we lack a template for this situation from numpy
- the conversion should not discard uncertainty estimates, because this would make `.to_numpy()` unusable in practice
- we should try to agree on a common format for what `.to_numpy()` returns for binned statistics and then use that format consistently in all relevant scikit-hep libraries

If we would normally return `(values, edge0, edge1, ...)`, then we could return `((values, variances), edge0, edge1, ...)` or `((values, sigmas), edge0, edge1, ...)` for a binned statistic, where `values` and `variances` are ordinary numpy arrays.
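A rough sketch of the `((values, variances), edge0, edge1, ...)` layout, again using a numpy structured array as a stand-in for the storage view. The helper name `to_numpy_with_variances` is hypothetical and only illustrates the proposed return shape; it is not an existing function in any library.

```python
import numpy as np

def to_numpy_with_variances(view, *edges):
    """Hypothetical helper: split a (value, variance) record array into
    two ordinary numpy arrays and return them together with the edges."""
    values = np.ascontiguousarray(view["value"])
    variances = np.ascontiguousarray(view["variance"])
    return ((values, variances), *edges)

# Stand-in data: three bins on one axis (illustrative numbers only).
dtype = np.dtype([("value", "f8"), ("variance", "f8")])
view = np.array([(1.0, 1.0), (3.0, 5.0), (2.0, 2.0)], dtype=dtype)
edges = np.linspace(0.0, 3.0, 4)

# The first element unpacks into values and variances; the edges follow.
(values, variances), edge0 = to_numpy_with_variances(view, edges)

# A user who prefers sigmas needs one extra line.
sigmas = np.sqrt(variances)
```

With this shape, code that only wants the values can still unpack the first element, and the edges stay in the same positions as in the plain `(values, edge0, edge1, ...)` case.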

Returning "sigmas" vs. "variances"
- returning sigmas is more intuitive for the casual user
- returning variance makes more sense from a statistics and computational point of view
  - values and variances are additive, sigmas are not
  - uncertainty estimators always compute a variance; returning "sigmas" means doing extra work by taking the square root, which a user who actually wants the variances then has to undo with yet more work by squaring the sigmas
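
The additivity argument can be illustrated with a small numpy example. It assumes two histograms with identical binning that were filled from independent data; the numbers are made up for illustration.

```python
import numpy as np

# Two independently filled histograms with the same binning
# (illustrative numbers, here with variance equal to the count).
values_a = np.array([4.0, 9.0])
variances_a = np.array([4.0, 9.0])
values_b = np.array([16.0, 0.0])
variances_b = np.array([16.0, 0.0])

# Merging: values and variances simply add bin by bin.
merged_values = values_a + values_b
merged_variances = variances_a + variances_b
merged_sigmas = np.sqrt(merged_variances)

# Adding sigmas directly does not give the sigma of the merged histogram.
naive_sigmas = np.sqrt(variances_a) + np.sqrt(variances_b)
```

In the first bin the merged variance is 20, so the correct sigma is sqrt(20), about 4.47, while naively adding the sigmas gives 6. With sigmas as the returned quantity, the user has to square, add, and take the root again.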

Boost.Histogram in C++ therefore always returns variances and not sigmas. In C++, this choice is rather clearly dictated by the strong preference for solutions that do minimal work to achieve some goal ("zero overhead" principle).
