This repository was archived by the owner on Jun 21, 2022. It is now read-only.

.to_numpy() for profiles and other binned statistics #511

@HDembinski

In scikit-hep/boost-histogram#413, @henryiii and I are discussing what the .to_numpy() method should return for histograms that use a non-trivial storage and therefore need to return a variance estimate per bin in addition to the value.

We currently return a view into our storage. For a boost-histogram with Weight() storage, for example, that is a record array with value and variance fields.
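To make the shape of such a view concrete, here is a minimal sketch that uses a plain numpy structured array as a stand-in for a Weight() storage. The helper `fill_weighted` and the dtype name `weight_dtype` are hypothetical and not part of boost-histogram. The sketch only assumes the usual convention that a weighted bin accumulates the sum of weights as its value and the sum of squared weights as its variance estimate.

```python
import numpy as np

# Stand-in for a Weight() storage view: one (value, variance) record per bin.
weight_dtype = np.dtype([("value", np.float64), ("variance", np.float64)])


def fill_weighted(edges, x, w):
    """Hypothetical helper: fill a 1D weighted histogram into a record array.

    Simplification: bins are half-open [low, high), so a sample exactly on the
    last edge is dropped, and there are no underflow/overflow bins.
    """
    storage = np.zeros(len(edges) - 1, dtype=weight_dtype)
    idx = np.searchsorted(edges, x, side="right") - 1
    inside = (idx >= 0) & (idx < len(storage))
    # Sum of weights is the bin value, sum of squared weights its variance.
    np.add.at(storage["value"], idx[inside], w[inside])
    np.add.at(storage["variance"], idx[inside], w[inside] ** 2)
    return storage


edges = np.array([0.0, 1.0, 2.0, 3.0])
view = fill_weighted(
    edges,
    x=np.array([0.5, 0.5, 1.5, 5.0]),
    w=np.array([1.0, 2.0, 3.0, 4.0]),
)
print(view["value"], view["variance"])
```

Both fields are reachable by name, but neither is an ordinary standalone array, which is the part under discussion here.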

Henry suggests that we should simplify this and only return the values, effectively dropping the uncertainty estimates. I am against this, because for any practical use, the variances are as important as the values. I don't mind dropping meta-data in a .to_numpy() conversion, but not actual data, which consists equally of values and uncertainty estimates.

Henry further makes the valid point that we should be consistent with other libraries, and he explicitly mentions uproot4:

.to_numpy could gain an argument view=True, which would cause it to return the view instead of just the values (the default, to match other libraries like Uproot 4).

I think:

  • we have some freedom to decide what .to_numpy() conversion means in this case, since we lack a template for this situation from numpy
  • the conversion should not discard uncertainty estimates, because this would make .to_numpy() unusable in practice
  • we should agree on a common format for what .to_numpy() returns for binned statistics and then use that format consistently in all relevant scikit-hep libraries

If we normally return (values, edge0, edge1, ...), then for a binned statistic we could return ((values, variances), edge0, edge1, ...) or ((values, sigmas), edge0, edge1, ...), where values and variances (or sigmas) are ordinary numpy arrays.
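A minimal sketch of the ((values, variances), edge0, ...) variant follows. The function `to_numpy_with_variances` is a hypothetical name used for illustration only, and the input is a hand-built record array with value and variance fields rather than a real histogram.

```python
import numpy as np


def to_numpy_with_variances(view, *edges):
    """Hypothetical helper: unpack a record-array view into plain arrays.

    Returns ((values, variances), edge0, edge1, ...).
    """
    # Field access yields strided views; copy them into ordinary arrays.
    values = np.ascontiguousarray(view["value"])
    variances = np.ascontiguousarray(view["variance"])
    return ((values, variances),) + tuple(edges)


# Hand-built stand-in for a 1D weighted histogram with three bins.
view = np.zeros(3, dtype=[("value", np.float64), ("variance", np.float64)])
view["value"] = [3.0, 3.0, 0.0]
view["variance"] = [5.0, 9.0, 0.0]
edges = np.array([0.0, 1.0, 2.0, 3.0])

(values, variances), edge0 = to_numpy_with_variances(view, edges)

# A user who prefers sigmas can derive them in one line.
sigmas = np.sqrt(variances)
```

With this layout the first element still unpacks cleanly, and code that only needs the values can ignore the second member of the inner tuple.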

Returning "sigmas" vs. "variances"

  • returning sigmas is more intuitive for the casual user
  • returning variances makes more sense from a statistical and computational point of view
    • values and variances are additive, sigmas are not
    • uncertainty estimators always compute a variance; returning "sigmas" means doing extra work by taking the square root, which a user who actually wants the variances then has to undo by squaring the sigmas
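The additivity argument can be illustrated with a small rebinning sketch. The numbers are made up, and merging adjacent bins pairwise is just one example of an operation that adds bin contents.

```python
import numpy as np

# Made-up bin contents; variances equal values, as for unweighted counts.
values = np.array([4.0, 9.0, 16.0, 25.0])
variances = np.array([4.0, 9.0, 16.0, 25.0])
sigmas = np.sqrt(variances)

# Merging adjacent bins pairwise: values and variances simply add.
merged_values = values.reshape(-1, 2).sum(axis=1)
merged_variances = variances.reshape(-1, 2).sum(axis=1)

# Sigmas must be squared, summed, and square-rooted again.
merged_sigmas = np.sqrt((sigmas.reshape(-1, 2) ** 2).sum(axis=1))

# Adding sigmas directly gives a different (too large) result.
naive_sigmas = sigmas.reshape(-1, 2).sum(axis=1)
```

Starting from sigmas costs a square and a square root per merge, while starting from variances needs only the sum.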

Boost.Histogram in C++ therefore always returns variances and not sigmas. In C++, this choice is rather clearly dictated by the strong preference for solutions that do minimal work to achieve some goal ("zero overhead" principle).
