Description
gh-13312 introduced a decorator that makes it easy to add nan_policy and axis parameters to almost all reduction stats functions (those that "consume" samples along an axis), that is, almost all "summary statistics", "statistical tests", and - if we want - "correlation functions". If we are happy with the decorator, with minor modification it could handle masked arrays by manually eliminating masked elements as each axis-slice is processed, which in many cases would be faster than the existing mstats implementations. (Done.)
The purpose of this issue is to track progress toward a (much) more consistent axis/nan_policy experience throughout scipy.stats and toward adding masked array support to scipy.stats functions. Consistent handling of zero-length slices and empty input is also in scope.
This issue is geared toward reduction functions. When those are in better shape, perhaps this can be used to track other functions (e.g. see gh-8669).
For now, I'll just link to a spreadsheet I started that summarizes the status of nan_policy and axis support for stats functions.
https://docs.google.com/spreadsheets/d/1yBhu3Ihy9_xhDh5N9GgoNVyFbUHKWvt_lCuHm5BZ06I/edit?usp=sharing
Should this be converted to markdown? Should I allow anyone to edit the spreadsheet? Or can you think of a better way to track this @tupui? (agreed to stick with the spreadsheet for now)
Other improvements we discussed for the _axis_nan_policy decorator:
- make it easier to apply to functions with a number of outputs other than two (notes for that written as comments in ENH: stats: add axis and nan_policy parameters to functions with decorator #13312 and preserved locally)
- generalize to handle axis tuples (ENH: stats: add axis tuple support to _axis_nan_policy_factory decorators #15257)
- generalize to deal with masked arrays (ENH: stats: add masked array support to _axis_nan_policy_factory decorators #15239)
- add keepdims to all reduction functions
- add dtype to all functions? No. Inputs should be converted to the lowest common dtype, and that dtype should be preserved throughout the calculation. Ensuring that this behavior is correct is separate from this issue and is easier to handle when doing array API translations.
- Ensure that single-precision (and other relevant data types) are still supported (e.g. issue reported in ENH: optimize: maintain user dtype #15602)
- Enforce output dtype (see ENH: stats: add axis tuple and nan_policy to sem and iqr #17971)
- explore the possibility of finding all NaNs in an array at once rather than once per axis-slice. This would probably require using np.ndenumerate instead of the np.apply_along_axis approach
- Check that tutorials explain the axis and nan_policy arguments well so that individual functions do not need to document the common behavior in great detail.
- Should a warning about invalid values ever be raised? If so, make it consistent.
- Link to the correct source when a user clicks "Source" from the documentation of a function wrapped by the _axis_nan_policy decorator (DOC, MAINT: fix links to wrapped functions and SciPy's distributions #15637)
- make sure the documentation of each function is clear about whether data must be real and that a warning or error is raised if unsupported data is passed
- Allow the wrapped function to perform input validation even when the decorator produces the result? (e.g. empty output, all NaNs).
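To make the axis-tuple and keepdims items above concrete, here is a minimal sketch of one way a tuple of axes could be reduced to the single-axis case. It is not the code from gh-15257; `reduce_over_axes` is a hypothetical helper, and it assumes `func` accepts an `axis` keyword the way `np.sum` or `np.mean` do.

```python
import numpy as np


def reduce_over_axes(func, x, axes, keepdims=False):
    """Hypothetical helper: reduce `x` over several axes with a single-axis `func`."""
    x = np.asarray(x)
    if len(axes) == 0:
        raise ValueError("`axes` must contain at least one axis")
    for a in axes:
        if not -x.ndim <= a < x.ndim:
            raise ValueError(f"axis {a} is out of bounds for a {x.ndim}-D array")
    # Normalize negative axes so duplicates can be detected.
    axes = tuple(a % x.ndim for a in axes)
    if len(set(axes)) != len(axes):
        raise ValueError("`axes` must not contain repeated axes")

    n = len(axes)
    # Move the axes being reduced to the end, then merge them into one axis.
    moved = np.moveaxis(x, axes, tuple(range(-n, 0)))
    flat = moved.reshape(moved.shape[:-n] + (-1,))

    # The multi-axis reduction is now an ordinary reduction over the last axis.
    out = np.asarray(func(flat, axis=-1))
    if keepdims:
        # Reinsert length-1 axes at the original positions.
        out = np.expand_dims(out, axes)
    return out
```

For example, `reduce_over_axes(np.sum, x, (0, 2))` on a 3-D array should agree with `x.sum(axis=(0, 2))`. The reshape may copy the data when the moved array is not contiguous, which is one cost of this approach.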
Related issues/PRs:
gh-2178
gh-2324
gh-4086
gh-5432
gh-5474 (would be irrelevant if all stats functions accept masked arrays)
gh-6416
gh-6551
gh-6654
gh-7178
gh-7342
gh-9307
gh-9558
gh-9409
gh-9252
gh-11790
gh-11409
gh-11355
gh-12143
gh-12241
gh-12548
gh-12916
gh-13223
gh-13215
gh-13844
gh-13900
gh-14421
gh-14651
gh-14725
gh-15375
gh-15630
gh-15660
gh-17154
gh-17288
gh-19039
Other information:
The decorator does some introspection to check whether axis and nan_policy are already parameters. Whichever of the two is already accepted continues to be accepted in the same way (as a positional or keyword argument, however it is currently allowed). For whichever is not already accepted, the decorator adds a keyword-only argument and updates the documentation.
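The introspection step can be illustrated with the standard `inspect` module. This is only a sketch of the idea, not the decorator's actual code; `extended_signature`, `stat_a`, and `stat_b` are hypothetical names introduced for the example.

```python
import inspect


def extended_signature(func, defaults):
    """Hypothetical helper: add missing names from `defaults` as keyword-only parameters."""
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    # Only add parameters the function does not already accept.
    new = [
        inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY, default=default)
        for name, default in defaults.items()
        if name not in sig.parameters
    ]
    # Keyword-only parameters must come before a trailing **kwargs, if any.
    if params and params[-1].kind is inspect.Parameter.VAR_KEYWORD:
        params = params[:-1] + new + params[-1:]
    else:
        params = params + new
    return sig.replace(parameters=params)


def stat_a(x, y):  # accepts neither axis nor nan_policy
    ...


def stat_b(x, axis=0):  # already accepts axis positionally or by keyword
    ...


common = {"axis": 0, "nan_policy": "propagate"}
sig_a = extended_signature(stat_a, common)
sig_b = extended_signature(stat_b, common)
```

In this sketch `stat_b` keeps its existing positional-or-keyword `axis`, and only `nan_policy` is added as keyword-only, which mirrors the behavior described above.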
If axis is already a parameter, the decorator uses that existing behavior when there are no NaNs. That is, it takes advantage of the function's native vectorization for efficiency when possible. If there are NaNs or if axis is not already a parameter, it uses np.apply_along_axis to loop over the axis-slices. It overrides any existing nan_policy behavior, manually removing NaNs from each axis-slice instead of using masked arrays.
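The dispatch described above might look roughly like the following for a one-sample reduction. This is a simplified sketch under stated assumptions, not the decorator's implementation: `reduce_with_nan_policy` and its `natively_vectorized` flag are hypothetical, `func` is assumed to accept a 1-D array (and an `axis` keyword when vectorized), and the handling of 'propagate' and 'raise' is my assumption about reasonable behavior.

```python
import numpy as np


def reduce_with_nan_policy(func, x, axis=0, nan_policy="propagate",
                           natively_vectorized=True):
    """Hypothetical sketch of the fast-path / slice-loop dispatch."""
    if nan_policy not in {"propagate", "omit", "raise"}:
        raise ValueError("nan_policy must be 'propagate', 'omit', or 'raise'")
    x = np.asarray(x, dtype=float)
    has_nan = np.isnan(x).any()

    if has_nan and nan_policy == "raise":
        raise ValueError("The input contains nan values")

    # Fast path: no NaNs, so the function's own vectorization can be used.
    if natively_vectorized and not has_nan:
        return func(x, axis=axis)

    def reduce_slice(slice_1d):
        nan_mask = np.isnan(slice_1d)
        if nan_policy == "omit":
            # Manually drop NaNs from this slice; no masked arrays involved.
            slice_1d = slice_1d[~nan_mask]
        elif nan_mask.any():
            # 'propagate': a slice containing NaN yields NaN (assumed behavior).
            return np.nan
        return func(slice_1d)

    # Slow path: loop over the 1-D axis-slices.
    return np.apply_along_axis(reduce_slice, axis, x)
```

For instance, with `x = np.array([[1., 2., np.nan], [4., 5., 6.]])`, calling `reduce_with_nan_policy(np.mean, x, axis=1, nan_policy="omit")` takes the slice loop and averages only the non-NaN entries of each row. Note that the sketch checks for NaNs once up front and again per slice, which is the redundancy the "find all NaNs at once" item in the list above is about.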