ENH: stats: consistent `nan_policy`, `axis`, masked array, and dtype support

gh-13312 introduced a decorator that makes it easy to add `nan_policy` and `axis` parameters to almost all reduction stats functions (those that "consume" samples along an axis), that is, almost all "summary statistics", "statistical tests", and - if we want - "correlation functions". ~~If we are happy with the decorator, with minor modification it could handle masked arrays by manually eliminating masked elements as each axis-slice is processed, which in many cases would be faster than the existing `mstats` implementations.~~(Done.)

The purpose of this issue is to track progress toward a (much) more consistent `axis`/`nan_policy` experience throughout `scipy.stats` and toward adding masked array support to `scipy.stats` functions. Consistent handling of zero-length slices and empty input is also in scope.

This issue is geared toward reduction functions. When those are in better shape, perhaps this can be used to track other functions (e.g. see gh-8669).

For now, I'll just link to a spreadsheet I started that summarizes the status of `nan_policy` and `axis` support for stats functions. 
https://docs.google.com/spreadsheets/d/1yBhu3Ihy9_xhDh5N9GgoNVyFbUHKWvt_lCuHm5BZ06I/edit?usp=sharing

~~Should this be converted to markdown? Should I allow anyone to edit the spreadsheet? Or can you think of a better way to track this @tupui?~~ (agreed to stick with the spreadsheet for now)

Other improvements we discussed for the `_axis_nan_policy` decorator:
- [x] make it easier to apply to functions with number of outputs other than two (notes for that written as comments in gh-13312 and preserved locally)
- [x] generalize to handle axis tuples (gh-15257).
- [x] generalize to deal with masked arrays (gh-15239)
- [x] add `keepdims` to all reduction functions
- [ ] ~~add `dtype` to all functions?~~ No. Inputs should be converted to the lowest common dtype, which should then be preserved throughout the calculation. Ensuring that this behavior is correct is separate from this issue and is easier to handle when doing array API translations.
- [x] Ensure that single-precision (and other relevant data types) are still supported (e.g. issue reported in gh-15602)
- [ ] Enforce output dtype (see gh-17971)
- [ ] explore the possibility of finding all NaNs in an array at once rather than once per axis-slice. This would probably require using `np.ndenumerate` instead of the `np.apply_along_axis` approach
- [ ] Check that tutorials explain `axis` and `nan_policy` arguments well so that individual functions do not need to document the common behavior in great detail.
- [ ] Should a warning about invalid values ever be raised? If so, make it consistent.
- [x] Link to the correct source when a user clicks "Source" in the documentation of a function wrapped by the `_axis_nan_policy` decorator (gh-15637)
- [ ] make sure documentation of each function is clear about whether data must be real and that a warning or error is raised if unsupported data is passed
- [ ] Allow the wrapped function to perform input validation even when the decorator produces the result? (e.g. empty output, all NaNs).
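
One possible reading of the open item about finding all NaNs at once, sketched with plain NumPy: build the NaN mask in a single pass, then send only the affected axis-slices down the slow per-slice path. This uses `np.isnan(...).any(axis=...)` rather than `np.ndenumerate`, and the helper name `slices_with_nans` is hypothetical; it is not how the decorator currently works.

```python
import numpy as np


def slices_with_nans(x, axis):
    """Boolean array marking which axis-slices contain at least one NaN."""
    return np.isnan(x).any(axis=axis)


x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])

# One pass over the whole array instead of one check per slice.
has_nan = slices_with_nans(x, axis=1)

# Clean slices go through a single vectorized reduction;
# only the flagged slices need per-slice NaN removal.
result = np.empty(x.shape[0])
result[~has_nan] = x[~has_nan].mean(axis=1)
result[has_nan] = [np.mean(row[~np.isnan(row)]) for row in x[has_nan]]
print(result)
```

The example is limited to a 2-D array reduced along `axis=1`; a general version would first move the reduction axis to the end.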

Related issues/PRs:
gh-2178
gh-2324
gh-4086
gh-5432
gh-5474 (would be irrelevant if all `stats` functions accept masked arrays)
gh-6416
gh-6551
gh-6654
gh-7178
gh-7342
gh-9307
gh-9558
gh-9409
gh-9252
gh-11790
gh-11409
gh-11355
gh-12143
gh-12241
gh-12548
gh-12916
gh-13223
gh-13215
gh-13844
gh-13900
gh-14421
gh-14651
gh-14725
gh-15375
gh-15630
gh-15660
gh-17154
gh-17288
gh-19039

Other information:
The decorator does some introspection to check whether `axis` and `nan_policy` are already parameters. Whichever is already accepted continues to be accepted exactly as it is currently allowed (as a positional or keyword argument). For whichever is _not_ already accepted, the decorator adds a keyword-only argument and updates the documentation.
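
A minimal sketch of that kind of introspection, using only the standard-library `inspect` module. The helper `missing_params` and the two example functions are hypothetical and are not taken from the SciPy source.

```python
import inspect


def missing_params(func):
    """Return which of `axis` and `nan_policy` the function does not accept yet."""
    params = inspect.signature(func).parameters
    return [name for name in ("axis", "nan_policy") if name not in params]


# Hypothetical reduction functions, for illustration only.
def has_axis(x, axis=0):
    return x


def has_neither(x):
    return x


print(missing_params(has_axis))     # parameters a wrapper would need to add
print(missing_params(has_neither))
```

A wrapper could add each missing name as a keyword-only argument and leave the existing ones untouched.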

If `axis` is already a parameter, the decorator uses that existing behavior when there are no NaNs. That is, it takes advantage of the function's native vectorization for efficiency when possible. If there are NaNs or if `axis` is not already a parameter, it uses `np.apply_along_axis` to loop over the axis-slices. It overrides any existing `nan_policy` behavior, manually removing NaNs from each axis-slice instead of using masked arrays.
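
The slow path can be sketched roughly as follows. This is a simplified illustration, not the actual `_axis_nan_policy` implementation: the wrapper name `omit_nans_along_axis` and the reducer `slice_mean` are hypothetical, only a single array argument is handled, and the native-vectorization fast path is left out.

```python
import functools

import numpy as np


def omit_nans_along_axis(func):
    """Hypothetical wrapper: apply a 1-D reducer along `axis`, handling NaNs."""
    @functools.wraps(func)
    def wrapper(x, *, axis=0, nan_policy="propagate"):
        x = np.asarray(x, dtype=float)
        if nan_policy == "raise" and np.isnan(x).any():
            raise ValueError("The input contains nan values")

        def reduce_slice(s):
            # Remove NaNs from this axis-slice only when asked to omit them.
            if nan_policy == "omit":
                s = s[~np.isnan(s)]
            return func(s)

        # Loop over the axis-slices; the reduced axis is dropped from the output.
        return np.apply_along_axis(reduce_slice, axis, x)
    return wrapper


@omit_nans_along_axis
def slice_mean(s):
    return np.mean(s)  # reducer that only needs to understand 1-D input


x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])
print(slice_mean(x, axis=1, nan_policy="omit"))
print(slice_mean(x, axis=1, nan_policy="propagate"))
```

With `nan_policy="omit"` the first row is reduced over its two finite values; with `"propagate"` the NaN passes through to the result for that row.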
