
Conversation

@mdhaber
Contributor

@mdhaber mdhaber commented Aug 28, 2021

Reference issue

gh-14651

What does this implement/fix?

This adds parameters axis and nan_policy to gmean.

Note: initially, only added to gmean. There is so much variety in what these functions currently do (e.g. with small samples, when nd-arrays are passed without axis, etc.) that we really should do these one at a time.

A few important things to think about with functions like these are 1) whether applying the decorator breaks backward compatibility, 2) whether applying the decorator reduces the speed considerably, and 3) whether the wrapper's implementation of nan_policy='propagate' makes sense for the function. Some specific things to look for:

  1. If the function was not vectorized before, did it ravel the input before applying the function? (If that's the case, then perhaps the new default_axis should be None, and perhaps this should be deprecated so we can start using default_axis=0 like most functions.)
  2. If the function was natively vectorized before, does the wrapper take advantage of that native vectorization when possible? (E.g. add breakpoints to check, or check execution time before and after wrapper.)
  3. Does the function still raise errors like it used to (e.g. when there are too few observations)? Do we want it to, or do we want it to return NaNs?
  4. If the function did not implement nan_policy, what was the behavior in case of NaNs? Did it propagate NaNs correctly, and does the new default nan_policy='propagate' propagate NaNs correctly?
  5. If the function accepted positional axis and nan_policy arguments before, does it still? Has the documentation been updated correctly?

@mdhaber
Contributor Author

mdhaber commented Aug 29, 2021

Here is a script that helps check some of the items in the list above.

Details
import numpy as np
from scipy import stats

fun = stats.gmean

def try_case(x, message, kwds):
    try:
        print(message)
        print(fun(x, **kwds))
    except Exception as e:
        print(f"{type(e)}: {str(e)}")

    print("----")

random_array = np.random.rand(4, 6)
cases = [
         ([], 'Zero Observations:', {}),
         ([1.], 'One Observation:', {}),
         ([1., 2.], 'Two Observations', {}),
         (random_array, 'Array, no axis', {}),
         (random_array, 'Array, axis=0', {'axis': 0}),
         (random_array, 'Array, axis=1', {'axis': 1}),
         ([np.nan, 1., 2., 3.], 'No nan_policy', {}),
         ([np.nan, 1., 2., 3.], 'raise', {'nan_policy': 'raise'}),
         ([np.nan, 1., 2., 3.], 'propagate', {'nan_policy': 'propagate'}),
         ([np.nan, 1., 2., 3.], 'omit', {'nan_policy': 'omit'}),
         ([np.nan, np.nan, np.nan], 'omit empty', {'nan_policy': 'omit'}),
         ([np.nan, np.nan, 1.], 'omit all but one', {'nan_policy': 'omit'}),
         ]

for case in cases:
    try_case(*case)

In master, the output for gmean is:

Details
Zero Observations:
nan
----
One Observation:
1.0
----
Two Observations
1.414213562373095
----
Array, no axis
[0.34647615 0.64403765 0.4425809  0.49827879 0.40714294 0.32474933]
----
Array, axis=0
[0.34647615 0.64403765 0.4425809  0.49827879 0.40714294 0.32474933]
----
Array, axis=1
[0.63684046 0.24778452 0.47081871 0.46910639]
----
No nan_policy
nan
----
raise
<class 'TypeError'>: gmean() got an unexpected keyword argument 'nan_policy'
----
propagate
<class 'TypeError'>: gmean() got an unexpected keyword argument 'nan_policy'
----
omit
<class 'TypeError'>: gmean() got an unexpected keyword argument 'nan_policy'
----
omit empty
<class 'TypeError'>: gmean() got an unexpected keyword argument 'nan_policy'
----
omit all but one
<class 'TypeError'>: gmean() got an unexpected keyword argument 'nan_policy'
----

In this branch, the output for gmean is:

Details
Zero Observations:
nan
----
One Observation:
1.0
----
Two Observations
1.414213562373095
----
Array, no axis
[0.32375847 0.24241377 0.49835084 0.26744938 0.57350468 0.41571854]
----
Array, axis=0
[0.32375847 0.24241377 0.49835084 0.26744938 0.57350468 0.41571854]
----
Array, axis=1
[0.18033934 0.47217042 0.43996559 0.49089253]
----
No nan_policy
nan
----
raise
<class 'ValueError'>: The input contains nan values
----
propagate
nan
----
omit
1.8171205928321397
----
omit empty
nan
----
omit all but one
1.0
----
The answers to the questions above:
  1. The function was vectorized before, and it seems to work as expected. The default axis appears to be 0.
  2. I ran fun(random_array) and checked inside gmean. The full array is passed in with axis=0.
  3. The function did not raise any desirable errors for small samples.
  4. NaNs appeared to be propagated before, and it seems that returning nan is the right behavior for nan_policy='propagate' when there are NaNs in the input.
  5. fun(random_array, 0) and fun(random_array, 1) work as expected after the decorator is applied. In the updated documentation, axis is still the second argument, and nan_policy has been appended as the last.

For examples of the sort of variety we're going to experience - and why we should go through these carefully - try this script on the following functions.

  • hmean - currently raises a ValueError: Harmonic mean only defined if all elements greater than or equal to zero when there are NaNs. I think in this case the default nan_policy should be None, we deprecate the use without explicit nan_policy, and then we can drop the deprecation warning and change the default to 'propagate' like everything else in a few versions.
  • kstat/kstatvar - raise an error for empty input, yet produce nan for one observation (and two, in the case of kstatvar). They ravel 2D input.
  • tmean and tsem - have an axis argument but ignore it, and always ravel the input.
  • entropy - returns 0 for empty input. I doubt that should be considered correct.
  • many functions for which axis and nan_policy are already implemented: they return masked arrays as output when the input has NaNs and nan_policy='omit', and return a weird MaskedConstant when the input is empty after omitting NaNs.

@mdhaber mdhaber requested a review from tupui August 31, 2021 03:13
Member

@tupui tupui left a comment


LGTM, another alternative could have been to use a list of indices for the output. But declaring this using lambda makes it also clear. I don't have strong opinion here so not requesting anything.

@mdhaber mdhaber changed the title from "ENH: stats: add axis and nan_policy to single-output summary statistics" to "ENH: stats: add axis and nan_policy to gmean" Aug 31, 2021
@mdhaber
Contributor Author

mdhaber commented Aug 31, 2021

@Kai-Striega @V0lantis Look good to you? (See next comment.)

I think I'll just go straight down the list. Next up would be hmean, and the funny thing we'll have to decide on there is that it currently returns a ValueError when passed NaNs. Do you agree with the suggestion:

I think in this case the default nan_policy should be None, we deprecate the use without explicit nan_policy, and then we can drop the deprecation warning and change the default to 'propagate' like everything else in a few versions.

Or can we do something simpler?

@mdhaber
Contributor Author

mdhaber commented Aug 31, 2021

I just noticed something else we need to look out for: existing behavior with masked arrays. gmean and hmean work with masked arrays.

[screenshot: gmean and hmean called on masked arrays]

Looks like the decorator needs to ignore masked elements of masked arrays before we can merge this.

@mdhaber
Contributor Author

mdhaber commented Sep 1, 2021

@tupui @Kai-Striega @tirthasheshpatel @V0lantis A few options for how the decorator might handle masked arrays come to mind:

  1. Try to actually work with the masked arrays, if the function supports it. (Of course, we would then also have to implement another strategy to accommodate functions that don't work with masked arrays.)
  2. Separate the masked array's mask from the data and remove data from each axis slice as indicated by the mask
  3. Replace masked elements with a sentinel value, and omit occurrences of this sentinel value from the calculation (exactly as we currently omit NaN when nan_policy='omit').

I don't really want to do 1. Part of this effort is to move away from the use of masked arrays in the internals of scipy.stats. (Let me know if I need to justify this.) In any case, there aren't too many regular stats functions (ignoring mstats) that do work directly with masked arrays, so I don't think it's worth the effort to make the decorator pass them through to the function, at least not in this PR.

2 is OK, but then we'd be dragging a mask around in parallel with the data array. I think this would be a bit more work than option 3. The trouble with 3 is that if nan_policy='propagate', we can't use np.nan as the sentinel value - we have to propagate the NaNs but omit the masked elements. Instead, we can write some code to find a value that is not already in the data, and use that as the sentinel value. (Again, occurrences of this sentinel value would be omitted from each 1D axis-slice before performing the calculation.)
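To make option 3 concrete, here is a minimal sketch of the replace-then-omit pipeline. This is a hypothetical illustration, not the PR's implementation, and the sentinel choice is simplified: it assumes np.inf does not occur in the data.

```python
import numpy as np

# Option 3 in miniature: replace masked elements with a sentinel value,
# then omit the sentinel from the 1D axis-slice before computing.
sentinel = np.inf  # simplifying assumption: np.inf is not in the data
x = np.ma.masked_array([1., 2., 3., 4.], mask=[False, True, False, False])

data = x.data.copy()
data[x.mask] = sentinel          # masked elements -> sentinel
slice_ = data[data != sentinel]  # omit sentinel values

gmean = slice_.prod() ** (1 / slice_.size)  # geometric mean of [1., 3., 4.]
print(gmean)
```

The point is that the same omission machinery already used for nan_policy='omit' handles the sentinel values, so the mask never needs to travel alongside the data.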

Let me know what you think!

@tupui
Member

tupui commented Sep 1, 2021

I think both options 2 and 3 would be similar, no? Being a decorator, we can pre-process the data beforehand and have it back afterward the way we want. I would lean more toward option 2 than 3, I think; it just seems to introduce less code than managing a fictive value.
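The pre-/post-processing idea reads, in skeletal form, something like the following. The names here are hypothetical and the real _axis_nan_policy decorator is far more general (axes, multiple samples, masked arrays); this just shows the shape of the wrapper.

```python
import functools
import numpy as np

def nan_policy_deco(func):
    # Skeletal sketch: pre-process the sample according to nan_policy,
    # then hand the cleaned data to the wrapped function.
    @functools.wraps(func)
    def wrapper(sample, *args, nan_policy='propagate', **kwds):
        sample = np.asarray(sample, dtype=float)
        if np.isnan(sample).any():
            if nan_policy == 'raise':
                raise ValueError("The input contains nan values")
            if nan_policy == 'omit':
                sample = sample[~np.isnan(sample)]
            # 'propagate': leave the NaNs in and let them flow through
        return func(sample, *args, **kwds)
    return wrapper

@nan_policy_deco
def mean(sample):
    return sample.mean()

print(mean([1., np.nan, 3.], nan_policy='omit'))  # 2.0
```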

Contributor

@V0lantis V0lantis left a comment


@Kai-Striega @V0lantis Look good to you? (See next comment.)

I think I'll just go straight down the list. Next up would be hmean, and the funny thing we'll have to decide on there is that it currently returns a ValueError when passed NaNs. Do you agree with the suggestion:

I think in this case the default nan_policy should be None, we deprecate the use without explicit nan_policy, and then we can drop the deprecation warning and change the default to 'propagate' like everything else in a few versions.

Or can we do something simpler?

Yeah, I think this is the way to go. We could also go directly with the default nan_policy='propagate', raising a warning so as not to surprise anyone, and see if there are any complaints, but that might be a little harsh on those who rely on the current code.

Also, I tested the vectorization of kruskal locally, so that I could better understand how the whole code works.
I had to pad all the arrays with np.nan. Might there be a future upgrade where we could pass X and Y directly to kruskal (or any other function supporting the axis param)?

import numpy as np
from scipy import stats

# Need to define each row with np.array, to be able to assign directly to the concatenated array
X = [
    np.array([1, 3, 5, 7, 9]),
    np.array([1]),
    np.array([1, 1, 1, 2]),
    np.array([1, 1, 1]),
]
Y = [
    np.array([2, 4, 6, 8, 10]),
    np.array([1, 2]),
    np.array([2, 2, 2, 2]),
    np.array([2, 2, 2]),
]

def broadcasting_kruskal():
    x = np.array([1, 3, 5, 7, 9], )
    y = np.array([2, 4, 6, 8, 10], )
    X_concatenate = np.concatenate(
        (x.reshape(1, -1), np.full((len(X) - 1, x.shape[0]), np.nan)))
    Y_concatenate = np.concatenate(
        (y.reshape(1, -1), np.full((len(X) - 1, y.shape[0]), np.nan)))
    for index, (a, b) in enumerate(zip(X[1:], Y[1:])):
        X_concatenate[index + 1, :a.shape[0]] = a
        Y_concatenate[index + 1, :b.shape[0]] = b

    return stats.kruskal(X_concatenate, Y_concatenate,
                         nan_policy="omit", axis=1)

def kruskcal_with_for_loop():
    for a, b in zip(X, Y):
        yield stats.kruskal(a, b)


print(broadcasting_kruskal())
print(list(kruskcal_with_for_loop()))

Is this how it is supposed to work?

Unfortunately, I noticed that the vectorized solution is slower than the for ... in loop:

%timeit broadcasting_kruskal()
1.92 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit kruskcal_with_for_loop()
219 ns ± 6.93 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Does it have something to do with np.ndenumerate, which you mention in your comment in gh-14651?

@mdhaber
Contributor Author

mdhaber commented Sep 1, 2021

Stats functions that are vectorized are not intended to support jagged arrays. You can pass in just x and y - as long as all rows have the same number of elements.
If you want an example that will work, do

x = np.random.rand(4, 6)
y = np.random.rand(4, 7)
stats.kruskal(x, y, axis=1)

If you want to do computations with jagged arrays, you should use a for loop. You can insert NaNs and use nan_policy='omit', but I can't guarantee that will be as fast as a loop, especially if you include the time it takes to prepare the arrays.

@Kai-Striega
Member

I'm a little bit behind with the discussion so I'll try to address a few things now and then catch up tomorrow

I don't really want to do 1. Part of this effort is to move away from the use of masked arrays in the internals of scipy.stats. (Let me know if I need to justify this.)

+1

The trouble with 3 is that if nan_policy='propagate', we can't use np.nan as the sentinel value - we have to propagate the NaNs but omit the masked elements. Instead, we can write some code to find a value that is not already in the data, and use that as the sentinel value.

Wouldn't this require a significant performance cost for anything except very small arrays?

Although I don't know if it's relevant to this particular PR, the decorator still uses statistic and pvalue fields internally. If we want to generalize this to more functions, perhaps it would be clearer to use more generic names?

@mdhaber
Contributor Author

mdhaber commented Sep 2, 2021

Wouldn't this require a significant performance cost for anything except very small arrays?

Why? Or, what part of it?

If there are no NaNs and no masks, then there is no overhead because this stuff can be skipped.

There are many strategies I can think of which would have some small overhead (which is a price we are already paying for the huge improvement in scipy.stats interface).

For finding a valid sentinel value, for example: if there are no np.infs in the data, use np.inf. If there are np.infs, take the max of the finite values of the data and use the next largest value representable by the dtype as the sentinel. If this fails (if the data already contains the largest number representable as a float), choose random values until one is found that is not in the data set. If the dataset contains all possible values of the given type, yeah, this won't work : )

For applying the sentinel value, there is not much cost, e.g. sample[sample.mask] = sentinel.

For eliminating sentinel values, there is not much more overhead than we already have to check for nans and remove those.
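The sentinel search described above might look something like this. This is a hypothetical sketch with made-up names, not SciPy code; the random fallback is the last-resort branch mentioned above.

```python
import numpy as np

def find_sentinel(a, rng=None):
    # Hypothetical sketch of the sentinel search strategy described above.
    if not np.isinf(a).any():
        return np.inf  # cheapest case: inf is free to use
    # otherwise, try the next float above the largest finite value
    finite = a[np.isfinite(a)]
    candidate = np.nextafter(finite.max(), np.inf)
    rng = np.random.default_rng() if rng is None else rng
    while candidate in a:
        # last resort: random guesses until we find a value not in the data
        candidate = rng.uniform(finite.min(), finite.max())
    return candidate
```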

In any case, I think even moderate overhead if there are NaNs/masks is acceptable for the interface improvement, because the overhead can (in principle) be reduced, and the overhead we add is offset by the improvement moving away from masked arrays. Try, for example:

import numpy as np
n = 1000
x = np.random.rand(n)
mask = np.random.rand(n) > 0.5
y = np.ma.masked_array(x, mask=mask)
%timeit np.mean(x[~mask])  # 9.89 µs ± 155 ns per loop
%timeit np.mean(y)  # 21.7 µs ± 185 ns per loop

I think initially we can focus on correctness and test coverage so that we have a really thorough suite if/when we optimize performance.

Although I don't know if it's relevant to this particular PR, the decorator still uses statistic and pvalue fields internally.

I don't think that's true after this PR. Searching the modified _axis_nan_policy.py for "statistic" and "pvalue" has no hits in the code. The documentation mentions these as examples. The tests do still have some references to "statistic" and "pvalue" which may not be appropriate anymore, but I think we can improve these a little bit at a time.

@mdhaber
Contributor Author

mdhaber commented Sep 2, 2021

Thanks for the feedback, everyone. Re @tupui:

I think both options 2 and 3 would be similar no?

The ideas are similar. The code is quite different. Option 2 would require the same sort of modifications as the earlier suggestion to check the inputs for NaNs once (per input) rather than once per axis-slice. For that reason, I'm going to proceed with Option 3, which is much simpler to implement. Later, in an independent PR, we can investigate the one-time check for NaNs (and Option 2 for handling masked arrays) to see if that would be more efficient.

Re @V0lantis:

Unfortunately, I noticed that the vectorized solution is slower than the for ... in loop :

In your time comparison, you did not iterate over your generator; you only instantiated it. The difference in execution times is actually negligible.

>>> %timeit list(kruskcal_with_for_loop())
1.18 ms ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit broadcasting_kruskal()
1.24 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
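The pitfall in miniature: calling a generator function only creates the generator object; the body does not run until it is iterated. A self-contained illustration (work() is a made-up stand-in for a call like stats.kruskal(a, b)):

```python
import timeit

def work():
    # stand-in for an expensive statistical computation
    return sum(i * i for i in range(10_000))

def with_for_loop():
    yield work()

# Creating the generator is essentially free; consuming it does the work.
t_create = timeit.timeit(with_for_loop, number=100)
t_consume = timeit.timeit(lambda: list(with_for_loop()), number=100)
print(t_create, t_consume)
```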

Also, it would have been a little more convenient (and more efficient) to set up the NaN-filled arrays by starting with all-NaN arrays and filling them with the data, like:

    shape = len(X), max(len(x) for x in X)
    X2 = np.full(shape, np.nan)
    Y2 = np.full(shape, np.nan)
    for i in range(len(X)):
        x, y = X[i], Y[i]
        X2[i, :len(x)] = x
        Y2[i, :len(y)] = y

@Kai-Striega
Member

Why? Or, what part of it?

I was thinking that finding a good sentinel value (by repeatedly taking a random number and checking if it's in the array) would be very inefficient. Maybe I was just tired and didn't think of the obvious way. Taking max + 1 (or similar) seems good. Looking at the profiling above, it looks like my suspicions were wrong anyway.

In any case, I think even moderate overhead if there are NaNs/masks is acceptable for the interface improvement, because the overhead can (in principle) be reduced, and the overhead we add is offset by the improvement moving away from masked arrays.

+1

I think initially we can focus on correctness and test coverage so that we have a really thorough suite if/when we optimize performance.

+1

Although I don't know if it's relevant to this particular PR, the decorator still uses statistic and pvalue fields internally.

I don't think that's true after this PR. Searching the modified _axis_nan_policy.py for "statistic" and "pvalue" has no hits in the code. The documentation mentions these as examples. The tests do still have some references to "statistic" and "pvalue" which may not be appropriate anymore, but I think we can improve these a little bit at a time

You're right again. I was experimenting with the code from master on my local machine and it stuck out; then I looked at the PR, saw it in the docs/tests, and thought it would be worth bringing up. The code also still uses hypotest_fun_in or similar in several places.

Going with option 3 looks like a good idea. I think it will be best to wait for the decorator to stabilize before I attempt to apply it to any new functions.

@V0lantis
Contributor

V0lantis commented Sep 5, 2021

Re @V0lantis:

Unfortunately, I noticed that the vectorized solution is slower than the for ... in loop :

In your time comparison, you did not iterate over your generator; you only instantiated it. The difference in execution times is actually negligible.


Thank you, indeed you are right; nicely spotted.

For the rest, I think @Kai-Striega summarized it very well. Looks very good to me.

@tirthasheshpatel
Member

Sorry, it took me some time to get to this. I reviewed both this and #13312.

A few options for how the decorator might handle masked arrays come to mind:

I think option 3 sounds best right now. Option 1 will be much more work and stall an important improvement. Option 2 is good but would require some more code.

I think one way to tackle the problem of the sentinel value being nan when nan_policy is 'propagate' or 'raise': use nan as the sentinel value and call _remove_nans just before calling the function (i.e. hypotest_fun_in). By the time the wrapper reaches the function, all the nan values have been processed; e.g., if nan_policy was 'propagate' and NaNs were present, a nan result object has already been returned. If no NaNs were found, we are free to use nan as the sentinel and then ignore it.

This should also not introduce much overhead and get the work done, I guess.
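A rough sketch of that ordering (names hypothetical; process stands in for the wrapper's pre-processing, and the final line plays the role of _remove_nans):

```python
import numpy as np

def process(sample, nan_policy='propagate'):
    # Hypothetical sketch of the proposed ordering, not the PR's code.
    # 1. Resolve real NaNs according to nan_policy first...
    data = sample.compressed() if isinstance(sample, np.ma.MaskedArray) else sample
    if np.isnan(data).any():
        if nan_policy == 'raise':
            raise ValueError("The input contains nan values")
        if nan_policy == 'propagate':
            return np.nan
    # 2. ...after which nan is unambiguous as the masked-element sentinel.
    if isinstance(sample, np.ma.MaskedArray):
        sample = sample.filled(np.nan)
    # 3. The final _remove_nans step strips sentinels (and, for 'omit', NaNs).
    return sample[~np.isnan(sample)]
```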

Does this work, @mdhaber?

@mdhaber
Contributor Author

mdhaber commented Sep 5, 2021

I think one way to tackle the problem of the sentinel value being nan when nan_policy is 'propagate' or 'raise': use nan as the sentinel value and call _remove_nans just before calling the function (i.e. hypotest_fun_in). By the time the wrapper reaches the function, all the nan values have been processed; e.g., if nan_policy was 'propagate' and NaNs were present, a nan result object has already been returned. If no NaNs were found, we are free to use nan as the sentinel and then ignore it.

Yes, I considered that. But I think that if we're already writing a routine to find/use a separate sentinel value for nan_policy='omit', I think it's conceptually a bit simpler if we use that same sentinel value regardless of nan_policy. There is a bit of overhead in determining an allowable sentinel value and a little overhead to check for this sentinel value separate from nans, but I don't think that's too important right now.

I'll think about it again when I'm working on it. There might be a delay since the school year is starting and I'm trying to get gh-13490 done.

@Kai-Striega
Member

I've been thinking about good ways to find the sentinel value a bit lately. Taking max + 1 seems to do slightly more work than we need to. A simpler approach would be using the built-in np.finfo/np.iinfo and taking the max/min representable value of that dtype. According to the docs these functions are cached for each dtype, meaning they are essentially free to call. They also include the resolution, so it would be simple to go to max - resolution to pick an alternate sentinel.
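That idea in code (a hypothetical sketch; for floats, np.nextafter provides the "max - resolution" step down to the next representable value):

```python
import numpy as np

def dtype_sentinel(a):
    # Hypothetical sketch of the np.finfo/np.iinfo idea above.
    if np.issubdtype(a.dtype, np.floating):
        candidate = np.finfo(a.dtype).max
        while candidate in a:
            # step down by one representable value
            candidate = np.nextafter(candidate, -np.inf)
    else:
        candidate = np.iinfo(a.dtype).max
        while candidate in a:
            candidate -= 1
    return candidate

print(dtype_sentinel(np.array([1., 2., 3.])))  # prints the largest float64
```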

@mdhaber
Contributor Author

mdhaber commented Sep 21, 2021

(School is starting and it's a very busy time, so I've needed to put this on hold a bit.)
Yes, it might even be what's coded in a not-yet-posted commit. You're right: both only fail when the data contains the max possible value, so between the two, using the max possible value as the sentinel can only be faster. I didn't think about that.
On the other hand, I was thinking that the max value of a data type might be a bit more likely to occur than a random value of similar, high magnitude. And ultimately, if all the fixed sentinel values we have in mind are already in the data, we need something non-deterministic to fall back on until we find a value that isn't in the data. So is there any advantage to using a hard-coded sentinel, or should we just start with a random guess?

@Kai-Striega
Member

Kai-Striega commented Sep 22, 2021

(School is starting and it's a very busy time, so I've needed to put this on hold a bit.)

Sorry, I didn't mean to imply that you need to be working on this. I was working on something else, thought of this, and wanted to write it down before my mind moved on to the next thing.

On the other hand, I was thinking that the max value of a data type might be a bit more likely to occur than a random value of a similar, high magnitude.

I didn't think about that. I guess it will be different depending on how the user handles values that are too large.

And ultimately, if all the fixed sentinel values we have in mind are already in the data we need something non-deterministic to fall back on until we find a value that isn't in the data.

I'm not sure that we will need anything non-deterministic; perhaps just using max - eps until we have a value that is [not] included. I don't think the likelihood of the data containing max - 3*eps (or 17, or any other number) is going to be higher than that of any other random number.

@Kai-Striega
Member

The latter would require considerably fewer changes, but is more ad hoc. Can I go that route for now? We're probably going to keep running into these little things as we wrap more functions, and I think it would be much more efficient to get everything wrapped, making ad-hoc changes as necessary, then use what we've learned to clean up the decorator once at the end.

+1 from me. Let's just make sure we keep track of these ad hoc decisions somewhere so that they actually get implemented before this is all wrapped up.

@mdhaber
Contributor Author

mdhaber commented Dec 27, 2021

Let's just make sure we keep track of these ad hoc decisions somewhere

@Kai-Striega Yup, this is a comment in the code:

# Future refactoring idea: no need for callable n_samples.
# Just replace `n_samples` and `kwd_samples` with a single
# list of the names of all samples, and treat all of them
# as `kwd_samples` are treated below.

@Kai-Striega
Member

It's taking me a while to read into this, but I've started now and I should have a full review in the next day or two. Is there anything specific you'd like me to focus on, @mdhaber?

@mdhaber
Contributor Author

mdhaber commented Dec 28, 2021

@Kai-Striega Thanks! Sorry it's complicated. If you're going to look at code, I'd suggest focusing on the changes since @tupui approved. I think the most valuable things would be to check 1-5 above and that gmean now works as you'd expect it to whether the inputs a and weights are 1D or Nd, have NaNs, or are masked arrays.

Another thing you might want to confirm is that the gmean in master is wrong when weights is a masked array and that this PR fixes it.

import numpy as np
from scipy.stats import gmean

rng = np.random.default_rng(0)
a = rng.random(5)
weights = rng.random(5)

mask_a = [True, False, False, False, True]
mask_weights = [False, False, True, False, False]

a_masked = np.ma.masked_array(a, mask=mask_a)
weights_masked = np.ma.masked_array(weights, mask=mask_weights)

# this is the mask that _should_ be applied to both `a` and `weights`
joint_mask = np.logical_or(mask_a, mask_weights)
# see, for example, the behavior of np.ma.average
avg1 = np.ma.average(a_masked, weights=weights_masked)
avg2 = np.average(a[~joint_mask], weights=weights[~joint_mask])
np.testing.assert_allclose(avg1, avg2)

# gmean is wrong in master, fixed by PR
res1 = gmean(a_masked, weights=weights_masked)
res2 = gmean(a[~joint_mask], weights=weights[~joint_mask])
np.testing.assert_allclose(res1, res2)

The problem was that gmean uses np.average instead of np.ma.average internally and there is an issue with np.average when weights is a masked array (numpy/numpy#7330).

@Kai-Striega
Member

Another thing you might want to confirm is that the gmean in master is wrong when weights is a masked array and that this PR fixes it.

It looks like there are still some problems when gmean is passed an array with more than 1 dimension. I've modified your original code snippet to reproduce the error:

import numpy as np
from scipy.stats import gmean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    SIZE = (50, 15)
    P_TRUE = 0.1
    
    a = rng.random(size=SIZE)
    weights = rng.random(size=SIZE)
    mask_a = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])
    mask_weights = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])

    a_masked = np.ma.masked_array(a, mask=mask_a)
    weights_masked = np.ma.masked_array(weights, mask=mask_weights)

    # this is the mask that _should_ be applied to both `a` and `weights`
    joint_mask = np.logical_or(mask_a, mask_weights)

    # see, for example, the behavior of np.ma.average
    avg1 = np.ma.average(a_masked, weights=weights_masked)
    avg2 = np.average(a[~joint_mask], weights=weights[~joint_mask])
    np.testing.assert_allclose(avg1, avg2)

    # gmean is wrong in master, fixed by PR
    res1 = gmean(a_masked, weights=weights_masked)
    res2 = gmean(a[~joint_mask], weights=weights[~joint_mask])
    np.testing.assert_allclose(res1, res2)

@mdhaber
Contributor Author

mdhaber commented Dec 29, 2021

@Kai-Striega a[~joint_mask] removes all the joint-masked elements but ravels the array (because otherwise it would be ragged), so the reference input is buggy. (If we could have ragged arrays, this would be a lot easier!) You need to loop over the rows to get the reference res2 correct.

res1 = gmean(a_masked, weights=weights_masked, axis=-1)
res2 = []
for a_row, weights_row, mask_row in zip(a, weights, joint_mask):
    res2.append(gmean(a_row[~mask_row], weights=weights_row[~mask_row]))
np.testing.assert_allclose(res1, res2)

res1 = gmean(a_masked, weights=weights_masked, axis=0)
res2 = []
for a_row, weights_row, mask_row in zip(a.T, weights.T, joint_mask.T):
    res2.append(gmean(a_row[~mask_row], weights=weights_row[~mask_row]))
np.testing.assert_allclose(res1, res2)

@Kai-Striega
Member

I think this is working very well.

  • I've experimented with the wrapper and it appears to work as expected.
  • I don't know of any errors raised by gmean before, and the wrapper behaves the same.
  • Does this wrapper also apply to the version written in mstats?
  • There does seem to be some overhead to using the decorator on my machine (as is to be expected), could you see if it's the same on your machine?
import timeit

if __name__ == "__main__":
    setup = """
import numpy as np
from scipy.stats import gmean
rng = np.random.default_rng(0)
SIZE = (500_000, 10)
P_TRUE = 0.1

a = rng.random(size=SIZE)
weights = rng.random(size=SIZE)
mask_a = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])
mask_weights = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])

a_masked = np.ma.masked_array(a, mask=mask_a)
weights_masked = np.ma.masked_array(weights, mask=mask_weights)
"""
    stmt = """
res1 = gmean(a_masked, weights=weights_masked)
"""
    time1 = timeit.repeat(stmt=stmt, setup=setup, number=5, repeat=5)
    print(time1)

on my machine this returns [1.7877098960016156, 1.7905666869992274, 1.7628598699957365, 1.760446991000208, 1.7583883370025433] with the decorator and [0.9298376730002929, 0.9042188910025288, 0.8953376209974522, 0.9058904939956847, 1.1350494560028892] without. Please note that this is with number=5 so the reported time should be divided by 5. I know this isn't the most thorough benchmark available, so if you have a better one let me know.

@mdhaber
Contributor Author

mdhaber commented Dec 30, 2021

Does this wrapper also apply to the version written in mstats?

>>> stats.gmean is stats.mstats.gmean
True

So yes, currently. But I'd propose that we deprecate the mstats versions at some point.

There does seem to be some overhead to using the decorator on my machine (as is to be expected),

Yes, there is some. I ran your test using %timeit in master, in master but changing np.average to np.ma.average (to fix the bug), and then in this branch.

master: 163 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
master w/ np.ma.average: 191 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
PR: 260 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For reference, you can also use argument _no_deco=True to bypass the decorator.

>>> %timeit gmean(a_masked, weights=weights_masked, _no_deco=True)
164 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In the past, I found cases where the decorated version was faster than the mstats version (of some functions). The decorator may be a bit slower now, but there are some ideas for optimization once things settle.

@Kai-Striega
Copy link
Member

So yes, currently. But I'd propose that we deprecate the mstats versions at some point.

Do you plan on deprecating more than just gmean as other wrappers are added? If you do, perhaps this should be discussed on the mailing list beforehand?

Yes, there is some. I ran your test using %timeit in master, in master but changing np.average to np.ma.average (to fix the bug), and then in this branch.

Going from 164 to 260 seems like a significant increase. I think some overhead is acceptable (or, at least, unavoidable), but we should keep an eye on it in case it starts to grow.

there are some ideas for optimization once things settle.

+1

@mdhaber
Copy link
Contributor Author

mdhaber commented Dec 30, 2021

If you do, perhaps this should be discussed on the mailing list beforehand?

Of course! I'm not deprecating anything yet. But that sort of change would definitely warrant discussion on the mailing list.

Going from 164 to 260 seems like a significant increase.

I think it's worth it, considering that the output of master's gmean for that test was garbage and the decorator makes it correct!

With the manual fix to gmean (changing np.average to np.ma.average), it's 191 vs 260. (That fix also causes some tests to fail because of the unusual results np.ma.average produces on edge cases, but perhaps with some effort those could be fixed, too.)

But yes, for now, I'm suggesting that we trade some gmean execution time for nan_policy and (IMO more importantly) assurance that gmean's behavior is consistent with other stats functions.

gmean is unusual in that it already had both axis and support for masked arrays. The decorator was originally designed for functions that didn't have either. For those, you get even more benefit for the potential performance cost.

And I'm pretty confident we can optimize the decorator to be faster than any code that uses masked arrays (at least in cases where the number of elements per axis-slice is big compared to the number of axis slices), because masked arrays are so slow. If we manually loop over the columns in your example, e.g.

def f():
    res2 = np.empty(10)
    for i in range(10):
        mask = ~(a_masked[:, i].mask | weights_masked[:, i].mask)
        a_compressed = a_masked.data[:, i][mask]
        weights_compressed = weights_masked.data[:, i][mask]
        res2[i] = gmean(a_compressed, weights=weights_compressed, _no_deco=True)
    return res2

I get

>>> %timeit f()
116 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Looping over the rows is a different story, but of course we could always make the decorator smart and have it make use of the masked array capabilities of gmean if it decides that would be faster.
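To sanity-check that the two strategies agree, here is a toy weighted gmean in plain NumPy (as exp of the weighted mean of logs). This is just a sketch, not scipy's implementation, and for simplicity it shares one mask between the data and the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 10))
w = rng.random((1000, 10))
mask = rng.random((1000, 10)) < 0.1   # shared mask, for simplicity

a_m = np.ma.masked_array(a, mask=mask)
w_m = np.ma.masked_array(w, mask=mask)

# strategy 1: masked-array arithmetic all the way through
res_masked = np.exp(np.ma.average(np.ma.log(a_m), weights=w_m, axis=0))

# strategy 2: compress each axis-slice to plain ndarrays, then compute
res_loop = np.empty(10)
for i in range(10):
    keep = ~mask[:, i]
    res_loop[i] = np.exp(np.average(np.log(a[keep, i]), weights=w[keep, i]))

assert np.allclose(res_masked, res_loop)
```

Timing the two against each other should show the effect discussed above: the per-slice compression avoids masked-array overhead when the slices are long and few.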

@tirthasheshpatel
Copy link
Member

In comment #14657 (comment)

The latter would require considerably fewer changes, but is more ad hoc. Can I go that route for now?

After reviewing the code, I think the second option is better (not just ad hoc) than the first, because it simply works with stats functions that take an arbitrary number of samples (i.e. functions func(*args, ...) where args are samples). I think that isn't possible if we have something like sample_names.

Copy link
Member

@tirthasheshpatel tirthasheshpatel left a comment

I am a little late to this but went through all the changes and comments. Changes in the decorator look good; tests are very strong. I still need to experiment a little but otherwise, it has my approval. As @Kai-Striega has already experimented, feel free to merge once you think this is ready.

@mdhaber
Copy link
Contributor Author

mdhaber commented Dec 30, 2021

I think that isn't possible if we have something like sample_names.

Agreed, and that's one of the reasons it requires fewer changes. There would need to be some sort of special case to handle *args with sample_names. But that could be sample_names=['*args'] for example. So it could still all be handled with one argument.

Right now, we still have to treat special cases. The usual case is that the function takes a fixed number of samples as its first n_samples arguments. One special case is that the number of samples is variable, which I anticipated, and that's why n_samples can be a callable. The second special case is that there are other arguments not among the first n_samples that are also to be treated as samples, kwd_samples. We handle it in two separate arguments only because I didn't notice the second special case when I was starting out : )

I think a single argument, kwd_samples, handles the variable-number-of-samples case much more elegantly than the callable n_samples; it would work with keyword-only samples (if that ever comes up), and it wouldn't require unusual effort to make *args work. There would be fewer lines of code overall, and there would be less to remember when applying the decorator.

Anyway, I'm happy with the approach as-is for now with just a note in the code about the idea:

# Future refactoring idea: no need for callable n_samples.
# Just replace `n_samples` and `kwd_samples` with a single
# list of the names of all samples, and treat all of them
# as `kwd_samples` are treated below.
n_samp = n_samples(kwds)

Thanks!
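For concreteness, here is a toy version of that idea: a single list of sample names, with every sample handled the same way. All names here (`_axis_nan_policy_sketch`, `toy_gmean`) are made up for illustration, this is not the actual decorator, and for simplicity it drops NaNs from each sample independently (the real wrapper must keep paired samples, like `a` and `weights`, aligned):

```python
import functools
import numpy as np

def _axis_nan_policy_sketch(sample_names):
    """Toy decorator: treat the named arguments as samples and, for
    nan_policy='omit', drop NaNs from each sample independently."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, nan_policy='propagate', **kwds):
            # map leading positional arguments onto the sample names
            for name, arg in zip(sample_names, args):
                kwds[name] = arg
            if nan_policy == 'omit':
                for name in sample_names:
                    if kwds.get(name) is not None:
                        x = np.asarray(kwds[name], dtype=float)
                        kwds[name] = x[~np.isnan(x)]
            return func(**kwds)
        return wrapper
    return decorator

@_axis_nan_policy_sketch(sample_names=['a', 'weights'])
def toy_gmean(a, weights=None):
    # geometric mean as exp of the (weighted) mean of logs
    return np.exp(np.average(np.log(a), weights=weights))
```

With nan_policy='omit', toy_gmean([1, np.nan, 4]) reduces to the gmean of [1, 4], i.e. 2.0; with the default 'propagate', the NaN flows through to the result.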

@mdhaber mdhaber changed the title ENH: stats: add axis and nan_policy to gmean ENH: stats: add axis tuple and nan_policy to gmean Dec 30, 2021
@mdhaber
Copy link
Contributor Author

mdhaber commented Dec 31, 2021

@tirthasheshpatel Added that note about matrix objects and copied empty_output as you suggested. Thanks!

@Kai-Striega
Copy link
Member

I think it's worth it, considering that the output of master's gmean for that test was garbage and the decorator makes it correct!

With the manual fix to gmean (changing np.average to np.ma.average), it's 191 vs 260. (That fix also causes some tests to fail because of the unusual results np.ma.average produces on edge cases, but perhaps with some effort those could be fixed, too.)

But yes, for now, I'm suggesting that we trade some gmean execution time for nan_policy and (IMO more importantly) assurance that gmean's behavior is consistent with other stats functions.

Sorry if that came across poorly. I think it's worth it too. I think it came across differently in my original comment which could have been phrased more positively.

And I'm pretty confident we can optimize the decorator to be faster than any code that uses masked arrays (at least in cases where the number of elements per axis-slice is big compared to the number of axis slices), because masked arrays are so slow.

Sounds good. I'd be happy to look at the optimizations, but let's keep that in its own PR (as you previously mentioned).

@mdhaber
Copy link
Contributor Author

mdhaber commented Dec 31, 2021

No offense taken! I'm a little disappointed that it wasn't faster with the wrapper than without, actually, since it had been faster in the past. But yeah, at some point I'll profile this and see what can be sped up. One thing we've talked about is switching from np.vectorize to np.nditer to do the looping. That in itself might not speed things up, but it would make it more natural to compute the NaN and masked-array masks for the whole array rather than calculating them for every axis-slice. Originally np.vectorize seemed like a good idea, but I think np.nditer really might be the way to go.
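A toy illustration of that idea (not scipy's code): compute the NaN mask once for the whole array, then drive a plain loop over the axis slices; the loop body is exactly what an np.nditer-based implementation would iterate:

```python
import numpy as np

def gmean_omit_axis0(a):
    """Sketch of nan_policy='omit' along axis 0: one mask for the
    whole array, then per-column compression to plain ndarrays."""
    a = np.asarray(a, dtype=float)
    nan_mask = np.isnan(a)              # computed once, up front
    out = np.empty(a.shape[1])
    for i in range(a.shape[1]):         # np.nditer could drive this loop
        col = a[~nan_mask[:, i], i]
        out[i] = np.exp(np.mean(np.log(col)))
    return out
```

For example, gmean_omit_axis0([[1.0, 4.0], [4.0, np.nan]]) gives [2.0, 4.0].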

@mdhaber
Copy link
Contributor Author

mdhaber commented Jan 12, 2022

I think this is ready when you are @Kai-Striega @tirthasheshpatel!

Copy link
Member

@tirthasheshpatel tirthasheshpatel left a comment

I had another pass through the changes and they look good. Tests are very comprehensive and strong too. I also had some time to experiment with gmean (mostly checking that it works with arrays of broadcastable shapes and nan_policy='omit'), and it looks good too. There have been no new comments for some time, so I think it is safe to merge. Thanks @mdhaber, @Kai-Striega, @tupui!

@tirthasheshpatel tirthasheshpatel merged commit 465da54 into scipy:main Jan 12, 2022
@mdhaber
Copy link
Contributor Author

mdhaber commented Jan 12, 2022

Thanks, all - and @tupui, too! On to hmean.

@tupui
Copy link
Member

tupui commented Jan 12, 2022

Thanks everyone, good to see this in 😃
