ENH: stats: add axis tuple and nan_policy to gmean
#14657
Conversation
Here is a script that helps check some of the items in the list above.

```python
import numpy as np
from scipy import stats

fun = stats.gmean

def try_case(x, message, kwds):
    try:
        print(message)
        print(fun(x, **kwds))
    except Exception as e:
        print(f"{type(e)}: {str(e)}")
    print("----")

random_array = np.random.rand(4, 6)

cases = [
    ([], 'Zero Observations:', {}),
    ([1.], 'One Observation:', {}),
    ([1., 2.], 'Two Observations', {}),
    (random_array, 'Array, no axis', {}),
    (random_array, 'Array, axis=0', {'axis': 0}),
    (random_array, 'Array, axis=1', {'axis': 1}),
    ([np.nan, 1., 2., 3.], 'No nan_policy', {}),
    ([np.nan, 1., 2., 3.], 'raise', {'nan_policy': 'raise'}),
    ([np.nan, 1., 2., 3.], 'propagate', {'nan_policy': 'propagate'}),
    ([np.nan, 1., 2., 3.], 'omit', {'nan_policy': 'omit'}),
    ([np.nan, np.nan, np.nan], 'omit empty', {'nan_policy': 'omit'}),
    ([np.nan, np.nan, 1.], 'omit all but one', {'nan_policy': 'omit'}),
]

for case in cases:
    try_case(*case)
```

In master, the output is: [...]
In this branch, the output is: [...]
For examples of the sort of variety we're going to experience - and why we should go through these carefully - try this script on the following functions.
tupui
left a comment
LGTM. Another alternative could have been to use a list of indices for the output, but declaring this using a lambda also makes it clear. I don't have a strong opinion here, so I'm not requesting anything.
Title changed: "axis and nan_policy to single-output summary statistics" → "axis and nan_policy to gmean"
Co-authored-by: Pamphile ROY <[email protected]>
I think I'll just go straight down the list. Next up would be `hmean`, and the funny thing we'll have to decide on there is that it currently raises a `ValueError` when passed NaNs. Do you agree with the suggestion: I think in this case the default `nan_policy` should be `None`, we deprecate use without an explicit `nan_policy`, and then in a few versions we can drop the deprecation warning and change the default to `'propagate'` like everything else. Or can we do something simpler?
@tupui @Kai-Striega @tirthasheshpatel @V0lantis A few options for how the decorator might handle masked arrays come to mind:

1. [Keep using masked arrays in the internals.]
2. [Convert to a regular array and carry the mask alongside the data.]
3. [Replace masked elements with a sentinel value and remove them before computing.]
I don't really want to do 1; part of this effort is to move away from the use of masked arrays in the internals of `stats`. 2 is OK, but then we'd be dragging a mask around in parallel with the data array, and I think this would be a bit more work than option 3. The trouble with 3 is that if [the sentinel value we pick also appears in the data, it would be removed along with the masked elements, so we need a reliable way to find a value that isn't there]. Let me know what you think!
I think both options 2 and 3 would be similar, no? Being a decorator, we can pre-process the data beforehand and restore it afterwards however we want. I would be more for option 2 than 3, I think; it just seems to introduce less code than managing a fictitious value.
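For concreteness, here is a minimal sketch of what option 2 could look like for a single function. This is my own illustration, not code from the PR, and the helper name `gmean_masked_option2` is hypothetical:

```python
import numpy as np
from scipy import stats

def gmean_masked_option2(a_masked, axis=-1):
    # Option 2 in a nutshell: carry the data and the mask as two plain ndarrays
    # (no np.ma internally) and compress each axis-slice with its own mask
    # before calling the underlying function.
    data = np.moveaxis(np.ma.getdata(a_masked), axis, -1)
    mask = np.moveaxis(np.ma.getmaskarray(a_masked), axis, -1)
    out = np.empty(data.shape[:-1])
    for idx in np.ndindex(out.shape):
        out[idx] = stats.gmean(data[idx][~mask[idx]])
    return out
```

The extra bookkeeping mentioned above is visible here: every array argument (`weights`, for instance) would need its own mask carried along in the same way.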
V0lantis
left a comment
> @Kai-Striega @V0lantis Look good to you? (See next comment.)

> I think I'll just go straight down the list. Next up would be `hmean`, and the funny thing we'll have to decide on there is that it currently raises a `ValueError` when passed NaNs. Do you agree with the suggestion: I think in this case the default `nan_policy` should be `None`, we deprecate use without an explicit `nan_policy`, and then in a few versions we can drop the deprecation warning and change the default to `'propagate'` like everything else. Or can we do something simpler?
Yeah, I think this is the way to go. We could also go directly to the default nan_policy='propagate', raising a warning so as not to surprise anyone, and see whether there are any complaints, but that might be a little harsh on those who rely on the current behavior.
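To make the proposed transition concrete, here is a minimal sketch of such a deprecation path. It is purely illustrative; the function name, warning text, and behavior are assumptions, not SciPy code:

```python
import warnings
import numpy as np

def hmean_transitional(a, nan_policy=None):
    # Illustrative only: while the deprecation is in effect, calling without an
    # explicit nan_policy warns but keeps today's behavior (error on NaNs).
    # Later, the default could quietly become 'propagate'.
    if nan_policy is None:
        warnings.warn(
            "the default nan_policy will change to 'propagate' in a future "
            "release; pass nan_policy explicitly to silence this warning.",
            DeprecationWarning, stacklevel=2)
        nan_policy = 'raise'
    a = np.asarray(a, dtype=float)
    if nan_policy == 'raise' and np.isnan(a).any():
        raise ValueError("The input contains nan values")
    if nan_policy == 'omit':
        a = a[~np.isnan(a)]
    return len(a) / np.sum(1.0 / a)  # harmonic mean; NaNs propagate otherwise
```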
Also, I tested the vectorization of kruskal locally so that I could better understand how the whole code works.
I had to pad all the arrays with np.nan. Could there be a future upgrade where we could pass X and Y directly to kruskal (or any other function supporting the axis parameter)?
```python
import numpy as np
from scipy import stats

# Need to define each row with np.array to be able to assign directly into the
# concatenated array
X = [
    np.array([1, 3, 5, 7, 9]),
    np.array([1]),
    np.array([1, 1, 1, 2]),
    np.array([1, 1, 1]),
]
Y = [
    np.array([2, 4, 6, 8, 10]),
    np.array([1, 2]),
    np.array([2, 2, 2, 2]),
    np.array([2, 2, 2]),
]

def broadcasting_kruskal():
    x = np.array([1, 3, 5, 7, 9])
    y = np.array([2, 4, 6, 8, 10])
    X_concatenate = np.concatenate(
        (x.reshape(1, -1), np.full((len(X) - 1, x.shape[0]), np.nan)))
    Y_concatenate = np.concatenate(
        (y.reshape(1, -1), np.full((len(X) - 1, y.shape[0]), np.nan)))
    for index, (a, b) in enumerate(zip(X[1:], Y[1:])):
        X_concatenate[index + 1, :a.shape[0]] = a
        Y_concatenate[index + 1, :b.shape[0]] = b
    return stats.kruskal(X_concatenate, Y_concatenate, nan_policy="omit",
                         axis=1)

def kruskcal_with_for_loop():
    for a, b in zip(X, Y):
        yield stats.kruskal(a, b)

print(broadcasting_kruskal())
print(list(kruskcal_with_for_loop()))
```

Is this how it is supposed to work?
Unfortunately, I noticed that the vectorized solution is slower than the `for ... in` loop:

```
%timeit broadcasting_kruskal()
1.92 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit kruskcal_with_for_loop()
219 ns ± 6.93 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Does it have something to do with `np.ndenumerate`, which you mention in your comment in gh-14651?
Stats functions that are vectorized are not intended to support jagged arrays. You can pass in just [...]. If you want to do computations with jagged arrays, you should use a [loop].
I'm a little bit behind with the discussion, so I'll try to address a few things now and then catch up tomorrow.

+1

Wouldn't this require a significant performance cost for anything except very small arrays? Although I don't know if it's relevant to this particular PR, the decorator still uses [...].
Why? Or, what part of it? If there are no NaNs and no masks, then there is no overhead, because this stuff can be skipped. There are many strategies I can think of which would have some small overhead (which is a price we are already paying for the huge improvement in [...]).

- For finding a valid sentinel value, for example: if there are no [...].
- For applying the sentinel value, there is not much cost, e.g. [...].
- For eliminating sentinel values, there is not much more overhead than we already have to check for NaNs and remove those.

In any case, I think even moderate overhead when there are NaNs/masks is acceptable for the interface improvement, because the overhead can (in principle) be reduced, and the overhead we add is offset by the improvement from moving away from masked arrays. Try, for example:

```python
import numpy as np
n = 1000
x = np.random.rand(n)
mask = np.random.rand(n) > 0.5
y = np.ma.masked_array(x, mask=mask)
%timeit np.mean(x[~mask])  # 9.89 µs ± 155 ns per loop
%timeit np.mean(y)         # 21.7 µs ± 185 ns per loop
```

I think initially we can focus on correctness and test coverage so that we have a really thorough suite if/when we optimize performance.
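To illustrate why applying and eliminating a sentinel is cheap, here is a toy example of my own (not the decorator's code):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1000)
mask = rng.random(1000) > 0.5

sentinel = a.max() + 1   # any value known not to occur in the data
b = a.copy()
b[mask] = sentinel       # "applying" the sentinel: a single fancy assignment
kept = b[b != sentinel]  # "eliminating" it: one comparison plus indexing,
                         # about the same work as dropping NaNs
```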
I don't think that's true after this PR. Searching the modified [...]
Thanks for the feedback, everyone. Re @tupui:
The ideas are similar. The code is quite different. Option 2 would require the same sort of modifications as the earlier suggestion to check the inputs for NaNs once (per input) rather than once per axis-slice. For that reason, I'm going to proceed with Option 3, which is much simpler to implement. Later, in an independent PR, we can investigate the one-time check for NaNs (and Option 2 for handling masked arrays) to see if that would be more efficient. Re @V0lantis:
In your time comparison, you did not iterate over your generator; you only instantiated it. The difference in execution times is actually negligible. Also, it would have been a bit more convenient (and more efficient) to set up the NaN-filled arrays by starting with all-NaN arrays and filling them with the data, like:

```python
shape = len(X), max(len(x) for x in X)
X2 = np.full(shape, np.nan)
Y2 = np.full(shape, np.nan)
for i in range(len(X)):
    x, y = X[i], Y[i]
    X2[i, :len(x)] = x
    Y2[i, :len(y)] = y
```
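With those padded arrays, the vectorized call mirrors the one in the earlier script, and a fair timing has to consume the generator. For example (reusing `X2`, `Y2`, and `kruskcal_with_for_loop` from the snippets above):

```python
from scipy import stats

print(stats.kruskal(X2, Y2, nan_policy="omit", axis=1))

# In IPython, time the consumed generator rather than just its creation:
# %timeit list(kruskcal_with_for_loop())
# %timeit stats.kruskal(X2, Y2, nan_policy="omit", axis=1)
```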
I was thinking that finding a good sentinel value (by repeatedly taking a random number and checking whether it's in the array) would be very inefficient. Maybe I was just tired and didn't think of the obvious way. Taking the max + 1 (or similar) seems good. Looking at the profiling above, it looks like my suspicions were wrong anyway.
+1
+1
You're right again. I was experimenting with the code from master on my local machine and it stuck out; then I looked at the PR, saw it in the docs/tests, and thought it would be worth bringing up. The code also still uses [...].

Going with option 3 looks like a good idea. I think it will be best to wait for the decorator to stabilize before I attempt to apply it to any new functions.
Thank you; indeed, you are right, nicely spotted. For the rest, I think @Kai-Striega summarized it very well. Looks very good to me.
Sorry, it took me some time to get to this. I reviewed both this and #13312
I think option 3 sounds best right now. Option 1 will be much more work and stall an important improvement. Option 2 is good but would require some more code. I think one of the ways to tackle the problem of the sentinel value being [...]. This should also not introduce much overhead and get the work done, I guess. Does this work, @mdhaber?
Yes, I considered that. But I think that if we're already writing a routine to find/use a separate sentinel value for [...]. I'll think about it again when I'm working on it. There might be a delay, since the school year is starting and I'm trying to get gh-13490 done.
I've been thinking about good ways to find the sentinel value a bit lately. Taking max + 1 seems to do slightly more work than we need to. A simpler approach would be using the built-in [...].
(School is starting and it's a very busy time, so I've needed to put this on hold a bit.)
Sorry. I didn't mean to imply that you need to be working on this. I was working on something else and thought of this and wanted to write it down before my mind moved onto the next thing.
I didn't think about that. I guess it will be different depending on how the user handles values that are too large.
I'm not sure that we will need anything non-deterministic; perhaps just using max - eps until we have a value that is [not] included. I don't think the likelihood of data containing max - 3*eps (or 17, or any other number) is going to be higher than that of any other random number.
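As a concrete illustration of that deterministic search, a hypothetical helper (my own sketch, not the implementation in `_axis_nan_policy.py`) for floating-point data could look like this:

```python
import numpy as np

def find_sentinel(a, max_tries=100):
    # Start at the largest finite value for the dtype and step down one ulp at
    # a time until we reach a value that does not occur in the data.
    candidate = np.finfo(a.dtype).max
    for _ in range(max_tries):
        if not np.any(a == candidate):
            return candidate
        candidate = np.nextafter(candidate, -np.inf)
    raise RuntimeError("could not find a sentinel value")
```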
+1 from me. Let's just make sure we keep track of these ad hoc decisions somewhere so that they actually get implemented before this is all wrapped up.
@Kai-Striega Yup, this is a comment in the code: scipy/scipy/stats/_axis_nan_policy.py, lines 359 to 362 in 75192e2
It's taking me a while to read into this, but I've started now and should get a full review done in the next day or two. I wanted to check whether there is anything specific you'd like me to focus on, @mdhaber?
@Kai-Striega Thanks! Sorry it's complicated. If you're going to look at code, I'd suggest focusing on the changes since @tupui approved. I think the most valuable things would be to check 1-5 above and that [...]. Another thing you might want to confirm is that the [mask of `weights` is now handled correctly]:

```python
import numpy as np
from scipy.stats import gmean

rng = np.random.default_rng(0)
a = rng.random(5)
weights = rng.random(5)
mask_a = [True, False, False, False, True]
mask_weights = [False, False, True, False, False]
a_masked = np.ma.masked_array(a, mask=mask_a)
weights_masked = np.ma.masked_array(weights, mask=mask_weights)

# this is the mask that _should_ be applied to both `a` and `weights`
joint_mask = np.logical_or(mask_a, mask_weights)

# see, for example, the behavior of np.ma.average
avg1 = np.ma.average(a_masked, weights=weights_masked)
avg2 = np.average(a[~joint_mask], weights=weights[~joint_mask])
np.testing.assert_allclose(avg1, avg2)

# gmean is wrong in master, fixed by PR
res1 = gmean(a_masked, weights=weights_masked)
res2 = gmean(a[~joint_mask], weights=weights[~joint_mask])
np.testing.assert_allclose(res1, res2)
```

The problem was that [the masks of `a` and `weights` were not being combined].
It looks like there are still some problems when [the inputs are 2-D]:

```python
import numpy as np
from scipy.stats import gmean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    SIZE = (50, 15)
    P_TRUE = 0.1
    a = rng.random(size=SIZE)
    weights = rng.random(size=SIZE)
    mask_a = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])
    mask_weights = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])
    a_masked = np.ma.masked_array(a, mask=mask_a)
    weights_masked = np.ma.masked_array(weights, mask=mask_weights)

    # this is the mask that _should_ be applied to both `a` and `weights`
    joint_mask = np.logical_or(mask_a, mask_weights)

    # see, for example, the behavior of np.ma.average
    avg1 = np.ma.average(a_masked, weights=weights_masked)
    avg2 = np.average(a[~joint_mask], weights=weights[~joint_mask])
    np.testing.assert_allclose(avg1, avg2)

    # gmean is wrong in master, fixed by PR
    res1 = gmean(a_masked, weights=weights_masked)
    res2 = gmean(a[~joint_mask], weights=weights[~joint_mask])
    np.testing.assert_allclose(res1, res2)
```
@Kai-Striega [With 2-D input, the reference comparison needs to be done separately for each slice along the reduction axis, e.g.:]

```python
res1 = gmean(a_masked, weights=weights_masked, axis=-1)
res2 = []
for a_row, weights_row, mask_row in zip(a, weights, joint_mask):
    res2.append(gmean(a_row[~mask_row], weights=weights_row[~mask_row]))
np.testing.assert_allclose(res1, res2)

res1 = gmean(a_masked, weights=weights_masked, axis=0)
res2 = []
for a_row, weights_row, mask_row in zip(a.T, weights.T, joint_mask.T):
    res2.append(gmean(a_row[~mask_row], weights=weights_row[~mask_row]))
np.testing.assert_allclose(res1, res2)
```
I think this is working very well.

```python
import timeit

if __name__ == "__main__":
    setup = """
import numpy as np
from scipy.stats import gmean
rng = np.random.default_rng(0)
SIZE = (500_000, 10)
P_TRUE = 0.1
a = rng.random(size=SIZE)
weights = rng.random(size=SIZE)
mask_a = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])
mask_weights = rng.choice([True, False], size=SIZE, p=[P_TRUE, 1 - P_TRUE])
a_masked = np.ma.masked_array(a, mask=mask_a)
weights_masked = np.ma.masked_array(weights, mask=mask_weights)
"""
    stmt = """
res1 = gmean(a_masked, weights=weights_masked)
"""
    time1 = timeit.repeat(stmt=stmt, setup=setup, number=5, repeat=5)
    print(time1)
```

On my machine this returns [...].
So yes, currently. But I'd propose that we deprecate the [...].

Yes, there is some [overhead]. I ran your test using %timeit; in master: 163 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each). For reference, you can also use the argument [`_no_deco=True`, which bypasses the decorator]. In the past, I found cases where the decorated version was faster than the [...].
Do you plan on deprecating more than just [...]?

Going from 164 ms to 260 ms seems like a significant increase. I think some overhead is acceptable (or at least unavoidable), but we should keep an eye on it in case it starts to grow.

+1
Of course! I'm not deprecating anything yet. But that sort of change would definitely warrant discussion on the mailing list.
I think it's worth it, considering that the output of master's [`gmean` is incorrect for masked input]. With the manual fix to [...]. But yes, for now, I'm suggesting that we trade some [speed for correctness and the improved interface].
And I'm pretty confident we can optimize to make the decorator faster than any code that uses masked arrays (at least in cases where the number of elements per axis-slice is big compared to the number of axis-slices), because masked arrays are so slow. If we manually loop over the columns in your example, e.g.

```python
def f():
    res2 = np.empty(10)
    for i in range(10):
        mask = ~(a_masked[:, i].mask | weights_masked[:, i].mask)
        a_compressed = a_masked.data[:, i][mask]
        weights_compressed = weights_masked.data[:, i][mask]
        res2[i] = gmean(a_compressed, weights=weights_compressed, _no_deco=True)
    return res2
```

I get [...]. Looping over the rows is a different story, but of course we could always make the decorator smart and have it make use of the masked-array capabilities of [...].
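For reference, the masked-array route the decorator could fall back on might look like this for a weighted geometric mean. This is only a sketch of mine; the PR does not do this:

```python
import numpy as np

def gmean_ma(a_masked, weights_masked, axis=None):
    # exp of the weighted average of log(a); np.ma.average honors the joint
    # mask of the data and the weights (cf. the np.ma.average check above).
    return np.exp(np.ma.average(np.ma.log(a_masked), axis=axis,
                                weights=weights_masked))
```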
In comment #14657 (comment):

After reviewing the code, I think the second option is better (not just ad hoc) than the first, because it simply works with stats functions that take an arbitrary number of samples (i.e. functions [...]).
tirthasheshpatel
left a comment
I am a little late to this but went through all the changes and comments. Changes in the decorator look good; tests are very strong. I still need to experiment a little but otherwise, it has my approval. As @Kai-Striega has already experimented, feel free to merge once you think this is ready.
Agreed, and that's one of the reasons it requires fewer changes. There would need to be some sort of special case to handle [...]. Right now, we still have to treat special cases. The usual case is that the function takes a fixed number of samples as its first [positional arguments] [...]. I think a single argument [...]. Anyway, I'm happy with the approach as-is for now, with just a note in the code about the idea: scipy/scipy/stats/_axis_nan_policy.py, lines 359 to 363 in 82675ea
Thanks!
Title changed: "axis and nan_policy to gmean" → "axis tuple and nan_policy to gmean"
@tirthasheshpatel Added that note about [...].
Sorry if that came across poorly. I think it's worth it too; my original comment could have been phrased more positively.
Sounds good. I'd be happy to look at the optimizations, but let's keep that in its own PR (as you have previously mentioned).
No offense taken! I'm a little disappointed that it wasn't faster with the wrapper than without, actually, since it had been faster in the past. But yeah, at some point I'll profile this and see what can be sped up. One thing we've talked about is switching from [...].
I think this is ready when you are @Kai-Striega @tirthasheshpatel!
I had another pass through the changes and they look good. Tests are very comprehensive and strong too. I also had some time to experiment with gmean (mostly checking if it works with broadcastable shape arrays and nan_policy='omit') and it looks good too. There have been no new comments for some time so I think it is safe to merge. Thanks @mdhaber, @Kai-Striega @tupui!
Thanks, all - and @tupui, too! On to `hmean`!
Thanks everyone, good to see this in 😃

Reference issue
gh-14651
What does this implement/fix?
This adds parameters `axis` and `nan_policy` to `gmean`.

Note: initially, these are only added to `gmean`. There is so much variety in what these functions currently do (e.g. with small samples, when nd-arrays are passed without `axis`, etc.) that we really should do these one at a time.

A few important things to think about with functions like these are 1) whether applying the decorator breaks backward compatibility, 2) whether applying the decorator reduces the speed considerably, and 3) whether the wrapper's implementation of `nan_policy='propagate'` makes sense for the function. Some specific things to look for:

- [...] (`default_axis` should be `None`, and perhaps this should be deprecated so we can start using `default_axis=0` like most functions.)
- Before the function had `nan_policy`, what was the behavior in case of NaNs? Did it propagate NaNs correctly, and does the new default `nan_policy='propagate'` propagate NaNs correctly? (See the quick check after this list.)
- If the function accepted `axis` and `nan_policy` arguments before, does it still? Has the documentation been updated correctly?
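As a quick spot check of the NaN-handling items above (illustrative; this assumes the branch from this PR, where `gmean` accepts `nan_policy`):

```python
import numpy as np
from scipy.stats import gmean

x = np.array([[1., 2., np.nan],
              [4., 5., 6.]])

# default nan_policy='propagate': the NaN in the first row propagates to its result
print(gmean(x, axis=1))
# nan_policy='omit': the NaN is dropped before computing the geometric mean
print(gmean(x, axis=1, nan_policy='omit'))
```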