MAINT/ENH: Support for weights and range when estimating optimal number of bins #6288
mn, mx = data_range
keep = (a >= mn)
keep &= (a <= mx)
if not np.logical_and.reduce(keep):
Use np.all instead.
I copied the filter step from https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L400, which came up in #6100. It's actually faster than np.all, though for style reasons I'm happy to use np.all too.
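For reference, the two reductions give the same answer: np.all is essentially a thin wrapper around np.logical_and.reduce, so calling reduce directly just skips a little Python-level overhead. A quick sketch (the array and range values are illustrative):

```python
import numpy as np

# The filter step from the PR, with an illustrative array and range.
a = np.array([0.05, 0.2, 0.5, 0.8, 0.95])
mn, mx = 0.1, 0.9

keep = (a >= mn)
keep &= (a <= mx)

# np.all(keep) and np.logical_and.reduce(keep) compute the same thing;
# the reduce form only avoids a small amount of call overhead.
assert bool(np.logical_and.reduce(keep)) == bool(np.all(keep))
```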
The design choices here look good to me.
Frankly, I am a bit sceptical about automatically guessing the weight type. Maybe it is clear enough; I just wanted to say that it leaves a bit of a bad taste for me, for starters. That said, I think the functionality is not bad at all.
I agree that choosing data_size from weights is far from perfect, but without further input from the user (…). Any better solutions, I'm all ears. If nothing else, we could simply patch in the …
I am not currently convinced that you can always correctly infer the type. So I am still wondering if we cannot find a way to force the user to make a conscious choice.
Fair enough - the way I see it, we need to know two things.

Basically, we need to know how to deal with the datasize: if it's None, we can ignore the weights when estimating; if it's 'sum' or 'size', we know that we have to deal with weights. I.e. one argument. Possible solutions: …

I'm in favour of 3 (or maybe 1), as it requires the fewest API changes, and it's an incremental change.
Would it make sense to add a new kwarg data_size whose value defaults to None?
I'm not overly keen on adding a new kwarg that only applies when a bins estimator is being used, and …
I'm not keen on having lots of non-orthogonal kwargs either, but I like it …
If there are only two options for how to deal with the weights, maybe just do a lookup on the whole string. You effectively double/triple the number of estimators, but that does not seem so bad. From the mpl side, having …
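The whole-string lookup idea could look something like the sketch below. The estimator names ('sturges-weighted') and the helper functions are hypothetical illustrations, not numpy API; the point is just that weighted variants get their own dictionary entries:

```python
import numpy as np

# Hypothetical sketch: dispatch on the full estimator string, so a
# weighted variant is simply a separate entry in the table.
def _sturges(n):
    # Sturges' formula: ceil(log2(n)) + 1 bins.
    return int(np.ceil(np.log2(n)) + 1)

_ESTIMATORS = {
    'sturges': lambda a, w: _sturges(a.size),
    'sturges-weighted': lambda a, w: _sturges(w.sum()),
}

def n_bins(a, estimator, weights=None):
    return _ESTIMATORS[estimator](a, weights)

a = np.arange(100)
w = np.full(100, 0.5)  # weights summing to exactly 50.0
```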
I wonder if it would make sense to pick the behaviour based on the weights dtype (integer vs float). I guess this would interact poorly with downstream libraries like pandas that use floats to represent nullable ints.
Is using data.size even theoretically justified? I can't quite think of any …
Haha, inferring … The estimator is primarily concerned with the size of the data, to minimise the Mean Integrated Squared Error. In general the optimal binwidth ~ n^(-1/3), where n is the datasize. Possible situations: …
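For context, the n^(-1/3) scaling is exactly what a rule like Scott's uses: binwidth h = 3.5 * sigma * n^(-1/3), so larger samples get narrower bins (and hence more of them) at a cube-root rate. A quick sketch:

```python
import numpy as np

# Scott's rule: h = 3.5 * sigma * n**(-1/3). Eightfold more data
# halves the binwidth, since 8**(1/3) == 2.
def scott_binwidth(a):
    return 3.5 * a.std() * a.size ** (-1.0 / 3.0)

rng = np.random.default_rng(0)
small = rng.normal(size=1000)
large = rng.normal(size=8000)
```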
We could potentially turn …
Yeah, the whole discussion is because I am unhappy with guessing what to do with weights based on the type. I have the feeling people are not necessarily aware enough of their type. And accuracy weights might add up to more than the sample size in some cases as well...
I would like to throw in one further option, which may or may not be temporary: ignore the problem ;), maybe by exposing the bin-number functions. Then just throw an error and tell the user to use those functions manually. If weights are not very common for plotting (I have no clue), we could do that reasonably.
I like @seberg's last suggestion (ignore/raise) the best.
So just raising an error if weights are set? That seems reasonable to me, better than designing an awkward API for niche uses.
It would be nice to support count weights + auto binning at some point, but …
Ok, in that case, I'll edit the PR to support the range keyword and raise if weights are passed. With regards to exposing the estimators, would it find a better home in scipy?
It is true that exposing them would fit better into scipy, and I am not a fan of exposing them in top-level numpy. But scipy is not a dependency, so that would mean code duplication.
Removed support for weights, and instead raise a TypeError when weights != None is passed into the function. Support for range still remains, and the test is included too.
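This matches the behaviour released numpy versions with string bin estimators ended up with: range still works together with an estimator string, while combining an estimator string with weights raises a TypeError. A quick check:

```python
import numpy as np

data = np.random.default_rng(0).normal(size=1000)

# `range` still works together with an estimator string; the bin
# edges are clipped to the requested interval.
counts, edges = np.histogram(data, bins='auto', range=(-2.0, 2.0))
assert edges[0] == -2.0 and edges[-1] == 2.0

# Combining an estimator string with `weights` raises a TypeError.
try:
    np.histogram(data, bins='auto', weights=np.ones_like(data))
except TypeError:
    pass
else:
    raise AssertionError("expected a TypeError for weighted data")
```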
@@ -91,6 +101,17 @@ def _hist_optim_numbins_estimator(a, estimator):
    if a.size == 0:
        return 1

    if data_weights is not None:
        raise TypeError("Automated estimation of the number of "
Maybe a ValueError? What's the best error to raise here?
How about a RuntimeWarning? Something along the lines of "FD estimator currently ignores weighted data". Might be better than crashing.
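The warn-and-ignore alternative could look something like the sketch below. The helper name and message text are illustrative only, not anything in numpy:

```python
import warnings
import numpy as np

# Hypothetical alternative to raising: emit a RuntimeWarning and let
# the caller proceed with the weights ignored.
def check_weights(weights):
    if weights is not None:
        warnings.warn("Estimators currently ignore weighted data; "
                      "weights will not affect the bin count.",
                      RuntimeWarning)

# Demonstrate that passing weights triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_weights(np.ones(10))
```

The trade-off discussed above is that a warning keeps plotting code running, at the cost of silently-ish producing a bin count that ignores the weights.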
DOC: typo: change 'pubic' to 'public'.
ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base
☔ The latest upstream changes (presumably #7090) made this pull request unmergeable. Please resolve the merge conflicts.
Wait, did we never put in some kind of warning/error about using this with weights? Sounds like this should still be done!? @madphysicist, maybe you want to have a bit of a look at this, or @nayyarv has time to finish this off?
I skimmed this and the related (closed) PR. I will look into it in more detail as soon as I get a chance. The first thing that comes to mind is to add a …
MAINT: update doc/sphinxext to numpydoc 0.6.0, and fix up some docstrings
Actually, I meant mostly whether we are doing something reasonable with the …
@seberg: Yeah, I think you're right; it looks like in master and 1.11b3, the new …

@nayyarv @madphysicist: Any chance one of you could clean this up quickly so we can get it into 1.11?
DEP: Deprecate as_strided returning a writable array as default
Cancel my previous comment. I took a look at the code more carefully. I can clean this up right now actually. Can I just rebase it onto #7199?
I have rebased the PR onto #7199. @nayyarv, FWIW, I like the version in the …
Looks like the other PR kicked off a rebuild on this one?
Nup, this was me trying to fix it, but it looks like the rebase went a bit crazy.
Looks like you may have rebased master onto your branch instead of the other way around?
If you are OK with my PR, we can just run with that. I did not change any of the functionality, just some wording and whitespace stuff.
Yep, that's exactly what happened. Let's jump over to your PR instead; I'm shutting this down.
Jump over to #7243; this pull request has gotten out of hand with my failure to git properly.
You can recover from a failed rebase, …
Thanks charris, I did manage to fix the rebase, but since I closed this, GitHub won't let me reopen this pull request (since it's been recreated). As such, since madphysicist managed the exact same thing in his PR, I'm happy to let his go through instead.
Following on from PR #6029, the methods provided ignore range or weights, so this PR aims to address this.

The estimators are still defined as separate functions in case there is ever a future need to expose them to users. Each estimator now takes x and weights as arguments. These subfunctions expect weights to be count-like; if it is a probability, the subfunctions won't choose the datasize properly. _hist_optim_numbins_estimator handles the range keyword by masking the appropriate parts of a and weights. It also checks whether weights is of an appropriate dtype (int or float). If weights is probability-like (i.e. weights.sum() < a.size), I scale weights such that weights.sum() = a.size. This means that I effectively choose n = max(a.size, weights.sum()) for my datasize, allowing for decent behaviour whether weights is count-like or probability-like, while keeping things simple.

Freedman-Diaconis requires a weighted IQR to be calculated, and since np.percentile and np.partition do not provide weighted options, this method is only used for unweighted data. Trying to use the 'FD' estimator with weights results in a TypeError. 'auto' now chooses between 'sturges' and 'scott' for weighted data, while retaining the original behaviour for unweighted data.

I've put some basic tests together, made minor changes to the documentation, and would love to hear others' thoughts and feedback on the design choices/implementation/style etc.
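The rescaling described above can be sketched like this. The helper name is hypothetical; it only illustrates the n = max(a.size, weights.sum()) choice from the description:

```python
import numpy as np

# Sketch of the described behaviour: probability-like weights
# (summing to less than a.size) are scaled up so they sum to a.size;
# count-like weights are left alone. The effective datasize is then
# max(a.size, weights.sum()).
def effective_datasize(a, weights):
    total = weights.sum()
    if total < a.size:
        # probability-like: rescale so weights.sum() == a.size
        weights = weights * (a.size / total)
        total = weights.sum()
    return total

a = np.arange(10)
probs = np.full(10, 0.125)   # sums to 1.25, gets scaled up to 10
counts = np.full(10, 3.0)    # sums to 30, left alone
```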