MAINT/ENH: Support for weights and range when estimating optimal number of bins #6288


Closed
nayyarv wants to merge 822 commits

Conversation

@nayyarv
Contributor

nayyarv commented Sep 7, 2015

Following on from PR #6029, the estimation methods provided there ignore range and weights, so this PR aims to address that.

The estimators are still defined as separate functions in case there is ever a future need to expose them to users. Each estimator now takes x and weights as arguments. These subfunctions expect weights to be count-like; if weights is probability-like, the subfunctions won't choose the data size properly.

_hist_optim_numbins_estimator handles the range keyword by masking the appropriate parts of a and weights. It also checks whether weights is of an appropriate dtype (int or float). If weights is probability-like (i.e. weights.sum() < a.size), I scale weights such that weights.sum() == a.size. This means that I effectively choose n = max(a.size, weights.sum()) for my data size, allowing for decent behaviour whether weights is count-like or probability-like, while keeping things simple.
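
A minimal sketch of that data-size heuristic (the helper name `_effective_data_size` is illustrative, not a function in this PR):

```python
import numpy as np

def _effective_data_size(a, weights):
    # Sketch of the heuristic above: count-like weights sum to >= a.size,
    # probability-like weights sum to < a.size and are rescaled up, so the
    # effective n is simply the larger of the two quantities.
    if weights is None:
        return a.size
    return max(a.size, float(weights.sum()))
```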

Freedman-Diaconis requires a weighted IQR to be calculated, and since neither np.percentile nor np.partition provides a weighted option, this method is only used for unweighted data. Trying to use the 'FD' estimator with weights results in a TypeError. 'auto' now chooses between 'sturges' and 'scott' for weighted data, while retaining the original behaviour for unweighted data.
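
For reference, a rough sketch of that weighted fallback using the textbook forms of the rules (the function name and the exact selection logic are illustrative; the PR's code may differ):

```python
import numpy as np

def _auto_nbins(x, n, weighted=False):
    # Sturges' formula: bin count grows with log2 of the sample size.
    sturges = int(np.ceil(np.log2(n))) + 1
    # Scott's rule: bin width h = 3.5 * sigma * n**(-1/3).
    h_scott = 3.5 * x.std() * n ** (-1.0 / 3.0)
    scott = int(np.ceil(np.ptp(x) / h_scott))
    if weighted:
        # FD needs a weighted IQR, which numpy does not provide, so the
        # weighted path has to choose between Sturges and Scott.
        return max(sturges, scott)
    # Freedman-Diaconis: bin width h = 2 * IQR * n**(-1/3) (assumes IQR > 0).
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h_fd = 2.0 * iqr * n ** (-1.0 / 3.0)
    return max(sturges, int(np.ceil(np.ptp(x) / h_fd)))
```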

I've put together some basic tests and minor documentation changes, and would love to hear others' thoughts and feedback on the design choices/implementation/style etc.

# build a mask of the values that fall inside the requested range
mn, mx = data_range
keep = (a >= mn)
keep &= (a <= mx)
if not np.logical_and.reduce(keep):
Member

use np.all instead

Contributor Author

I copied the filter step from https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L400 which came up in #6100.

It's actually faster than np.all, though for style reasons I'm happy to use np.all too.
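
For reference, the two spellings agree on a boolean mask; the speed gap is plausibly just np.all's extra dispatch overhead (a quick sanity check, sketch only):

```python
import numpy as np

a = np.random.rand(1000000)
keep = (a >= 0.25) & (a <= 0.75)

# Both reduce the mask with logical AND; np.all goes through an extra
# wrapper layer before reaching the same reduction.
assert bool(np.all(keep)) == bool(np.logical_and.reduce(keep))
```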

@shoyer
Member

shoyer commented Sep 7, 2015

The design choices here look good to me.

@seberg
Member

seberg commented Sep 7, 2015

Frankly, I am a bit sceptical about automatically guessing the weight type. Maybe it is clear enough; I just wanted to say that it leaves a bit of a bad taste for me, for starters.

That said, I think the functionality is not bad at all.

@nayyarv
Contributor Author

nayyarv commented Sep 8, 2015

I agree that choosing data_size from weights is far from perfect, but without further input from the user (**kwargs or another argument in histogram)/exposing the estimators, it's the best I can do.
After our discussion at the end of the last PR, we decided the two main use cases for weights + automatic bin estimation were either count-like or probability-like weights (this matched friends' and colleagues' thoughts).
If the weights were count-like, their sum would be greater than x.size, and if they were probability-like, their sum would be less, which is why I use the max(a.size, weights.sum()) logic.
If the user was using the weights option as a bin summation method, it's unlikely they would use automatic estimators, but rather fixed bin edges - so situations in which the weights are not count/probability-like can be ignored.

Any better solutions, I'm all ears. If nothing else, we could simply patch in the range support and deal with weights at a later date.

@seberg
Member

seberg commented Sep 8, 2015

I am not currently convinced that you can always correctly infer the type. So I am still wondering if we cannot find a way to force the user to make a conscious choice.

@nayyarv
Contributor Author

nayyarv commented Sep 9, 2015

Fair enough - the way I see it, we need to know 2 things

  1. Whether or not to consider the weights when estimating the number of bins
  2. If we do consider the weights, what should the data size be - weights.sum() or a.size.

Basically, we need to know how to deal with datasize - if it's None, we can ignore the weights when estimating, if it's 'sum' or 'size', we know that we have to deal with weights. I.e. 1 argument.

Possible solutions,

  1. Expose the estimators for manual control and let np.histogram provide simple default behaviour where it tries to guess (Maybe move the estimators to scipy.stats?)
  2. Add another argument to np.histogram for instructions on how to deal with weights, weights_sum=True or something.
  3. Add options to the bins string, like 'auto sum' or 'scott size', which can then be split to retrieve the estimator and how to deal with weights. If the second term isn't included, ignore weights in the estimation. This also means the bins checking logic remains simple, as opposed to making it a tuple or list which is also iterable.
  4. Something similar to 3, except maybe combined into the weights argument as a tuple or iterable?

I'm in favour of 3 (or maybe 1), as it requires the least amount of API changes, and it's an incremental change.
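
A rough sketch of what the option-3 parsing could look like (the 'sum'/'size' tokens and the helper name are hypothetical, not part of this PR):

```python
def _parse_bins_string(bins, a, weights):
    # Hypothetical parser for strings like 'auto', 'scott size', 'auto sum'.
    parts = bins.split()
    estimator = parts[0]
    if len(parts) == 1 or weights is None:
        n = a.size                 # no instruction: ignore weights
    elif parts[1] == 'sum':
        n = weights.sum()          # count-like weights
    elif parts[1] == 'size':
        n = a.size                 # probability-like weights
    else:
        raise ValueError("unknown weights mode: %r" % parts[1])
    return estimator, n
```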

@njsmith
Member

njsmith commented Sep 9, 2015

Would it make sense to add a new kwarg data_size whose value defaults to weights.sum()?

@nayyarv
Contributor Author

nayyarv commented Sep 9, 2015

I'm not overly keen on adding a new kwarg that only applies when a bins estimator is being used and weights is not None. Seems like overkill, and I would like to avoid R style functions with their mostly unused arguments.

@njsmith
Member

njsmith commented Sep 9, 2015

I'm not keen on having lots of non-orthogonal kwargs either, but I like it a lot better than the idea of defining a little ad hoc string language inside a single string kwarg. The other general option that comes to mind would be using something that has more structure than a str, like a dict or an object, but I'm not seeing a terribly natural way to apply that trick here myself...

@tacaswell
Contributor

If there are only two options for how to deal with the weights, maybe just do a lookup on the whole string. You effectively double/triple the number of estimators, but that does not seem so bad.

From the mpl side, having histogram gain any kwargs would greatly complicate the argument handling in hist.

@shoyer
Member

shoyer commented Sep 10, 2015

I wonder if it would make sense to pick data_size based on the dtype of the array:

  • int: use data.sum()
  • float: use data.size

I guess this would interact poorly with downstream libraries like pandas that use floats to represent nullable-int.

@njsmith
Member

njsmith commented Sep 10, 2015

Is using data.size even theoretically justified? I can't quite think of any reason why it's a more natural choice than any other arbitrary number (though I admit my intuition about these algorithms is very limited).


@nayyarv
Contributor Author

nayyarv commented Sep 10, 2015

Haha, inferring data_size from the dtype is what we're unhappy with, so we're looking at a way to get the user to give us input without changing function signatures too much.

The estimator is primarily concerned with the size of the data, to minimise the Mean Integrated Squared Error. In general the optimal bin width scales as n^(-1/3), where n is the data size.

Possible situations

  1. Now if weights represents counts, this means that np.repeat(a, weights) would give us the full data set and hence n should be weights.sum(). This makes sense for integer weights and some float weights (if there's reasoning behind it). See the sketch after this comment.
  2. In a situation where the weights represent a probability/proportion, the calculation of the standard deviation or the IQR (for Scott's and FD's rules respectively) is altered, while the data size should be the number of points, not weights.sum(), which would evaluate to 1.
  3. Finally, if there is no meaning to the weights and they are simply for the summation, the weights should be ignored for both the standard deviation and the data size. This covers something exotic like complex weights, or matched pairs, e.g. (Age, Wealth), that you want to sum up per group.

We could potentially turn bins into a dict (or a list or tuple) for automatic methods, e.g.
{method='auto', weights=['sum', 'size', 'ignore']}, and for bins='auto' we either ignore weights by default or try and guess (max(weights.sum(), a.size) for data_size) for best average-case behaviour.
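
To make case 1 above concrete, a small sketch showing that integer count weights are equivalent to repeating the data:

```python
import numpy as np

a = np.array([1.0, 2.0, 5.0])
counts = np.array([3, 1, 2])

expanded = np.repeat(a, counts)   # the full data set the weights stand for
n = counts.sum()                  # effective sample size, not a.size

assert expanded.size == n == 6
# so a size-based rule such as binwidth ~ n**(-1/3) should be fed
# counts.sum(), not a.size, when the weights are count-like
```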

@seberg
Member

seberg commented Sep 10, 2015 via email

@seberg
Member

seberg commented Sep 10, 2015

I would like to throw in one further option, which may or may not be temporary:

Ignore the problem ;), maybe by exposing the bin number functions. Then just throw an error and tell the user to use those functions manually. If weights are not very common for plotting (I have no clue), we could do that reasonably.

@tacaswell
Contributor

I like @seberg's last suggestion (ignore/raise) the best.


@shoyer
Member

shoyer commented Sep 10, 2015

So just raising an error if weights are set? That seems reasonable to me, better than designing an awkward API for niche uses.


@njsmith
Member

njsmith commented Sep 10, 2015

It would be nice to support count weights + auto binning at some point, but we can always do that as an enhancement later once we've figured out how to actually make it work right :-). And until then, raising an error is a safe way to keep our options open, so +1.


@nayyarv
Contributor Author

nayyarv commented Sep 11, 2015

Ok, in that case, I'll edit the PR to support the range keyword only and throw an error when weights is not None. R doesn't provide an option for weights with its hist method (which is primarily visual), and I'm not sure how many situations with 'auto' bin estimation will have weights too. I suppose we'll find out after 1.11 is released.

With regards to exposing the estimators, would it find a better home in scipy.stats instead?

@seberg
Member

seberg commented Sep 11, 2015

It is true that exposing them would fit better into scipy, and I am not a fan of exposing them in top-level numpy. But scipy is not a dependency, so that would mean code duplication.
So, I know I suggested that we may expose it, but frankly I have no clue what it would look like :(.

@nayyarv
Contributor Author

nayyarv commented Sep 22, 2015

Removed support for weights; a TypeError is now raised when weights is not None. Support for range still remains, and the test is included too.

@@ -91,6 +101,17 @@ def _hist_optim_numbins_estimator(a, estimator):
    if a.size == 0:
        return 1

    if data_weights is not None:
        raise TypeError("Automated estimation of the number of "
Contributor Author

Maybe a ValueError? What's the best error to raise here?

Contributor

How about a RuntimeWarning? Something along the lines of "the FD estimator currently ignores weighted data". Might be better than crashing.
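
A sketch of the warn-and-ignore alternative suggested here, as opposed to raising (the helper name is hypothetical):

```python
import warnings

def _check_estimator_weights(weights):
    # Hypothetical helper: warn and fall back to unweighted estimation
    # instead of raising, so the histogram call still succeeds.
    if weights is not None:
        warnings.warn("automated bin estimation ignores weighted data",
                      RuntimeWarning)
```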

rgommers and others added 2 commits February 13, 2016 15:22
DOC: typo: change 'pubic' to 'public'.
ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base
@homu
Contributor

homu commented Feb 13, 2016

☔ The latest upstream changes (presumably #7090) made this pull request unmergeable. Please resolve the merge conflicts.

@seberg
Member

seberg commented Feb 13, 2016

Wait, did we never put in some kind of warning/error about using this with weights? Sounds like this should still be done!? @madphysicist, maybe you want to have a bit of a look at this, or @nayyarv has time to finish this off?

@madphysicist
Contributor

I skimmed this and the related (closed) PR. I will look into it in more detail as soon as I get a chance. The first thing that comes to mind is to add a weightedfunctions module similar to nanfunctions. However, even at my most optimistic, I have to admit this is probably not a reasonable solution.

MAINT: update doc/sphinxext to numpydoc 0.6.0, and fix up some docstrings
@seberg
Member

seberg commented Feb 13, 2016

Actually, I meant mostly whether we are doing something reasonable with the automatic bins right now. Error/warning/ignore, I don't know, but just to make sure it makes sense and is documented. We could try to make it correct, but I expect it is not easy.

@njsmith
Member

njsmith commented Feb 13, 2016

@seberg: Yeah, I think you're right -- it looks like in master and 1.11b3, the new bins="auto" feature is kinda broken: np.histogram will happily accept bins="auto", weights=..., and then return nonsense results. Similarly for bins="auto", range=.... This PR looks like a good solution to me, but has conflicts due to other cleanups to this code...

@nayyarv @madphysicist: Any chance one of you could clean this up quickly so we can get it into 1.11?

njsmith added this to the 1.11.0 release milestone Feb 13, 2016
seberg and others added 2 commits February 13, 2016 22:26
@madphysicist
Contributor

Cancel my previous comment. I took a look at the code more carefully. I can clean this up right now actually. Can I just rebase it onto #7199?

# Conflicts:
#	numpy/lib/function_base.py
#	numpy/lib/tests/test_function_base.py
@madphysicist
Contributor

I have rebased the autobins branch onto the latest master with all that that entails. I did add a couple of grammatical and line-wrapping fixes. The result is PR #7243. I was unable to make a PR back to @nayyarv's fork of numpy.

@nayyarv, FWIW, I like the version in the autobinsWeighted branch better, even if this version is much simpler. Short-term, this is the best solution (although I would prefer to warn rather than raise an error). Medium-term, the autobinsWeighted solution is probably better. Long-term, I think adding a weightedfunctions module to do percentile, mean, median, etc. on weighted data would be optimal.
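
For context, one common way such a module could implement a weighted percentile (one of several possible definitions; a sketch only, not code from any branch here):

```python
import numpy as np

def weighted_percentile(x, weights, q):
    # Percentile of the distribution in which x[i] carries mass weights[i];
    # q is in [0, 100]. Interpolates along the weighted empirical CDF.
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / w.sum()
    return np.interp(q / 100.0, cdf, xs)
```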

@madphysicist
Contributor

Looks like the other PR kicked off a rebuild on this one?

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

Nup, this was me trying to fix it, but looks like the rebase went a bit crazy.

@madphysicist
Contributor

Looks like you may have rebased master onto your branch instead of the other way around?

@madphysicist
Contributor

If you are OK with my PR, we can just run with that. I did not change any of the functionality, just some wording and whitespace stuff.

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

yep, that's exactly what happened. Let's jump over to your PR instead, I'm shutting this down.

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

Jump over to #7243, this pull request has gotten out of hand with my failure to git properly.

nayyarv closed this Feb 14, 2016
@charris
Member

charris commented Feb 14, 2016

You can recover from a failed rebase; git reflog is your friend. See http://stackoverflow.com/questions/134882/undoing-a-git-rebase.

@nayyarv
Contributor Author

nayyarv commented Feb 14, 2016

Thanks @charris, I did manage to fix the rebase, but since I closed it, GitHub won't let me reopen this pull request (since it's been recreated). As such, since madphysicist achieved the exact same thing in his PR, I'm happy to let his go through instead.

charris removed this from the 1.11.0 release milestone Feb 19, 2016