
ENH: Automatic number of bins for np.histogram #6029


Merged: 1 commit into numpy:master on Aug 15, 2015

Conversation

nayyarv
Contributor

@nayyarv nayyarv commented Jun 30, 2015

Hi,
Brought this up on the mailing list and got some support (it then turned into a discussion of the p-Square algorithm for dynamic histogram generation). I also had some support when I originally brought it up with the matplotlib guys (where I had initially planned to put it); you can find the thread here.

I've added support for methods to automatically choose the number of bins based on the data provided. The default signature remains the same; however, users can now pass 'auto' or other strings to have an optimal number of bins chosen. This is especially useful for visualisation libraries.

An (out of date) notebook with my first code attempts can be found here. It discusses the reasoning behind the methods + justification and samples. I've tried to make it slightly simpler since then.

The implementation defines the estimator functions within the histogram function itself, as this allows for easy refactoring and I didn't want to put another five functions into the base library that have no use anywhere else. The code is hopefully easy to refactor/change to better fit numpy's style guide/organisation.

I've included the docstring update as well, with some notes on the different methods and a quick example of how to use the new call. I've tested the code and it seems to work as expected so far. Passing in bins='s' does lead to choosing the sturges estimator (as opposed to the scott estimator), but I can't say for sure what will happen.
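For reference, here is a rough sketch of the four rule-of-thumb estimators under discussion (my own illustration with made-up function names, not the PR's code):

import numpy as np

def sturges_bins(x):
    # Sturges: log2(n) + 1 bins; suits small, roughly normal datasets
    return int(np.ceil(np.log2(x.size) + 1))

def rice_bins(x):
    # Rice rule: 2 * n**(1/3) bins, depending only on the sample size
    return int(np.ceil(2 * x.size ** (1.0 / 3)))

def scott_bins(x):
    # Scott: binwidth h = 3.5 * std / n**(1/3), converted to a bin count
    h = 3.5 * x.std() / x.size ** (1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h))

def fd_bins(x):
    # Freedman-Diaconis: binwidth h = 2 * IQR / n**(1/3); robust to outliers
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h = 2 * iqr / x.size ** (1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h))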

@nayyarv nayyarv changed the title Automatic number of bins for np.histogram ENH: Automatic number of bins for np.histogram Jun 30, 2015
@@ -84,11 +84,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     ----------
     a : array_like
         Input data. The histogram is computed over the flattened array.
-    bins : int or sequence of scalars, optional
+    bins : int or sequence of scalars, str, optional
Member

formatting nitpick: or str instead of , str

@rgommers
Member

histogram has turned into a very long function now, and the implemented estimator functions might be reusable from histogram2d and histogramdd. So I suggest creating a separate function _hist_estimator(x, bins) that parses the string (with input validation) and returns the number of bins.
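A minimal sketch of what such a helper could look like, reusing the illustrative estimator functions above (the dict contents here are an assumption, not the PR's actual table):

def _hist_estimator(x, bins):
    # Validate the string, then dispatch to the matching estimator
    # and return a concrete number of bins.
    estimators = {'sturges': sturges_bins, 'rice': rice_bins,
                  'scott': scott_bins, 'fd': fd_bins}
    key = bins.lower()
    if key not in estimators:
        raise ValueError("%r is not a valid estimator for `bins`" % bins)
    return estimators[key](x)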

        break
else:
    # Maybe raise a Warning or something? Printing to stderr for now
    print("Automatic method '{}' not found, reverting to default".format(bins),
          file=sys.stderr)
Member

print statements should never be used. Raise ValueError here.

Member

Also, normally you check for valid input first. So:

if name.lower() not in optimalityMethods.keys():
    raise ValueError("%s not a valid method for `bins`" % name)

bins = estimator(a)

@rgommers
Member

As a bonus, it would be nice if the docstring had an example that used 'auto' and actually created a plot with it. There are examples for other functions that use matplotlib for plots that you could copy.
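Something along these lines would do; a minimal sketch (the data and seed here are made up, not taken from the PR):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(10)  # fixed seed keeps the plot deterministic
a = np.hstack((rng.normal(size=1000),
               rng.normal(loc=5, scale=2, size=1000)))
plt.hist(a, bins='auto')  # number of bins chosen by the estimator
plt.title("Histogram with 'auto' bins")
plt.show()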

@rgommers
Member

Some style comments, but overall +1 on this addition.

# measure of variability - Mean Absolute Deviation (mad)
iqr = mad(x)

if iqr > 0:
Contributor Author

I can't really do an if-elif block: I replace the value of iqr with the mad if the iqr is 0. If the mad is 0 too, then I return 1 as the number of bins; otherwise I calculate the FD estimator.
I did move the return 1 into the if iqr block, though.
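In code, the fallback chain being described is roughly this (a sketch of the logic, not the exact diff):

import numpy as np

def fd_bins(x):
    # Freedman-Diaconis with the fallback described above
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    if iqr == 0:
        # degenerate IQR: reuse the variable for the mean absolute deviation
        iqr = np.mean(np.abs(x - x.mean()))
        if iqr == 0:
            return 1  # all spread measures are zero, one bin suffices
    h = 2 * iqr / x.size ** (1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h))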

@nayyarv
Contributor Author

nayyarv commented Jul 3, 2015

Thanks for the feedback, I've put together the commits that should address the issues.

As it currently stands, _hist_bandwidth_estimator should not be used for the 2d or dd histograms; the estimators' binwidth suggestions depend on the number of dimensions (there is an implicit d=1). I have limited knowledge of optimal bin measures in higher dimensions, so I can't go beyond that just yet.

Some questions/thoughts

  1. Should I switch to a bunch of if-elif statements inside _hist_bandwidth_estimator as opposed to the current nested functions?
  2. Would it be better to put the information on the estimators in the Notes and provide a link to the Notes heading? As it is, I think the Parameters section has become too verbose and goes beyond the scope of the section.

@jaimefrio
Member

Shouldn't bandwidth be replaced throughout with binwidth?

@nayyarv
Contributor Author

nayyarv commented Jul 13, 2015

Changed all mentions of bandwidth to either binwidth or optimal_numbins to better reflect the nature of the function.

I've included a picture here of what the example in the docstring produces, so it doesn't have to be left to the imagination.
[image: histogram]

@seberg
Member

seberg commented Jul 13, 2015

Good to see this converge, but unfortunately we will need tests before we can merge anything.

The tests should go into numpy/numpy/lib/tests/test_function_base.py.
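For illustration, a test along these lines would fit there (a sketch, not the tests that were eventually merged):

import numpy as np
from numpy.testing import assert_, assert_equal

def test_automatic_bin_estimators():
    rng = np.random.RandomState(42)
    a = rng.normal(size=1000)
    for estimator in ['auto', 'fd', 'scott', 'rice', 'sturges']:
        hist, edges = np.histogram(a, bins=estimator)
        # every estimator should produce a usable, positive bin count
        assert_(hist.size > 0, "estimator %r returned no bins" % estimator)
        # and the counts must still sum to the sample size
        assert_equal(hist.sum(), a.size)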

@homu
Contributor

homu commented Jul 23, 2015

☔ The latest upstream changes (presumably #6100) made this pull request unmergeable. Please resolve the merge conflicts.

@nayyarv
Contributor Author

nayyarv commented Jul 23, 2015

I've dealt with the upstream changes and included a bunch of unit tests that cover the basic functionality of the new automated methods.
Furthermore, I squashed all the past commits into a single commit so it's not all over the place like before.

@homu
Contributor

homu commented Jul 26, 2015

☔ The latest upstream changes (presumably #6115) made this pull request unmergeable. Please resolve the merge conflicts.

@tacaswell
Contributor

Please ping me when this gets merged so the matplotlib docs can be updated.

@nayyarv force-pushed the master branch 2 times, most recently from 1b30a29 to 8eed69a on July 26, 2015 08:25
@ewmoore
Contributor

ewmoore commented Aug 14, 2015

I'm not sure we should be counting on plt.hist to do that. Is that documented somewhere? Also I think the example for a function should probably call that function.

On Fri, Aug 14, 2015 at 9:27 AM, Varun Nayyar [email protected] wrote:

Thanks @shoyer https://github.com/shoyer, I took your suggestions on board for the latest version of my PR. Other things that I did in this commit:

  1. The parameter section in the docstring was getting too large, so instead I moved the equations and explanations to the Notes while leaving a brief one-liner in Parameters for users.
  2. The example section now uses plt.hist since the method doesn't actually check input; it simply passes things on to np.histogram, which makes things a little simpler. I've also used a RandomState to make the plot deterministic, as suggested.
  3. Added some extra information to the assert error messages, as my test setup made it difficult to work out which particular method was failing. Also added a test to check behaviour with small values, since FD uses percentile, which has limited meaning when there is only one datapoint.

I also made sure to generate the docfiles to check it looks ok; I've included screenshots here of the relevant sections, as generated on my computer.
Parameters

[image: screenshot 2015-08-14 23 19 35]
https://cloud.githubusercontent.com/assets/1589119/9274669/facb2ff6-42da-11e5-8589-ca8ff230aa9c.png
Notes

[image: screenshot 2015-08-14 23 22 52]
https://cloud.githubusercontent.com/assets/1589119/9274732/6e290130-42db-11e5-8b39-95106c83f5d3.png
Examples

[image: screenshot 2015-08-14 23 15 53]
https://cloud.githubusercontent.com/assets/1589119/9274605/97cbec10-42da-11e5-9dc8-273d05dead51.png



@tacaswell
Contributor

@ewmoore From the mpl side I am counting on plt.hist to behave that way. We do not do any validation on the bins input and just pass it through to np.histogram. Once this is merged and released this behavior will be documented on our side.

Part of the history of this is that @nayyarv put a patch into mpl to add the auto logic and I suggested he try putting it in numpy instead 😉 .

@seberg
Member

seberg commented Aug 14, 2015

Well, I guess you could argue that the docs are a bit backward, but we have plots in examples, and frankly showing some plotting might be nice for the user and does give the right idea. I would be fine with additionally adding the same/similar example without plotting.
The plot example also still has two PEP8 style issues (spaces around = in the function call and two spaces before #).

In any case, I think since @shoyer had some close look now too, I am willing to put it in as is.

@nayyarv, however, there are two more real things left, sorry :(. Could you add the .. versionadded:: 1.11.0 tag [1]? np.take has an example; the empty lines are important.
Could you also include it in the release notes (they are in numpy/doc/release)? I think it is a neat new feature that we should mention in the improvements section.

[1] Or do we want to squeeze it into 1.10? I tend to think rather not, just out of principle, plus I somewhat hope 1.11 will be a fast one.

@tacaswell
Contributor

We are all friends here 😄 .

@seberg
Member

seberg commented Aug 14, 2015

Uh, what happened? I hope this still exists?

@nayyarv
Contributor Author

nayyarv commented Aug 14, 2015

Sorry, git mistake. I was trying to squash the changes into one commit and it ended up going south.

While I have you: by something additional, did you mean something like this

>>> heights, edges = np.histogram(a, bins='auto')
>>> heights
array([  5,  34, 135, 274, 333, 195,  94,  82,  98, 145, 143, 157, 121, 89,  50,  24,  16,   5])
>>> len(heights)
18

followed by the plot?

@seberg
Member

seberg commented Aug 14, 2015

Yeah, but I don't care much frankly, just pondering. Would seem fine, just leaving it is also fine. Or you could give the length of the heights returned, that shows quite obviously that the bin number was chosen somehow.

@nayyarv nayyarv reopened this Aug 14, 2015
…Users can now pass in bins='auto' (or 'scott', 'fd', 'rice', 'sturges') and have the corresponding rule-of-thumb estimator provide a decent estimate of the optimal number of bins for the given data.
@nayyarv
Contributor Author

nayyarv commented Aug 14, 2015

Alright, all put together now and squashed into one commit. I didn't put in the extra example; I think the plot covers the idea that the number of bins has been chosen automatically.

I'd be keen to squeeze this into 1.10 (considering how often I use histograms), but I'm not too fussy.

@njsmith
Member

njsmith commented Aug 14, 2015

Let's stick with the process and keep the 1.10 branch to bug fixes only. We have to make a cutoff somewhere, and it will always be tempting to try and sneak just one more thing past it; people have already started QA'ing the beta... 1.11 will be along soon enough.

@@ -140,6 +252,48 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
     4.

+    .. versionadded:: 1.11.0
Member

I would say putting it only with the keyword is sufficient (but frankly we have reached bikeshedding). The other option we use is putting it only in the Notes section. So I will put this in soon (unless someone beats me to it). Let's not start breaking rules again; rather, hope that 1.11 comes soon. I feel there may be some nice things coming up soon enough ;).

@nayyarv
Contributor Author

nayyarv commented Aug 15, 2015

No worries, @seberg, I'll wait for 1.11 then.

I didn't want the parameter section to be drowned out by discussion of the automatic methods (which is what happened originally), but I wanted users to instantly know what the differences were without jumping to the Notes. This seemed like a decent compromise. I'm happy for the docs to be updated at a later date when someone has a better way of going about things.

@seberg
Member

seberg commented Aug 15, 2015

OK, putting this in. If someone still finds some minor doc things or such, just open a new pull request. Thanks a lot @nayyarv for your patience, and everyone who helped reviewing!

seberg added a commit that referenced this pull request Aug 15, 2015
ENH: Automatic number of bins for np.histogram
@seberg seberg merged commit 6e8b869 into numpy:master Aug 15, 2015
@tacaswell
Contributor

From the mpl side, I would like to see this in 1.10 as it is something that we get asked for often, but I understand holding the line on feature freeze (even if I am really bad at it).


@nayyarv
Contributor Author

nayyarv commented Aug 16, 2015

Having had a look at the places this PR and the original mpl issue have been referenced, I have just realised the optimal bin methods don't consider weights or range at all. For example, if there were 1000 samples in total but only 100 inside range, the methods all assume the full data size and overestimate the number of bins.

Similarly with weighting: if there were 50 values with weight 20 each, a.size will be 50 instead of the actual 1000 the methods should be using.

I've gone through some quick fixes and updates here: master...nayyarv:autobinsWeighted, though it's made things a little more confusing since the functions are no longer standalone. I need to simplify things a bit and add tests before I make a PR - this is a heads up.

An aside:
np.percentile doesn't support weights, and neither does np.partition, which means making an O(n) np.percentile that supports weights requires a lot of work. I've written a simple one that uses np.sort and np.searchsorted, so it's about O(n log n) time and matches np.percentile(np.repeat(a, weights), q) exactly without the memory overheads (though it could be quicker for small weights). It still needs testing, though randomised testing hasn't thrown any errors yet.
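For the curious, the kind of approach described might look like the sketch below; weighted_percentile is a hypothetical name, and positive integer weights are assumed:

import numpy as np

def weighted_percentile(a, q, weights):
    # Emulates np.percentile(np.repeat(a, weights), q) in O(n log n),
    # without materialising the repeated array.
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights)
    order = np.argsort(a)
    a_sorted = a[order]
    cum_w = np.cumsum(weights[order])  # virtual index boundaries
    n = int(cum_w[-1])                 # effective sample size

    # np.percentile interpolates linearly at fractional rank q/100 * (n - 1)
    pos = q / 100.0 * (n - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, n - 1)

    # the value at virtual index j is the first datum whose cumulative
    # weight exceeds j
    left = a_sorted[np.searchsorted(cum_w, lo, side='right')]
    right = a_sorted[np.searchsorted(cum_w, hi, side='right')]
    return left + (pos - lo) * (right - left)

For example, weighted_percentile([2, 3], 50, [1, 2]) gives 3.0, matching np.percentile(np.repeat([2, 3], [1, 2]), 50).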

What would be the best solution to this?

  1. Include my O(n log n) function and wait for np.percentile to support weights?
  2. Use np.percentile(np.repeat(a, weights), q) and put warnings up?
  3. Disallow FD for weighted data and default to Scott (since the weighted standard deviation is easier to calculate)?

@seberg
Member

seberg commented Aug 16, 2015

First of all, is it really correct that if each sample has a weight of 20 you would multiply the number of samples by 20? Number 3 might be an option, though I dislike that weights=np.ones(...) would not give the same result.

@njsmith
Member

njsmith commented Aug 16, 2015

Are weights required to be integers, as the repeat(...) implementation assumes?

@shoyer
Member

shoyer commented Aug 16, 2015

In my experience, fractional weights are also somewhat common. They are useful for modeling discrete probability distributions.


@nayyarv
Contributor Author

nayyarv commented Aug 17, 2015

Well, the effective sample size is the sum of the weights, but this assumes the weights are counts, i.e. repeat(a, weights) would return the full sample. So in this case, a weight of 20 on each of 50 samples does give an effective sample size of 1000 (and the difference between n^(1/3) being ~4 or 10).

np.histogram as a function doesn't care what the weights are - it simply adds the appropriate weight value into the bin, which allows for things like negative, complex and fractional weights.

In terms of the estimators, they need to account for variability and size. When you have data like [2,3,4] and weights of [4,1,2], you have an effective sample size of 7, not 3, and should calculate accordingly. Weights of [0.3, 0.4, 0.3], [1+i, 2+3i, 4-i], or [-5, 1, 3] don't really tell us about effective sample size. I'm not sure what to do with something like [30.2, 29.8, 22.34].

Furthermore, using the weighted standard deviation or weighted percentiles also assumes the weights represent counts/frequencies of some kind.
For the sd, having a negative weight is identical to having the positive weight with the additive inverse of the x-value, and a complex weight sends the variability calculation into complex space. Fractional (0.3) or decimal (30.2) weights shouldn't be a problem for the weighted sd, as long as they're divided by the sum of the weights.
Similarly for percentile, it's a positional question: if the weights are negative, what does that mean when calculating a percentile? With x = [2,3] and weights [1, -2], what's the median? Even with decimal weights [1.2, 1.8], what's the median? It corresponds to a cdf at count 1.5, so is it 3, or some linear interpolation of 2 and 3? If the weights were [1,1] the median would be 2.5, and if the weights were [1,2] the median is clearly 3. Complex numbers don't help at all.

Since non-whole weights make no sense in the estimators, maybe restrict the optimal-bins calculation to weights that are positive integers (which should cover a large majority of use cases)?
Or maybe ignore non-whole weights and simply work with a and range, though in such use cases the number of bins / bin edges are likely predetermined anyway.
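For reference, the weighted standard deviation mentioned here is straightforward with np.average, assuming nonnegative count-like weights (a sketch, not PR code):

import numpy as np

def weighted_std(x, weights):
    # weights act as counts/frequencies; np.average normalises by their sum
    mean = np.average(x, weights=weights)
    return np.sqrt(np.average((x - mean) ** 2, weights=weights))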

@njsmith
Member

njsmith commented Aug 17, 2015

It wouldn't be the end of the world if bins="auto" gave an error for some otherwise valid weights= arguments. But handling at least fractional weights would be nice if there's a reasonable way; they are reasonably common. (And we should have some reasonable, tested behavior for all cases, even if that is just returning a nice error.)

@nayyarv
Contributor Author

nayyarv commented Aug 17, 2015

Fair enough. I can deal with fractional weighting by using a.size instead of weights.sum() for the sample size, and raise errors for unsupported weight types.

Though since np.percentile doesn't support weights, 'auto' should only call FD for unweighted data and Scott for weighted data (though like seberg, I'm not overly happy with this - maybe a quick check that np.all(weights == weights[0]) somewhere at the start?).

@seberg
Member

seberg commented Aug 17, 2015

Indeed, it is not as if "auto" is a default; it is just the suggested default (when bins is not given). I do not like the explosion of combinations/complexity, and I definitely do not want to guess what type of weights we have based on the data type!

Either we ignore the problem and document that weights are ignored for these (which is a bit dubious), or we just throw an error.

If someone needs more, I think the only option would be to expose the estimator functions and add different types of weights to them (i.e. aweights/fweights), and that might be more the type of thing for a statistics package to handle.
Trying to fit that into histogram itself seems like too much complexity for a function that does not need to care about where the weights came from.

Or we just pull them out of histogram completely again, expose the (more complicated) estimators somewhere and have matplotlib add the bins="method" syntactic sugar.
