
ENH: Automatic number of bins for np.histogram #6029


Merged: 1 commit into numpy:master on Aug 15, 2015

Conversation

nayyarv
Contributor

@nayyarv nayyarv commented Jun 30, 2015

Hi,
Brought this up on the mailing list and got some support (it then turned into a discussion of the p-Square algorithm for dynamic histogram generation). I also had some support when I originally brought it up with the matplotlib guys (where I had initially planned to put it); you can find the thread here.

I've added support for methods to automatically choose the number of bins based on the data provided. The default signature remains the same; however, users can now pass 'auto' or other strings to have an optimal number of bins chosen. This is especially useful for visualisation libraries.

An (out of date) notebook with my first code attempts can be found here. It discusses the reasoning behind the methods + justification and samples. I've tried to make it slightly simpler since then.

The implementation defines the estimator functions within the histogram function itself, as this allows for easy refactoring and I didn't want to put another five functions into the base library that have no use anywhere else. The code is hopefully easy to refactor/change to better fit numpy's style guide/organisation.

I've included the docstring update as well, with some notes on the different methods and a quick example of how to use the new call. I've tested the code and it seems to work as expected so far. Passing in bins='s' does lead to choosing the sturges estimator (as opposed to the scott estimator), but I can't say for sure what will happen.
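For reference, here is a rough sketch of the four rule-of-thumb estimators under discussion (my own illustration with made-up function names, not the PR's code):

import numpy as np

def sturges_bins(x):
    # Sturges: log2(n) + 1 bins; suits small, roughly normal datasets
    return int(np.ceil(np.log2(x.size) + 1))

def rice_bins(x):
    # Rice rule: 2 * n**(1/3) bins, depending only on the sample size
    return int(np.ceil(2 * x.size ** (1.0 / 3)))

def scott_bins(x):
    # Scott: binwidth h = 3.5 * std / n**(1/3), converted to a bin count
    h = 3.5 * x.std() / x.size ** (1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h))

def fd_bins(x):
    # Freedman-Diaconis: binwidth h = 2 * IQR / n**(1/3); robust to outliers
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h = 2 * iqr / x.size ** (1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h))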

@nayyarv nayyarv changed the title Automatic number of bins for np.histogram ENH: Automatic number of bins for np.histogram Jun 30, 2015
@@ -84,11 +84,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     ----------
     a : array_like
         Input data. The histogram is computed over the flattened array.
-    bins : int or sequence of scalars, optional
+    bins : int or sequence of scalars, str, optional
Member

formatting nitpick: or str instead of , str

@rgommers
Member

histogram has turned into a very long function now, and the implemented estimator functions might be reusable from histogram2d and histogramdd. So I suggest creating a separate function _hist_estimator(x, bins) that parses the string (with input validation) and returns the number of bins.
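A minimal sketch of what such a helper could look like, reusing the illustrative estimator functions above (the dict contents here are an assumption, not the PR's actual table):

def _hist_estimator(x, bins):
    # Validate the string, then dispatch to the matching estimator
    # and return a concrete number of bins.
    estimators = {'sturges': sturges_bins, 'rice': rice_bins,
                  'scott': scott_bins, 'fd': fd_bins}
    key = bins.lower()
    if key not in estimators:
        raise ValueError("%r is not a valid estimator for `bins`" % bins)
    return estimators[key](x)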

        break
else:
    # Maybe raise a Warning or something? Printing to stderr for now
    print("Automatic method '{}' not found, reverting to default".format(bins),
          file=sys.stderr)
Member

print statements should never be used. Raise ValueError here.

Member

Also, normally you check for valid input first. So:

if name.lower() not in optimalityMethods.keys():
    raise ValueError("%s not a valid method for `bins`" % name)

bins = estimator(a)

@rgommers
Member

As a bonus, it would be nice if the docstring had an example that used 'auto' and actually created a plot with it. There are examples for other functions that use matplotlib for plots that you could copy.
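Something along these lines would do; a minimal sketch (the data and seed here are made up, not taken from the PR):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(10)  # fixed seed keeps the plot deterministic
a = np.hstack((rng.normal(size=1000),
               rng.normal(loc=5, scale=2, size=1000)))
plt.hist(a, bins='auto')  # number of bins chosen by the estimator
plt.title("Histogram with 'auto' bins")
plt.show()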

@rgommers
Member

Some style comments, but overall +1 on this addition.

# measure of variability - Mean Absolute Deviation (mad)
iqr = mad(x)

if iqr > 0:
Contributor Author

I can't really do an if-elif block: I replace the value of iqr with the mad if the iqr is 0. If the mad is 0 too, then I return 1 as the number of bins; otherwise I calculate the FD estimator.
I did move the return 1 into the if iqr block, though.
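In code, the fallback chain being described is roughly this (a sketch of the logic, not the exact diff):

import numpy as np

def fd_bins(x):
    # Freedman-Diaconis with the fallback described above
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    if iqr == 0:
        # degenerate IQR: reuse the variable for the mean absolute deviation
        iqr = np.mean(np.abs(x - x.mean()))
        if iqr == 0:
            return 1  # all spread measures are zero, one bin suffices
    h = 2 * iqr / x.size ** (1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h))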

@nayyarv
Contributor Author

nayyarv commented Jul 3, 2015

Thanks for the feedback, I've put together the commits that should address the issues.

As it currently stands, _hist_bandwidth_estimator should not be used for the 2d or dd histograms; the estimators' binwidth suggestions depend on the number of dimensions (there is an implicit d=1). I have limited knowledge of optimal bin measures in higher dimensions, so I can't go beyond that just yet.

Some questions/thoughts

  1. Should I switch to a bunch of if-elif statements inside _hist_bandwidth_estimator as opposed to the current nested functions?
  2. Would it be better to put the information on the estimators in the Notes and provide a link to the Notes heading? As it is, I think the Parameters section has become too verbose and goes beyond the scope of the section.

@jaimefrio
Member

Shouldn't bandwidth be replaced throughout with binwidth?

@nayyarv
Contributor Author

nayyarv commented Jul 13, 2015

Changed all mentions of bandwidth to either binwidth or optimal_numbins to better reflect the nature of the function.

I've included a picture here of what the example in the docstring produces, so it doesn't have to be left to the imagination.
[image: histogram]

@seberg
Member

seberg commented Jul 13, 2015

Good to see this converge, but unfortunately we will need tests before we can merge anything.

The tests should go into numpy/numpy/lib/tests/test_function_base.py.
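For illustration, a test along these lines would fit there (a sketch, not the tests that were eventually merged):

import numpy as np
from numpy.testing import assert_, assert_equal

def test_automatic_bin_estimators():
    rng = np.random.RandomState(42)
    a = rng.normal(size=1000)
    for estimator in ['auto', 'fd', 'scott', 'rice', 'sturges']:
        hist, edges = np.histogram(a, bins=estimator)
        # every estimator should produce a usable, positive bin count
        assert_(hist.size > 0, "estimator %r returned no bins" % estimator)
        # and the counts must still sum to the sample size
        assert_equal(hist.sum(), a.size)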

@homu
Contributor

homu commented Jul 23, 2015

☔ The latest upstream changes (presumably #6100) made this pull request unmergeable. Please resolve the merge conflicts.

@nayyarv
Contributor Author

nayyarv commented Jul 23, 2015

I've dealt with the upstream changes and included a bunch of unit tests that cover the basic functionality of the new automated methods.
Furthermore, I squashed all the past commits into a single commit so it's not all over the place like before.

@homu
Contributor

homu commented Jul 26, 2015

☔ The latest upstream changes (presumably #6115) made this pull request unmergeable. Please resolve the merge conflicts.

@tacaswell
Contributor

Please ping me when this gets merged so the matplotlib docs can be updated.

@nayyarv force-pushed the master branch 2 times, most recently from 1b30a29 to 8eed69a on July 26, 2015 08:25
@ewmoore
Contributor

ewmoore commented Aug 14, 2015

I'm not sure we should be counting on plt.hist to do that. Is that documented somewhere? Also I think the example for a function should probably call that function.

On Fri, Aug 14, 2015 at 9:27 AM, Varun Nayyar [email protected] wrote:

Thanks @shoyer https://github.com/shoyer, I took your suggestions on board for the latest version of my PR. Other things that I did in this commit:

  1. The parameter section in the docstring was getting too large, so instead I moved the equations and explanations to the Notes while leaving a brief one-liner in Parameters for users.
  2. The example section now uses plt.hist since the method doesn't actually check input; it simply passes things on to np.histogram, which makes things a little simpler. I've also used a RandomState to make the plot deterministic, as suggested.
  3. Added some extra information to the assert error messages, as my test setup made it difficult to work out which particular method was failing. Also added a test to check behaviour with small values, since FD uses percentile, which has limited meaning when there is only one datapoint.

I also made sure to generate the docfiles to check it looks ok; I've included screenshots here of the relevant sections, as generated on my computer.
Parameters

[image: screenshot 2015-08-14 23 19 35]
https://cloud.githubusercontent.com/assets/1589119/9274669/facb2ff6-42da-11e5-8589-ca8ff230aa9c.png
Notes

[image: screenshot 2015-08-14 23 22 52]
https://cloud.githubusercontent.com/assets/1589119/9274732/6e290130-42db-11e5-8b39-95106c83f5d3.png
Examples

[image: screenshot 2015-08-14 23 15 53]
https://cloud.githubusercontent.com/assets/1589119/9274605/97cbec10-42da-11e5-9dc8-273d05dead51.png



@tacaswell
Contributor

@ewmoore From the mpl side I am counting on plt.hist to behave that way. We do not do any validation on the bins input and just pass it through to np.histogram. Once this is merged and released this behavior will be documented on our side.

Part of the history of this is that @nayyarv put a patch into mpl to add the auto logic and I suggested he try putting it in numpy instead 😉 .

@seberg
Member

seberg commented Aug 14, 2015

Well, I guess you could argue that the docs are a bit backward, but we have plots in examples, and frankly showing some plotting might be nice for the user and does give the right idea. I would be fine with additionally adding the same/similar example without plotting.
The plot example also still has two PEP8 style issues (spaces around = in the function call and two spaces before #).

In any case, I think since @shoyer had some close look now too, I am willing to put it in as is.

@nayyarv, however, there are two more real things left, sorry :(. Could you add the .. versionadded:: 1.11.0 tag [1]? np.take has an example; the empty lines are important.
Could you also include it in the release notes (they are in numpy/doc/release)? I think it is a neat new feature that we should mention in the improvements section.

[1] Or do we want to squeeze it into 1.10? I tend to think rather not, just out of principle, plus I somewhat hope 1.11 will be a fast one.

@tacaswell
Contributor

We are all friends here 😄 .

@seberg
Member

seberg commented Aug 14, 2015

Uh, what happened? I hope this still exists?

@nayyarv
Contributor Author

nayyarv commented Aug 14, 2015

Sorry, git mistake. I was trying to squash the changes into one commit and it ended up going south.

While I have you: by something additional, did you mean something like this

>>> heights, edges = np.histogram(a, bins='auto')
>>> heights
array([  5,  34, 135, 274, 333, 195,  94,  82,  98, 145, 143, 157, 121, 89,  50,  24,  16,   5])
>>> len(heights)
18

followed by the plot?

@seberg
Member

seberg commented Aug 14, 2015

Yeah, but I don't care much frankly, just pondering. Would seem fine, just leaving it is also fine. Or you could give the length of the heights returned, that shows quite obviously that the bin number was chosen somehow.

@nayyarv nayyarv reopened this Aug 14, 2015
…Users can now pass in bins='auto' (or 'scott', 'fd', 'rice', 'sturges') and have the corresponding rule-of-thumb estimator provide a decent estimate of the optimal number of bins for the given data.
@nayyarv
Contributor Author

nayyarv commented Aug 14, 2015

Alright, all put together now and squashed into one commit. I didn't put in the extra example; I think the plot covers the idea that the number of bins has been chosen automatically.

I'd be keen to squeeze this into 1.10 (considering how often I use histograms), but I'm not too fussy.

@njsmith
Member

njsmith commented Aug 14, 2015

Let's stick with the process and keep the 1.10 branch to bug fixes only. We have to make a cutoff somewhere, and it will always be tempting to try and sneak just one more thing past it; people have already started QA'ing the beta... 1.11 will be along soon enough.

@@ -140,6 +252,48 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
     4.

+    .. versionadded:: 1.11.0
Member

I would say putting it only with the keyword is sufficient (but frankly we have reached bikeshedding). The other option we use is putting it only in the Notes section. So I will put this in soon (unless someone beats me to it). Let's not start breaking rules again; rather, hope that 1.11 comes soon. I feel there may be some nice things coming up soon enough ;).

@nayyarv
Contributor Author

nayyarv commented Aug 15, 2015

No worries, @seberg, I'll wait for 1.11 then.

I didn't want the parameter section to be drowned out by discussion of the automatic methods (which is what happened originally), but I wanted users to instantly know what the differences were without jumping to the Notes. This seemed like a decent compromise. I'm happy for the docs to be updated at a later date when someone has a better way of going about things.

@seberg
Member

seberg commented Aug 15, 2015

OK, putting this in. If someone still finds some minor doc things or such, just open a new pull request. Thanks a lot @nayyarv for your patience, and everyone who helped reviewing!

seberg added a commit that referenced this pull request Aug 15, 2015
ENH: Automatic number of bins for np.histogram
@seberg seberg merged commit 6e8b869 into numpy:master Aug 15, 2015
@tacaswell
Contributor

From the mpl side, I would like to see this in 1.10 as it is something that we get asked for often, but I understand holding the line on feature freeze (even if I am really bad at it).


@nayyarv
Contributor Author

nayyarv commented Aug 16, 2015

Having had a look at the places this PR and the original mpl issue have been referenced, I have just realised the optimal bin methods don't consider weights or range at all. For example, if there were 1000 samples in total but only 100 inside range, the methods all assume the full data size and overestimate the number of bins.

Similarly with weighting: if there were 50 values with weight 20 each, a.size will be 50 instead of the actual 1000 the methods should be using.

I've gone through some quick fixes and updates here: master...nayyarv:autobinsWeighted, though it's made things a little more confusing since the functions are no longer standalone. I need to simplify things a bit and add tests before I make a PR - this is a heads up.

An aside:
np.percentile doesn't support weights, and neither does np.partition, which means making an O(n) np.percentile that supports weights requires a lot of work. I've written a simple one that uses np.sort and np.searchsorted, so it's about O(n log n) time and matches np.percentile(np.repeat(a, weights), q) exactly without the memory overheads (though it could be quicker for small weights). It still needs testing, though randomised testing hasn't thrown any errors yet.
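For the curious, the kind of approach described might look like the sketch below; weighted_percentile is a hypothetical name, and positive integer weights are assumed:

import numpy as np

def weighted_percentile(a, q, weights):
    # Emulates np.percentile(np.repeat(a, weights), q) in O(n log n),
    # without materialising the repeated array.
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights)
    order = np.argsort(a)
    a_sorted = a[order]
    cum_w = np.cumsum(weights[order])  # virtual index boundaries
    n = int(cum_w[-1])                 # effective sample size

    # np.percentile interpolates linearly at fractional rank q/100 * (n - 1)
    pos = q / 100.0 * (n - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, n - 1)

    # the value at virtual index j is the first datum whose cumulative
    # weight exceeds j
    left = a_sorted[np.searchsorted(cum_w, lo, side='right')]
    right = a_sorted[np.searchsorted(cum_w, hi, side='right')]
    return left + (pos - lo) * (right - left)

For example, weighted_percentile([2, 3], 50, [1, 2]) gives 3.0, matching np.percentile(np.repeat([2, 3], [1, 2]), 50).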

What would be the best solution to this?

  1. Include my O(n log n) function and wait for np.percentile to support weights?
  2. Use np.percentile(np.repeat(a, weights), q) and put warnings up?
  3. Disallow FD for weighted data and default to Scott (since the weighted standard deviation is easier to calculate)?

@seberg
Member

seberg commented Aug 16, 2015

First of all, is it really correct that if each sample has a weight of 20 you would multiply the number of samples by 20? Number 3 might be an option, though I dislike that weights=np.ones(...) would not give the same result.

@njsmith
Member

njsmith commented Aug 16, 2015

Are weights required to be integers, as the repeat(...) implementation assumes?

@shoyer
Member

shoyer commented Aug 16, 2015

In my experience, fractional weights are also somewhat common. They are useful for modeling discrete probability distributions.


@nayyarv
Contributor Author

nayyarv commented Aug 17, 2015

Well, the effective sample size is the sum of the weights, but this assumes the weights are counts, i.e. repeat(a, weights) would return the full sample. So in this case, a weight of 20 on each of 50 samples does give an effective sample size of 1000 (and the difference between n^(1/3) being ~4 or 10).

np.histogram as a function doesn't care what the weights are - it simply adds the appropriate weight value into the bin, which allows for things like negative, complex and fractional weights.

In terms of the estimators, they need to account for variability and size. When you have data like [2,3,4] and weights of [4,1,2], you have an effective sample size of 7, not 3, and should calculate accordingly. Weights of [0.3, 0.4, 0.3], [1+i, 2+3i, 4-i], or [-5, 1, 3] don't really tell us about effective sample size. I'm not sure what to do with something like [30.2, 29.8, 22.34].

Furthermore, using the weighted standard deviation or weighted percentiles also assumes the weights represent counts/frequencies of some kind.
For the sd, having a negative weight is identical to having the positive weight with the additive inverse of the x-value, and a complex weight sends the variability calculation into complex space. Fractional (0.3) or decimal (30.2) weights shouldn't be a problem for the weighted sd, as long as they're divided by the sum of the weights.
Similarly for percentile, it's a positional question: if the weights are negative, what does that mean when calculating a percentile? With x = [2,3] and weights [1, -2], what's the median? Even with decimal weights [1.2, 1.8], what's the median? It corresponds to a cdf at count 1.5, so is it 3, or some linear interpolation of 2 and 3? If the weights were [1,1] the median would be 2.5, and if the weights were [1,2] the median is clearly 3. Complex numbers don't help at all.

Since non-whole weights make no sense in the estimators, maybe restrict the optimal-bins calculation to weights that are positive integers (which should cover a large majority of use cases)?
Or maybe ignore non-whole weights and simply work with a and range, though in such use cases the number of bins / bin edges are likely predetermined anyway.
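For reference, the weighted standard deviation mentioned here is straightforward with np.average, assuming nonnegative count-like weights (a sketch, not PR code):

import numpy as np

def weighted_std(x, weights):
    # weights act as counts/frequencies; np.average normalises by their sum
    mean = np.average(x, weights=weights)
    return np.sqrt(np.average((x - mean) ** 2, weights=weights))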

@njsmith
Member

njsmith commented Aug 17, 2015

It wouldn't be the end of the world if bins="auto" gave an error for some otherwise valid weights= arguments. But handling at least fractional weights would be nice if there's a reasonable way; they are reasonably common. (And we should have some reasonable, tested behavior for all cases, even if that is just returning a nice error.)

@nayyarv
Contributor Author

nayyarv commented Aug 17, 2015

Fair enough. I can deal with fractional weighting by using a.size instead of weights.sum() for the sample size, and raise errors for unsupported weight types.

Though since np.percentile doesn't support weights, 'auto' should only call FD for unweighted data and Scott for weighted data (though like seberg, I'm not overly happy with this - maybe a quick check that np.all(weights == weights[0]) somewhere at the start?).

@seberg
Member

seberg commented Aug 17, 2015

Indeed, it is not as if "auto" is a default; it is just the suggested default (when bins is not given). I do not like the explosion of combinations/complexity, and I definitely do not want to guess what type of weights we have based on the data type!

Either we ignore the problem and document that weights are ignored for these (which is a bit dubious), or we just throw an error.

If someone needs more, I think the only option would be to expose the estimator functions and add different types of weights to them (i.e. aweights/fweights), and that might be more the type of thing for a statistics package to handle.
Trying to fit that into histogram itself seems like too much complexity for a function that does not need to care about where the weights came from.

Or we just pull them out of histogram completely again, expose the (more complicated) estimators somewhere and have matplotlib add the bins="method" syntactic sugar.
