ENH: Automatic number of bins for np.histogram #6029
@@ -84,11 +84,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     ----------
     a : array_like
         Input data. The histogram is computed over the flattened array.
-    bins : int or sequence of scalars, optional
+    bins : int or sequence of scalars, str, optional
formatting nitpick: "or str" instead of ", str"
            break
    else:
        # Maybe raise a Warning or something? Printing to stderr for now
        print("Automatic method '{}' not found, reverting to default".format(bins), file=sys.stderr)
print statements should never be used. Raise ValueError here.
Also, normally you check for valid input first. So:

    if name.lower() not in optimalityMethods.keys():
        raise ValueError("%s is not a valid method for `bins`" % name)
    bins = estimator(a)
As a bonus, it would be nice if the docstring had an example that used

Some style comments, but overall +1 on this addition.
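The validate-first pattern suggested above could look like the following sketch. The table name `_bin_estimators`, the helper names, and `choose_bins` are all illustrative, not the PR's actual identifiers:

```python
import numpy as np

# Hypothetical estimator table -- the names here are illustrative,
# not the functions actually defined in this PR.
def _sturges(x):
    return int(np.ceil(np.log2(x.size)) + 1)

def _rice(x):
    return int(np.ceil(2 * x.size ** (1.0 / 3)))

_bin_estimators = {'sturges': _sturges, 'rice': _rice}

def choose_bins(a, bins):
    """Resolve `bins` to an integer bin count."""
    a = np.asarray(a).ravel()
    if isinstance(bins, str):
        # Validate first, as suggested: fail loudly on an unknown name
        # instead of silently falling back to a default.
        if bins.lower() not in _bin_estimators:
            raise ValueError("%r is not a valid method for `bins`" % bins)
        return _bin_estimators[bins.lower()](a)
    return bins
```

With this shape, an unknown string reaches the user as a `ValueError` rather than a message on stderr.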
    # measure of variability - Mean Absolute Deviation (mad)
    iqr = mad(x)

    if iqr > 0:
Can't really do an if-elif block. I replace the value of iqr with mad if the iqr is 0. If the mad is 0 too, then I return 1 as the number of bins; otherwise I calculate the FD estimator. I did move the return 1 into the if iqr block though.
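The fallback just described can be sketched as a standalone function. This is a hypothetical reconstruction (`fd_bins` and `_mad` are illustrative names; in the PR the logic lives inside `histogram` itself):

```python
import numpy as np

def _mad(x):
    # Mean absolute deviation from the mean (one common reading of "mad")
    return np.mean(np.abs(x - np.mean(x)))

def fd_bins(x):
    """Freedman-Diaconis bin count with the fallback described above:
    substitute the MAD when the IQR is zero, and return a single bin
    when both spread measures are zero (degenerate data)."""
    x = np.asarray(x, dtype=float)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    if iqr == 0:
        iqr = _mad(x)          # fall back to another spread measure
    if iqr == 0:
        return 1               # all values identical: one bin
    width = 2 * iqr * x.size ** (-1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / width))
```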
Thanks for the feedback, I've put together the commits that should have addressed the issues. As it currently stands, _hist_bandwidth_estimator should not be used for the 2d or dd histograms: the estimators' binwidth suggestions are dependent on the number of dimensions (there is an implicit d=1). I have limited knowledge of higher-dimension optimal bin measures, so I can't go beyond that just yet. Some questions/thoughts
Shouldn't
Good to see this converge, but unfortunately we will need tests before we can merge anything. The tests should go into
☔ The latest upstream changes (presumably #6100) made this pull request unmergeable. Please resolve the merge conflicts.

I've dealt with the upstream changes and included a bunch of unit tests that test basic functionality of the new automated methods.

☔ The latest upstream changes (presumably #6115) made this pull request unmergeable. Please resolve the merge conflicts.

Please ping me when this gets merged so the matplotlib docs can be updated.
1b30a29 to 8eed69a (force-pushed)
I'm not sure we should be counting on plt.hist to do that. Is that
@ewmoore From the mpl side I am counting on

Part of the history of this is that @nayyarv put a patch into mpl to add the auto logic and I suggested he try putting it in numpy instead 😉.
Well, I guess you could argue that the docs are a bit backward, but we have plots in examples, and frankly showing some plotting might be nice for the user and does give the right idea. I would be fine with additionally adding the same/similar example without plotting. In any case, since @shoyer had a close look now too, I am willing to put it in as is. @nayyarv, however, there are two more real things left, sorry :(. Could you add the [1] Or do we want to squeeze it into 1.10? I tend to think rather not, just out of principle, plus I somewhat hope 1.11 will be a fast one.
We are all friends here 😄.

Uh, what happened? I hope this still exists?
Sorry, git mistake, I'm trying to squash the changes into a commit and it ended up going south. While I have you: by something additional, did you mean something like this

    >>> heights, edges = np.histogram(a, bins='auto')
    >>> heights
    array([  5,  34, 135, 274, 333, 195,  94,  82,  98, 145, 143, 157,
           121,  89,  50,  24,  16,   5])
    >>> len(heights)
    18

followed by the plot?
Yeah, but I don't care much frankly, just pondering. Would seem fine, just leaving it is also fine. Or you could give the length of the heights returned, that shows quite obviously that the bin number was chosen somehow.
…Users can now pass in bins='auto' (or 'scott', 'fd', 'rice', 'sturges') and have the corresponding rule-of-thumb estimator provide a decent estimate of the optimal number of bins for the given data.
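The four rules of thumb named in the commit message have standard textbook forms; a sketch follows. The function names are mine, and the PR's internal helpers may differ (for instance in how they guard against zero spread):

```python
import numpy as np

# Textbook rule-of-thumb bin estimators (sketch, not the PR's code).

def sturges(x):
    # Bin count directly: log2(n) + 1
    return int(np.ceil(np.log2(x.size)) + 1)

def rice(x):
    # Bin count directly: 2 * n^(1/3)
    return int(np.ceil(2 * x.size ** (1.0 / 3)))

def scott(x):
    # Bandwidth rule: width h = 3.5 * sigma * n^(-1/3)
    h = 3.5 * np.std(x) * x.size ** (-1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h)) if h > 0 else 1

def fd(x):
    # Freedman-Diaconis bandwidth rule: width h = 2 * IQR * n^(-1/3)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    h = 2 * iqr * x.size ** (-1.0 / 3)
    return int(np.ceil((x.max() - x.min()) / h)) if h > 0 else 1
```

Sturges and Rice depend only on the sample size, while Scott and Freedman-Diaconis are bandwidth rules that also account for the spread of the data, which is why the latter need a fallback for zero-variance input.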
Alright, all put together now and squashed into 1 commit. I didn't put in the extra example; I think the plot covers the idea that the number of bins has been chosen automatically. I'd be keen to squeeze this into 1.10 (considering how often I use histograms), but I'm not too fussy.
Let's stick with the process and keep the 1.10 branch to bug fixes only. We
@@ -140,6 +252,48 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
     second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
     4.

+    .. versionadded:: 1.11.0
I would say putting it only with the keyword is sufficient (but frankly we have reached bikeshedding). The other option we use is putting it only in the Notes section. So I will put this in soon (unless someone beats me to it). Let's not start breaking rules again; rather, hope that 1.11 is soon. I feel there may be some nice things coming up soon enough ;).
No worries, @seberg, I'll wait for 1.11 then. I didn't want the parameter section to be drowned out by discussion of the automatic methods (which is what happened originally), but I wanted users to instantly know what the differences were without jumping to Notes. This seemed like a decent compromise. I'm happy to let the docs be updated at a later date when someone has a better way of going about things.
OK, putting this in. If someone still finds some minor doc things or such, just open a new pull request. Thanks a lot @nayyarv for your patience and everyone who helped reviewing!
From the mpl side, I would like to see this in 1.10 as it is something that
Having had a look at the places this PR/original mpl issue have been referenced in, I have just realised the optimal bin methods don't even consider

Similar with weighting - if there were 50 values with weight 20 each, the

I've gone through some quick-fixes and updates here: master...nayyarv:autobinsWeighted

An aside: what would be the best solution to this?
First of all, is it really correct that if you have a weight of 20 each you would multiply the number of samples by 20? Number 3 might be an option, though I dislike that
Are weights required to be integers, as the repeat(...) implementation
In my experience, fractional weights are also somewhat common. They are useful for modeling discrete probability distributions.
Well, the effective sample size is the sum of the weights, but this assumes the weights are counts. I.e.
In terms of the estimators - they need to account for variability and size. When you have data like [2, 3, 4] and weights of [4, 1, 2], you have an effective sample size of 7, not 3, and should calculate accordingly. Having weights of [0.3, 0.4, 0.3], or [1+i, 2+3i, 4-i], or [-5, 1, 3] doesn't really tell us about effective sample size. I'm not sure what to do with something like [30.2, 29.8, 22.34]. Furthermore, using the weighted standard deviation or weighted percentiles also assumes weights represent counts/frequency of some kind. Since non-whole weights make no sense in the estimators, maybe restrict the optimal bins calculation to weights that are positive integers (which should be a large majority of use cases?)
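The counts interpretation in the comment above can be made concrete with np.repeat. This is a sketch of the discussion, not code from the PR:

```python
import numpy as np

# With integer frequency weights, repeating each value by its weight
# yields the expanded sample the estimators implicitly assume.
values = np.array([2, 3, 4])
weights = np.array([4, 1, 2])

n_eff = weights.sum()                  # effective sample size: 7, not 3
expanded = np.repeat(values, weights)  # array([2, 2, 2, 2, 3, 4, 4])

# Spread measures computed on the expanded data now reflect the weights:
iqr = np.percentile(expanded, 75) - np.percentile(expanded, 25)
```

This expansion only makes sense for non-negative integer weights, which is exactly why fractional or negative weights break the "effective sample size" reasoning.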
It wouldn't be the end of the world if bins="auto" gave an error for some
Fair enough, I can deal with fractional weighting by using

Though since
Indeed, it is not like

Either we ignore the problem and document that weights are ignored for these (which is a bit dubious), or we just throw an error. If someone needs more, I think the only option would be to expose the estimator functions and add different types of weights to them (i.e. aweights/fweights), and that might be more the type of thing for a statistics package to handle. Or we just pull them out of histogram completely again, expose the (more complicated) estimators somewhere, and have matplotlib add the
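The "just throw an error" option discussed above could be sketched as a thin wrapper. This is hypothetical (`histogram_auto` is not part of the PR); as it happens, later NumPy releases do reject string `bins` combined with `weights` inside `np.histogram` itself:

```python
import numpy as np

def histogram_auto(a, bins=10, weights=None):
    # Sketch of the "raise an error" option: refuse string estimators
    # when weights are supplied, since the rules of thumb ignore them.
    if isinstance(bins, str) and weights is not None:
        raise TypeError("Automated bin estimation is not supported "
                        "for weighted data")
    return np.histogram(a, bins=bins, weights=weights)
```

Raising keeps the behaviour honest: the user is told the estimator cannot account for their weights, rather than getting bins computed as if the data were unweighted.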
Hi,
Brought this up on the mailing lists, and got some support. (It then turned into a discussion of the p-Square algorithm for dynamic histogram generation). I also had some support when I originally brought it up with the matplotlib guys (where I had initially planned to put it), you can find the thread here
I've added support for methods to automatically choose the number of bins based on the data provided. The default signature remains the same, however users can now pass 'auto' or other strings to have an optimal number of bins chosen. This is especially useful for visualisation libraries.
An (out of date) notebook with my first code attempts can be found here. It discusses the reasoning behind the methods + justification and samples. I've tried to make it slightly simpler since then.
I've provided the implementation which defines functions within the histogram function as it allows for easy refactoring and I didn't want to put another 5 functions inside the base library that have no use anywhere else. The code is hopefully easy to refactor/change to better fit numpy's style guide/organisation.
I've included the docstring update as well, with some notes on the different methods and a quick example on how to use the new call. I've tested the code and it seems to work as expected so far. Passing in bins='s' does lead to choosing the sturges estimator (as opposed to the scott estimator), but I can't say for sure what will happen.
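On the bins='s' remark above: that behaviour would arise if the lookup matched method names by prefix. A hypothetical illustration (not the PR's code) of why a one-letter string is ambiguous:

```python
# Hypothetical prefix lookup, shown only to make the 's' case concrete.
methods = ['scott', 'fd', 'rice', 'sturges']

def prefix_matches(name):
    """Return every estimator name the given string is a prefix of."""
    return [m for m in methods if m.startswith(name.lower())]

# 's' matches two estimators, so which one "wins" depends on iteration
# order rather than user intent; exact-name matching avoids this.
```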