np.histogram_bin_edges not returning expected bin width for argument bin = 'fd' #18319

Closed
jamiebarker0310 opened this issue Feb 3, 2021 · 6 comments

@jamiebarker0310
Contributor

Hey,

I was playing about with histogram bin widths recently and tried to test my implementation of the Freedman-Diaconis estimator by checking it against np.histogram_bin_edges. The bin width I calculate myself differs from the one NumPy produces, and I suspect some rounding is going on. I just want to check whether that is intentional.

I suspect rounding because my results match when the length of my data is a cube number, whereas in other cases they differ (see example below).

Apologies if this is expected behaviour or there is a bug in my method.

Reproducing code example:

import numpy as np 

def freedmanDiaconus(data):
    # number of data points
    n = len(data)
    # first and third quartiles
    x_q1, x_q3 = np.percentile(data, [25, 75])
    # interquartile range
    x_iqr = x_q3 - x_q1
    # Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)
    freedman_diaconus = 2*x_iqr*n**(-1/3)
    return freedman_diaconus

x = [1,2,3,4,5,6,7,8,9]

np_bins = np.histogram_bin_edges(x, bins='fd')
np_bin_width = np_bins[1] - np_bins[0]
fd_width = freedmanDiaconus(x)
print(fd_width, np_bin_width)

Output:

3.8459988541530894 2.6666666666666665

NumPy/Python version information:

1.19.5 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0]

@madphysicist
Contributor

madphysicist commented Feb 4, 2021

The code for the Freedman Diaconis binwidth estimator is here:

def _hist_bin_fd(x, range):

Aside from cruft, the function is two lines long:

iqr = np.subtract(*np.percentile(x, [75, 25]))
return 2.0 * iqr * x.size ** (-1.0 / 3.0)

This is pretty much identical to your method, and yields the same result. However, some additional magic happens under the hood in _get_bin_edges to turn this width into bin edges:

  1. First, we divide the range by the width and round the resulting bin count up to the nearest integer, so that the bins fit the range exactly:

    n_equal_bins = int(np.ceil(_unsigned_subtract(last_edge, first_edge) / width))

    This is equivalent to int(np.ceil(np.subtract(9, 1) / 3.8459988541530894)) in your case, so the result is 3 bins. Given that the original bin width would require 2.080083823051904 bins to fit across the range, this seems reasonable.

  2. Then we generate the edges to fill the range exactly:

    bin_edges = np.linspace(

As you can see, the rounding step is responsible for the difference in results. Your observation is correct: the code computes the same width as yours and then applies these additional transformations. We could argue ad nauseam about rounding versus rounding up or rounding down, but the result would not match the optimal bin width in general. One other alternative I can think of is keeping the optimal width and setting the start/end points to fully contain the range. This would bias the edge bins, however, which is generally undesirable.
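The two steps can be reproduced in a few lines. This is only a simplified sketch of the logic quoted above, not NumPy's actual implementation: fd_width and fd_edges_sketch are hypothetical names, and the sketch assumes non-empty data with a nonzero IQR and no explicit range argument.

```python
import numpy as np

def fd_width(data):
    """Freedman-Diaconis bin width estimate: 2 * IQR * n**(-1/3)."""
    data = np.asarray(data, dtype=float)
    q75, q25 = np.percentile(data, [75, 25])
    return 2.0 * (q75 - q25) * data.size ** (-1.0 / 3.0)

def fd_edges_sketch(data):
    """Hypothetical helper mirroring the two steps described above.

    Assumes non-empty data and a nonzero estimated width.
    """
    data = np.asarray(data, dtype=float)
    first_edge, last_edge = data.min(), data.max()
    width = fd_width(data)
    # Step 1: round the bin count up so the bins cover the range exactly.
    n_equal_bins = int(np.ceil((last_edge - first_edge) / width))
    # Step 2: spread n_equal_bins + 1 edges evenly across the range.
    return np.linspace(first_edge, last_edge, n_equal_bins + 1)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
edges = fd_edges_sketch(x)
print(fd_width(x))          # estimated width
print(edges[1] - edges[0])  # actual width after rounding the bin count up
print(np.allclose(edges, np.histogram_bin_edges(x, bins="fd")))
```

For the data in the report, the estimate is about 3.846, the bin count is rounded from about 2.08 up to 3, and the actual width becomes 8 / 3, which is the 2.667 that np.histogram_bin_edges returns.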

@jamiebarker0310
Contributor Author

Hey, thanks for getting back to me! Your explanation makes a lot of sense.

I see what the code is doing now, and it makes sense. I guess I just find it surprising that when you specify a bin width method, the resulting bin width is not what the formula in the documentation gives. However, if the code is behaving as expected, I guess it's best to close the issue.

@madphysicist
Contributor

It may be worth documenting this somewhere. A sentence like "The actual number of bins is always chosen to divide the range into an integer number of bins that is at least as large as the estimate.", or something to that effect, in the histogram documentation. Would you like to open a PR to include that, or would you like me to do it?
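A quick empirical check of that sentence might look like the following. It is a sketch under stated assumptions: the seed, the sample sizes and the normally distributed float data are arbitrary choices for illustration, and the estimate is recomputed by hand from the formula quoted earlier rather than taken from NumPy internals.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
results = []
for n in (10, 50, 200, 1000):   # arbitrary sample sizes
    sample = rng.normal(size=n)
    # Freedman-Diaconis estimate, recomputed by hand.
    q75, q25 = np.percentile(sample, [75, 25])
    estimate = 2.0 * (q75 - q25) * n ** (-1.0 / 3.0)
    # Bin edges as NumPy actually produces them.
    edges = np.histogram_bin_edges(sample, bins="fd")
    n_bins = len(edges) - 1
    actual = edges[1] - edges[0]
    # Fractional number of bins the raw estimate would need to span the range.
    ideal_bins = (sample.max() - sample.min()) / estimate
    results.append((n, n_bins, ideal_bins, actual, estimate))
    print(n, n_bins, round(ideal_bins, 3), round(actual, 4), round(estimate, 4))
```

In each row the integer bin count should be at least the fractional count implied by the estimate, so the actual width should never exceed the estimated one.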

@jamiebarker0310
Contributor Author

I can give that a go - thanks!

@madphysicist
Contributor

madphysicist commented Feb 10, 2021

You can now close this issue. I suspect it's going to be a handy source/reference for places like Stack Overflow. Nice work!

@jamiebarker0310
Contributor Author

Cool, thanks for the help!
