np.histogram_bin_edges not returning expected bin width for argument bin = 'fd' #18319

jamiebarker0310 · 2021-02-03T20:29:44Z

Hey,

I was experimenting with histogram bin widths recently and tested my implementation of the Freedman Diaconis estimator against np.histogram_bin_edges. The bin width I calculate differs from the one NumPy returns, so I suspect there is some rounding going on, and I just want to check whether it is intentional or not.

I suspect rounding because my results match when the length of my data is a cube number. In other cases my answers differ (see example below).

Apologies if this is expected behaviour or there is a bug in my method.

Reproducing code example:

import numpy as np

def freedmanDiaconus(data):
    # number of data points
    n = len(data)
    # calculate quartiles
    x_q1, x_q3 = np.percentile(data, [25, 75])
    # calculate IQR
    x_iqr = x_q3 - x_q1
    # Freedman Diaconis bin width: 2 * IQR * n^(-1/3)
    return 2 * x_iqr * n ** (-1 / 3)

x = [1,2,3,4,5,6,7,8,9]

np_bins = np.histogram_bin_edges(x, bins='fd')
np_bin_width = np_bins[1] - np_bins[0]
fd_width = freedmanDiaconus(x)
print(fd_width, np_bin_width)

Output:

3.8459988541530894 2.6666666666666665

NumPy/Python version information:

1.19.5 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0]


madphysicist · 2021-02-04T15:18:15Z

The code for the Freedman Diaconis binwidth estimator is here:

numpy/numpy/lib/histograms.py

Line 199 in f36e940

def _hist_bin_fd(x, range):

Aside from cruft, the function is two lines long:

iqr = np.subtract(*np.percentile(x, [75, 25]))
return 2.0 * iqr * x.size ** (-1.0 / 3.0)

This is pretty much identical to your method, and yields the same result. However, some additional magic happens under the hood in _get_bin_edges to turn this width into bin edges:

First, we round the number of bins up to the nearest integer, so that equal-width bins fit the range exactly:

numpy/numpy/lib/histograms.py

Line 411 in f36e940

n_equal_bins = int(np.ceil(_unsigned_subtract(last_edge, first_edge) / width))

This is equivalent to int(np.ceil(np.subtract(9, 1) / 3.8459988541530894)) in your case, so the result is 3 bins. Given that the original bin width would require 2.080083823051904 bins to fit across the range, this seems reasonable.
Then we generate the edges to fill the range exactly:

numpy/numpy/lib/histograms.py

Line 446 in f36e940

bin_edges = np.linspace(

As you can see, it is the rounding step that is responsible for the difference in results. So your observation of the difference is correct, but the code is using the same result as yours along with some additional transformations. We could argue ad nauseam about rounding to nearest vs rounding up or rounding down, but the result would never match the optimal bin width in any case. One other alternative I can think of is keeping the optimal width, and setting the start/end points to fully contain the range. This would result in biases of the edge bins, however, which is generally undesirable.
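The three steps described above can be sketched end to end for the data in the report. This is a simplified illustration rather than the NumPy source: the variable names are made up for the example, and it ignores the special cases (empty input, zero width, explicit range) that the real _get_bin_edges handles.

```python
import numpy as np

x = np.arange(1, 10)  # the data from the report: 1..9

# Step 1: raw Freedman Diaconis width, 2 * IQR * n^(-1/3)
q75, q25 = np.percentile(x, [75, 25])
fd_width = 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)

# Step 2: round the bin count up so the bins cover the whole range
first_edge, last_edge = x.min(), x.max()
n_bins = int(np.ceil((last_edge - first_edge) / fd_width))

# Step 3: spread n_bins equal-width bins over the range
edges = np.linspace(first_edge, last_edge, n_bins + 1)

# Compare against what NumPy returns for the same data
np_edges = np.histogram_bin_edges(x, bins='fd')
print(fd_width, n_bins, edges[1] - edges[0])
print(np.allclose(edges, np_edges))
```

The printed width of the final bins is the range divided by the rounded-up bin count, which is why it comes out smaller than the raw estimate.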

jamiebarker0310 · 2021-02-04T18:02:47Z

Hey, thanks for getting back to me! Your explanation makes a lot of sense.

I see what the code is doing now, and it makes sense. I guess I just find it surprising that if you specify the bin width method, the produced bin width is not what the formula in the documentation gives. However, if the code is behaving as expected I guess it's best to close the issue.

madphysicist · 2021-02-04T19:21:29Z

It may be worth documenting this somewhere. A sentence like "The actual number of bins is always chosen to divide the range into an integer number of bins that is at least as large as the estimate.", or something to that effect in histogram. Would you like to open a PR to include that, or would you like me to do it?
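A quick way to see the consequence of that sentence is to compare the raw estimate with the width NumPy actually uses for a few sample sizes. This is only a sketch: fd_estimate is a hypothetical helper written for this example, and the random normal data and seed are arbitrary choices.

```python
import numpy as np

def fd_estimate(x):
    # Hypothetical helper: raw Freedman Diaconis width, 2 * IQR * n^(-1/3)
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
results = []
for n in (10, 50, 100, 1000):
    x = rng.normal(size=n)
    edges = np.histogram_bin_edges(x, bins='fd')
    actual = edges[1] - edges[0]
    estimate = fd_estimate(x)
    results.append((n, len(edges) - 1, estimate, actual))
    print(n, len(edges) - 1, estimate, actual)
```

Because the bin count is rounded up, the actual width should never exceed the estimate; the two coincide only when the range happens to be an exact multiple of the estimate.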

jamiebarker0310 · 2021-02-04T19:40:45Z

I can give that a go - thanks!

madphysicist · 2021-02-10T19:48:32Z

You can now close this issue. I suspect it's going to be a handy source/reference for places like Stack Overflow. Nice work!

jamiebarker0310 · 2021-02-10T19:49:39Z

Cool, thanks for the help!

madphysicist mentioned this issue Feb 4, 2021

DOC: add links to polynomial function/class listing #18320

Merged

jamiebarker0310 mentioned this issue Feb 6, 2021

DOC: Added sentence to docstring of histogram_bin_edges to explain bin width #18344

Merged

jamiebarker0310 closed this as completed Feb 10, 2021
