[Bug]: Gaps and overlapping areas between bins when using float16 #22622

Closed
G-nn-r opened this issue Mar 8, 2022 · 9 comments · Fixed by #23047

Comments

@G-nn-r

G-nn-r commented Mar 8, 2022

Bug summary

When creating a histogram out of float16 data, the bins are also calculated in float16. The lower precision can cause two errors:

  1. Gaps between certain bins.
  2. Two neighboring bins overlap each other (only visible when alpha < 1)

Code for reproduction

import numpy as np
import matplotlib.pyplot as plt
values = np.clip(np.random.normal(0.5, 0.3, size=1000), 0, 1).astype(np.float16)
plt.hist(values, bins=100, alpha=0.5)
plt.show()

Actual outcome

[Image: float16 histogram]

Expected outcome

[Image: float32 histogram]

Created by:
plt.hist(values.astype(np.float32), bins=100, alpha=0.5)
plt.show()

Additional information

Possible solution
Calculate the bins in float32:

  • Determine minimal and maximal value in float16.
  • Convert min and max to float32.
  • Calculate the bin edges.
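
The three steps above could be sketched as follows. This is only an illustration of the proposal, not matplotlib's actual fix: the helper name `float32_bin_edges` is hypothetical, and a seeded generator stands in for `np.random.normal` so that the run is repeatable.

```python
import numpy as np


def float32_bin_edges(values, bins):
    """Hypothetical helper: float32 bin edges for float16 data."""
    # Step 1: min and max are taken on the float16 array itself,
    # so the data is never copied to a wider dtype.
    lo, hi = values.min(), values.max()
    # Step 2: convert only the two scalars to float32 (an exact conversion).
    lo32, hi32 = np.float32(lo), np.float32(hi)
    # Step 3: compute the bins + 1 edges in float32.
    return np.linspace(lo32, hi32, bins + 1, dtype=np.float32)


rng = np.random.default_rng(0)
values = np.clip(rng.normal(0.5, 0.3, size=1000), 0, 1).astype(np.float16)

edges = float32_bin_edges(values, 100)
counts, _ = np.histogram(values, bins=edges)
print(edges.dtype, len(edges), counts.sum())
```

The resulting `edges` array can then be passed as `bins=edges` to `plt.hist`, leaving `values` in float16.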

Theoretically possible, but unwanted solution
Convert data into float32 before calculating the histogram. This behavior does not make a lot of sense, as float16 is mostly used because of memory limitations (arrays with billions of values can easily take several gigabytes).

Operating system

Windows 10

Matplotlib Version

3.4.3

Matplotlib Backend

TkAgg

Python version

3.7.1

Jupyter version

No response

Installation

pip

@timhoffm
Member

timhoffm commented Mar 8, 2022

To be checked: Can the same effect occur when using (numpy) int arrays?

@greglucas
Contributor

Just a note that np.histogram(float16) returns float16 edges.

You may want to try using "stairs" here instead, which won't draw the bars all the way down to zero and helps avoid those artifacts.
plt.stairs(*np.histogram(values, bins=100), fill=True, alpha=0.5)

@oscargus
Member

oscargus commented Mar 8, 2022

I am not sure, but it seems like possibly a problem in NumPy.

In [9]: cnt, bins = np.histogram(values, 100)

In [10]: bins
Out[10]: 
array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
       0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
       0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
       0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
       0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54,
       0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65,
       0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76,
       0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87,
       0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98,
       0.99, 1.  ], dtype=float16)

In [11]: np.diff(bins)
Out[11]: 
array([0.01    , 0.01    , 0.009995, 0.01001 , 0.00998 , 0.01001 ,
       0.01001 , 0.01001 , 0.01001 , 0.00995 , 0.01001 , 0.01001 ,
       0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 ,
       0.00989 , 0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 ,
       0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 ,
       0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 ,
       0.01001 , 0.009766, 0.01001 , 0.01001 , 0.01001 , 0.01001 ,
       0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 , 0.01001 ,
       0.01001 , 0.01001 , 0.009766, 0.010254, 0.009766, 0.010254,
       0.009766, 0.010254, 0.009766, 0.010254, 0.009766, 0.010254,
       0.009766, 0.010254, 0.009766, 0.010254, 0.009766, 0.010254,
       0.009766, 0.010254, 0.009766, 0.010254, 0.009766, 0.010254,
       0.009766, 0.010254, 0.009766, 0.009766, 0.010254, 0.009766,
       0.010254, 0.009766, 0.010254, 0.009766, 0.010254, 0.009766,
       0.010254, 0.009766, 0.010254, 0.009766, 0.010254, 0.009766,
       0.010254, 0.009766, 0.010254, 0.009766, 0.010254, 0.009766,
       0.010254, 0.009766, 0.010254, 0.009766], dtype=float16)

It looks like the diff is not really what is expected.
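
One way to read these numbers: between 0.5 and 1, neighbouring float16 values are 2**-11 (about 0.000488) apart, so a bin width of 0.01 cannot be represented and each edge gets rounded to the nearest representable value. The sketch below assumes the edges behave like `np.linspace(0, 1, 101)` rounded to float16, which may differ in detail from what `np.histogram` computes internally. It expresses the bin widths in the upper half as multiples of that spacing.

```python
import numpy as np

# Assumption: edges equal to linspace(0, 1, 101) rounded to float16.
edges16 = np.linspace(0, 1, 101).astype(np.float16)

# Spacing of float16 values in [0.5, 1): half the machine epsilon, 2**-11.
ulp = float(np.finfo(np.float16).eps) / 2

# Work in float64 from here on so the differences are exact.
upper = edges16[edges16 >= 0.5].astype(np.float64)
steps = np.diff(upper) / ulp

print("spacing:", ulp)
print("bin widths in units of the spacing:", np.unique(steps))
print("as absolute widths:", np.unique(steps) * ulp)
```

Under this assumption the widths come out as 20 or 21 spacings, that is 0.009765625 or 0.01025390625, which are the two values that alternate in the second half of the `np.diff(bins)` output above.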

@oscargus
Member

oscargus commented Mar 8, 2022

I am actually a bit doubtful if the bins are really float16 here though. I guess they are, since it is float16, not bfloat16.

@oscargus
Member

oscargus commented Mar 8, 2022

It is possible to trigger it with quite high probability using three bins, so that may be an easier case to debug (second and third bar overlap). Bin edges and diff seem to be the same regardless of whether the overlap occurs.

In [44]: bins
Out[44]: array([0.    , 0.3333, 0.6665, 1.    ], dtype=float16)

In [45]: np.diff(bins)
Out[45]: array([0.3333, 0.3333, 0.3335], dtype=float16)

@oscargus
Member

oscargus commented Mar 8, 2022

There is an overlap in the plot data (so it is not caused by the actual plotting, possibly rounding the wrong way):

In [98]: bc.patches[1].get_corners()
Out[98]: 
array([[3.33251953e-01, 0.00000000e+00],
       [6.66992188e-01, 0.00000000e+00],
       [6.66992188e-01, 4.05000000e+02],
       [3.33251953e-01, 4.05000000e+02]])

In [99]: bc.patches[2].get_corners()
Out[99]: 
array([[  0.66601562,   0.        ],
       [  0.99951172,   0.        ],
       [  0.99951172, 314.        ],
       [  0.66601562, 314.        ]])

As the second bar ends at 6.66992188e-01 and the third bar starts at 0.66601562, the two bars overlap.

@oscargus
Member

oscargus commented Apr 3, 2022

A possibly easy way to solve this is to provide a keyword argument to bar/barh that makes sure the bars are always adjacent, i.e., that lets bar/barh know that each bar should start where the previous bar ends. That keyword argument can then be passed from hist when rwidth is 1.
This is probably the line causing the error:

left = x - width / 2

Something like np.diff(np.cumsum(x) - width/2) may work, but should then only be conditionally executed if the keyword argument is set.

(Then, I am not sure to what extent np.diff and np.cumsum are 100% numerically invariant, it is not trivial under floating-point arithmetic. But probably this will reduce the probability of errors anyway.)
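
To illustrate why `left = x - width / 2` is a plausible culprit, here is a NumPy-only sketch of the round trip in float16. It assumes a simplified model in which each bar centre is computed as edge + width/2 and the left edge is then recovered by subtracting width/2 again. matplotlib performs further conversions, so the exact corner values it produces can differ from the ones printed here.

```python
import numpy as np

# Three-bin edges as in the earlier comment, stored in float16.
edges = np.array([0, 1 / 3, 2 / 3, 1]).astype(np.float16)
width = np.diff(edges)                  # float16 bin widths
half = width / np.float16(2)

# Assumed model: centres go in, left edges are recovered from them.
centres = edges[:-1] + half             # rounded to float16
left = centres - half                   # rounded to float16 again
right = left + width

for i in range(3):
    print(i, float(edges[i]), float(left[i]), float(right[i]))

print("gap between bars 0 and 1:", bool(right[0] < left[1]))
print("overlap of bars 1 and 2:", bool(right[1] > left[2]))
```

In this model the recovered left edges no longer match the original bin edges, which yields both a gap and an overlap from the same three edges.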

@oscargus
Member

oscargus commented Apr 3, 2022

To be checked: Can the same effect occur when using (numpy) int arrays?

Yes and no. As the int array will become a float64 after multiplying with a float (dr in the code), it is quite unlikely to happen. However, it is not theoretically impossible to obtain the same effect with float64, although not very likely that it will actually be seen in a plot (the accumulated numerical error should correspond to something close to half(?) a pixel). But I am quite sure that one can trigger this by trying.
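
The dtype behaviour described here can be checked in isolation. In the sketch below, `dr = 1.0` is only a stand-in for the Python float factor mentioned above, and the arrays are made-up examples rather than anything taken from matplotlib.

```python
import numpy as np

dr = 1.0  # stand-in for the Python float factor mentioned above

int_widths = np.arange(1, 4) * dr                  # integer input
f16_widths = np.ones(3, dtype=np.float16) * dr     # float16 input

# Integer data is promoted to float64, float16 data stays float16.
print(int_widths.dtype)
print(f16_widths.dtype)

# np.histogram likewise returns float64 edges for integer data.
_, int_edges = np.histogram(np.arange(10), bins=3)
print(int_edges.dtype)
```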

@jklymak
Member

jklymak commented Apr 4, 2022

If you force the bins to be float64, then you won't have this problem:

import numpy as np
import matplotlib.pyplot as plt
values = np.clip(np.random.normal(0.5, 0.3, size=1000), 0, 1).astype(np.float16)
n, bins = np.histogram(values, bins=100)
n, bins, patches = plt.hist(values, bins=np.array(bins, dtype='float64'), alpha=0.5)

plt.show()

so I think the reasonable fix here is simply for matplotlib to coerce the output from np.histogram to be floats - the output is turned to float64 when rendered anyways, and the extra memory for any visible number of bins is not going to matter.
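
A small check of why coercing the edges helps, under the same simplified centre/left-edge model as before (an assumption, not matplotlib's exact code path). The edges here are `np.linspace(0, 1, 101)` rounded to float16, used as a stand-in for the float16 output of `np.histogram`.

```python
import numpy as np

# Stand-in for float16 edges (assumption: linspace rounded to float16).
edges16 = np.linspace(0, 1, 101).astype(np.float16)
edges64 = np.asarray(edges16, dtype=np.float64)    # the proposed coercion

width = np.diff(edges64)
centres = edges64[:-1] + width / 2
left = centres - width / 2
right = left + width

# Each bar starts at its bin edge and ends where the next bar starts.
print(np.array_equal(left, edges64[:-1]))
print(np.array_equal(right[:-1], left[1:]))
```

The edge values still carry their float16 rounding, so the bin widths remain slightly unequal, but the float64 arithmetic on them is exact in this example and the bars end up adjacent.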

@QuLogic QuLogic added this to the v3.6.0 milestone May 16, 2022
6 participants