Behavior of hist() with normed=True changes from v2.0 to v2.1 #9557

Closed
jakevdp opened this issue Oct 24, 2017 · 28 comments
Labels
Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions.

@jakevdp
Contributor

jakevdp commented Oct 24, 2017

Found in the context of astropy/astropy#6786

When hist() is passed irregular bins with normed=True, the output is different between matplotlib 2.0 and 2.1. Here is a test script to reproduce the issue:

# Python 3.6
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
rng = np.random.RandomState(829)

t = np.concatenate([rng.randn(100),
                    2 + 0.1 * rng.randn(100),
                    5 + 3 * rng.randn(100)])
plt.hist(t, bins=[-5, -3, -2, -1, -0.5, 0, 2, 4, 5, 10], normed=True)
plt.title(f'matplotlib v{matplotlib.__version__}')
plt.savefig(f'hist-{matplotlib.__version__}.png')
plt.show()

hist-2.0.2.png:
[image: hist-2.0.2.png]

hist-2.1.0.png:
[image: hist-2.1.0.png]

@tacaswell tacaswell added this to the v2.1.1 milestone Oct 24, 2017
@tacaswell tacaswell added the Release critical For bugs that make the library unusable (segfaults, incorrect plots, etc) and major regressions. label Oct 24, 2017
@tacaswell
Member

That seems very very bad.

If you do it 'by hand' which one is correct?

@jklymak
Member

jklymak commented Oct 24, 2017

I think ax.hist was changed to just pass through to np.histogram, so if the "new" way is wrong, either the plotting is wrong or numpy is wrong....

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

The 2.0.2 result matches what I would compute by hand:

# compute normalized heights by hand
heights, bins = np.histogram(t, bins=[-5, -3, -2, -1, -0.5, 0, 2, 4, 5, 10])
bin_widths = bins[1:] - bins[:-1]
normed_heights = heights / bin_widths / heights.sum()
bin_centers = 0.5 * (bins[1:] + bins[:-1])

# compare to normed hist output
plt.hist(t, bins=[-5, -3, -2, -1, -0.5, 0, 2, 4, 5, 10], normed=True)
plt.plot(bin_centers, normed_heights, 'ok');
plt.title(f'matplotlib v{matplotlib.__version__}')

[image: download-1]
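In symbols, the by-hand normalization in the snippet above can be written as follows. This only restates the code: $n_i$ is the count in bin $i$, $\Delta_i$ is that bin's width, and $N = \sum_j n_j$ is the number of samples that fall inside the bin range. These symbols are introduced here for illustration and do not appear in the original report.

```latex
h_i = \frac{n_i}{N\,\Delta_i},
\qquad
\sum_i h_i\,\Delta_i = \frac{1}{N}\sum_i n_i = 1 .
```

The second identity is why the bar areas, not the bar heights, sum to one when the bins have unequal widths.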

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

Note that the numpy normed argument is known to be problematic with unequal bins, which is why it's being deprecated. If mpl2.1 changed to just passing normed through to numpy, that's likely the root of the issue. See https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.histogram.html

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

This seems to be the critical difference:

>>> print(np.__version__)
1.13.1
>>> print(np.histogram(t, bins, normed=True)[0])
[ 0.          0.00482315  0.03215434  0.0192926   0.03215434  0.1607717
  0.1318328   0.02250804  0.0659164 ]
>>> print(np.histogram(t, bins, density=True)[0])
[ 0.          0.01027397  0.06849315  0.08219178  0.1369863   0.17123288
  0.14041096  0.04794521  0.02808219]

In mpl 2.1, the normed parameter is transformed to the density parameter before being sent to numpy; see #8993

I'm honestly not certain what the numpy density=True option does, but it seems to be something strange.
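The two printed arrays can be cross-checked against each other without NumPy. The sketch below is a reconstruction from the numbers in the session, not a quotation of NumPy's source. It assumes that `bins` in the session is the edge list from the original report, that `density=True` returns `n_i / (N * width_i)`, and that the deprecated `normed=True` returned `n_i / sum(n_j * width_j)`. All variable names are illustrative.

```python
# Values copied from the REPL session above (NumPy 1.13.1).
# ASSUMPTION: `bins` in that session is the edge list from the original report.
edges = [-5, -3, -2, -1, -0.5, 0, 2, 4, 5, 10]
printed_density = [0.0, 0.01027397, 0.06849315, 0.08219178, 0.1369863,
                   0.17123288, 0.14041096, 0.04794521, 0.02808219]
printed_normed = [0.0, 0.00482315, 0.03215434, 0.0192926, 0.03215434,
                  0.1607717, 0.1318328, 0.02250804, 0.0659164]

widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]

# If density=True means height_i = n_i / (N * width_i), then
# height_i * width_i is the fraction of in-range samples in bin i.
fractions = [d * w for d, w in zip(printed_density, widths)]
total_area = sum(fractions)  # should be close to 1 under that reading

# ASSUMED legacy formula: n_i / sum_j(n_j * width_j).
# The common factor N cancels, so the fractions can stand in for raw counts.
scale = sum(f * w for f, w in zip(fractions, widths))
reconstructed_normed = [f / scale for f in fractions]

for got, expected in zip(reconstructed_normed, printed_normed):
    print(f"{got:.8f}  {expected:.8f}")
```

Under these assumptions the reconstructed column agrees with the printed `normed=True` values to the printed precision. Both sets of bars enclose unit area, but the legacy heights are proportional to raw counts rather than to counts per unit width, which is why the two only coincide for equal-width bins.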

@jklymak
Member

jklymak commented Oct 24, 2017

https://github.com/numpy/numpy/blob/v1.13.0/numpy/lib/function_base.py#L432-L826

It seems to do what you are doing....

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

@jklymak – thanks, you're right.

It looks like matplotlib 2.0 normed=True matches numpy density=True, while matplotlib 2.1 normed=True matches numpy's problematic normed=True. Looking at the code, it's unclear to me why that is.

@jklymak
Member

jklymak commented Oct 24, 2017

My error, it is hist2d that just passes to numpy. We have our own hist. Yay! However, I'm not sure what is up with numpy.

With ours, you are supposed to use density, not normed.

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

Here is a concise test case that passes in matplotlib 2.0, but not in 2.1 (tested with numpy 1.13.1):

import numpy as np
import matplotlib.pyplot as plt
from numpy.testing import assert_allclose

def test_hist_normed():
    rng = np.random.RandomState(57483)
    t = rng.randn(100)
    bins = [-3, -1, -0.5, 0, 1, 5]
    mpl_heights, _, _ = plt.hist(t, bins=bins, normed=True)
    np_heights, _ = np.histogram(t, bins=bins, density=True)
    assert_allclose(mpl_heights, np_heights)

@jklymak
Member

jklymak commented Oct 24, 2017

What a mess:

t = np.concatenate([rng.randn(100),
                    2 + 0.1 * rng.randn(100),
                    5 + 3 * rng.randn(100)])
# compute normalized heights by hand
bins0 = [-5, -3, -2.2, -1, -0.5, 0, 2, 4.2, 5.6, 10]
heights, bins = np.histogram(t, bins=bins0)
bin_widths = bins[1:] - bins[:-1]
normed_heights = heights / bin_widths / heights.sum()
bin_centers = 0.5 * (bins[1:] + bins[:-1])

# compare to density hist output
hn, hbins = np.histogram(t, bins=bins0, density=True)
# compare to normed hist output
hn0, hbins0 = np.histogram(t, bins=bins0, normed=True)

# compare to mpl....
hn2, hbins2, patches = plt.hist(t, bins=bins0, density=True, label='MPL plot')

plt.plot(bin_centers, normed_heights, 'ok', label='by hand');
plt.plot(0.5*(hbins[1:]+hbins[:-1]), hn, 'or', ms=3., label='np: density=True');
plt.plot(0.5*(hbins0[1:]+hbins0[:-1]), hn0, 'oc', ms=6., label='np: normed=True');
plt.title(f'matplotlib v{matplotlib.__version__}')
plt.legend()

[image: test]

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

Here's the issue – in 2.1, the normalization is applied twice: once by matplotlib, and once by numpy: https://github.com/matplotlib/matplotlib/blob/v2.1.x/lib/matplotlib/axes/_axes.py#L6201-L6224
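To see why a doubled normalization would show up only with irregular bins, here is a small stand-alone sketch. It is not matplotlib's actual code path: `density_normalize` and `close` are hypothetical helpers, the counts are made up, and the second pass simply reapplies the by-hand formula from earlier in the thread.

```python
def density_normalize(heights, edges):
    """Return heights / width / sum(heights), the by-hand formula used earlier."""
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    total = sum(heights)
    return [h / w / total for h, w in zip(heights, widths)]


def close(a, b, tol=1e-12):
    """Element-wise comparison of two equal-length lists of floats."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


counts = [5, 20, 40, 25, 10]  # made-up counts, for illustration only

# Equal-width bins: normalizing a second time leaves the heights unchanged.
equal_edges = [0, 1, 2, 3, 4, 5]
once_eq = density_normalize(counts, equal_edges)
twice_eq = density_normalize(once_eq, equal_edges)

# Unequal-width bins (the edges from the concise test case above):
# the second pass divides by the widths again and distorts the heights.
unequal_edges = [-3, -1, -0.5, 0, 1, 5]
once_un = density_normalize(counts, unequal_edges)
twice_un = density_normalize(once_un, unequal_edges)

print("equal bins unchanged by second pass:  ", close(once_eq, twice_eq))
print("unequal bins unchanged by second pass:", close(once_un, twice_un))
```

With equal widths, the density heights sum to `1 / width`, so the second pass multiplies and divides by the same factor. That cancellation does not happen when widths differ, which fits the observation that default, evenly spaced bins did not expose the problem.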

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

The problematic lines seem to be removed already from master, though they are still in the 2.1.x branch.

@jklymak
Member

jklymak commented Oct 24, 2017

Ha ha. Embarrassingly, I even commented on it: #9121. Even more embarrassingly, I hadn't updated that branch of master yet.

Someone should still probably talk to numpy about their "normed" kw!

[image: test]

@jklymak jklymak closed this as completed Oct 24, 2017
@afvincent
Contributor

afvincent commented Oct 24, 2017

@jklymak Well if I am correct, normed is deprecated in Numpy more or less because it is misbehaving with uneven bins, so upstream is likely to already know about it, isn't it?

Edit: Here it is in the docstring of numpy.histogram.

@jklymak
Member

jklymak commented Oct 24, 2017

Ah sorry. I thought they were deprecating density.

@jakevdp
Contributor Author

jakevdp commented Oct 24, 2017

I could open a PR with that test case – I suspect it will pass on master, and we could then back-port appropriate changes to 2.1.x

@jklymak
Member

jklymak commented Oct 24, 2017

It’d be great to have a test with unequal bins.

@tacaswell
Member

Did we backport whatever fixed this on master to 2.1.x? Re-opening to make sure that does not get lost (sorry if I am stepping on anyone's toes!).

@tacaswell tacaswell reopened this Oct 24, 2017
@jklymak
Member

jklymak commented Oct 24, 2017

#9121. Not sure if it was backported!

@dstansby
Member

Huh, just thought I had removed old code, didn't realise that I'd accidentally fixed anything with that PR!

@jakevdp
Contributor Author

jakevdp commented Oct 26, 2017

It looks like both #9563 and #9586 have been merged and the appropriate code is backported to 2.1.x. That will resolve this issue once the next 2.1.x bugfix is released.

@jakevdp
Contributor Author

jakevdp commented Oct 26, 2017

Actually, #9586 hasn't been merged yet but the button is green if someone wants to do it 😄

@dopplershift
Contributor

Merged. 🎉

@mshonichev

mshonichev commented Mar 21, 2018

Which exact version of matplotlib contains the fix?
I ask because it seems I have run into this bug as well, and I'm looking for a quick solution.

The code:

import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
matplotlib.style.use('ggplot')

data = pd.read_csv('checkpoint-times.csv', usecols=[6], delimiter=';', header=None)
data = data * 1.0
fig, ax = plt.subplots()
num_bins=72
n, bins, patches = ax.hist(data, bins=num_bins, normed=0)
ax.set_xlabel('Checkpoint fsync time, ms')
ax.set_ylabel('Number of nodes')
ax.set_title('Checkpoint end fsync time distribution per cluster nodes')
plt.savefig('checkpoint-fsync-times.png')

Ubuntu
pandas (0.20.3)
numpy (1.13.1)
matplotlib (2.0.2)
gives the proper graph, attached as pic1.png:
[image: pic1]

Centos
matplotlib (2.2.2)
numpy (1.14.2)
pandas (0.22.0)
gives the incorrect graph, attached as pic2.png:
[image: pic2]

Experimenting with bins or density=True/False did not help. Could you advise me on a proper way to solve the issue?

@jakevdp
Contributor Author

jakevdp commented Mar 21, 2018

@mshonichev – this bug would only change things if you were using unevenly-spaced bins; I don't think it's related to the problem you're having.

@mshonichev

Hmm... that might be exactly the case: the source data has only 144 points, and they are not evenly distributed. Is there any workaround other than downgrading?

checkpoint-times.csv.txt

@jklymak
Member

jklymak commented Mar 21, 2018

Can you open a new issue with a minimal (no extra calls), self-contained example (no csv file)? Also, the normed kwarg is deprecated, and I don’t know what passing zero does.

@jakevdp
Contributor Author

jakevdp commented Mar 21, 2018

The relevant piece is the spacing of the bins, not the spacing of the data points.

Since you use bins=72, the bins are evenly-spaced, and this issue will not be relevant to the bug you're seeing.

I would open a new matplotlib issue to ask about this bug, but try to put together an example that others can run: the one above relies on reading a data file that is unavailable to anyone else – see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
