FIX: violinplot crashed if input variance was zero #4816

Merged
merged 1 commit into from
Aug 12, 2015

Conversation

brunobeltran
Contributor

plt.violinplot would always throw an error if any of the data columns were composed of a single repeated element.

This bug is described in detail in the Stack Exchange post:
http://stats.stackexchange.com/questions/89754/statsmodels-error-in-kde-on-a-list-of-repeated-values

Although the answer there claims that this is not a bug, I disagree. A point mass is a perfectly reasonable and common kind of data to have, and the user should not have to work around it themselves.

The current default method of getting "coords", 100 points spanning [min(X), max(X)], ensures that the point mass will be displayed correctly by default.

The simple test added to tests/test_axes.py can be used to reproduce the behavior.

@@ -7165,6 +7165,9 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5,
        """

        def _kde_method(X, coords):
            # fallback gracefully if the vector contains only one value
            if np.all(X[0] == X)
Member

missing :

@tacaswell tacaswell added this to the proposed next point release milestone Jul 29, 2015
@brunobeltran
Contributor Author

Ahh, could have sworn I ran that. Long day.

Should I rebase or just leave the separate commits?

@tacaswell
Member

Up to you.

Tagged this as proposed next release as I would like to understand the stats issue here a bit better.

@brunobeltran
Contributor Author

Not sure of your background (I'm new to the Python community), but I can probably give a quick explanation. The violinplot requires a kernel density estimate, that is, an estimate of the probability density function of the data. The relevant function is, of course, mlab.GaussianKDE (which, btw, is a reimplementation of the fully-featured scipy.kde....).

The algorithm needs to scale its results by some factor times the determinant of the sample covariance matrix by default (calculated on line 3764 of mlab.py). Since this is done once per "violin" in the plot, and each "violin" corresponds to a 1D dataset, we're really talking about simply calculating 1/the sample variance in this case.

When the user provides a set of data whose sample variance is zero (as is often the case in statistical simulations, where the initial conditions are often chosen to be a point mass), we get a simple divide-by-zero error. It's simply disguised as a linear algebra error since the calculation is np.linalg.det([number]).
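The failure mode can be sketched without matplotlib. What follows is a minimal pure-Python 1D Gaussian KDE, not the mlab.GaussianKDE code: the name `naive_gaussian_kde` is hypothetical, and using Scott's rule for the bandwidth is an assumption made for illustration. With constant data the sample variance is zero, so the inverse of the kernel variance cannot be computed. Here that surfaces as a plain ZeroDivisionError rather than a linear algebra error.

```python
import math
import statistics


def naive_gaussian_kde(data, coords):
    """Evaluate a 1D Gaussian KDE of `data` at each point in `coords`.

    Hypothetical helper for illustration only; bandwidth follows Scott's rule.
    """
    n = len(data)
    variance = statistics.variance(data)    # sample variance (n - 1 denominator)
    factor = n ** (-1.0 / 5.0)              # Scott's factor for 1D data
    bandwidth_sq = variance * factor ** 2   # variance of each Gaussian kernel
    inv_cov = 1.0 / bandwidth_sq            # fails when the variance is zero
    norm = math.sqrt(2.0 * math.pi * bandwidth_sq) * n
    return [
        sum(math.exp(-0.5 * (x - xi) ** 2 * inv_cov) for xi in data) / norm
        for x in coords
    ]


# Data with some spread works as expected.
density = naive_gaussian_kde([0.0, 1.0, 2.0], [1.0])

# A point mass (every value identical) triggers the division by zero.
try:
    naive_gaussian_kde([3.0, 3.0, 3.0], [3.0])
except ZeroDivisionError:
    print("zero variance: cannot scale the kernel")
```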

Now, since the violin plot is designed to show the distribution of the data along the "violin"'s axis, the natural thing to do when all the data is at a point seems to be to simply represent this by a perfectly flat line, which is what my change does.
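A rough sketch of such a fallback, in pure Python so it runs without numpy or matplotlib. The diff excerpt in this conversation shows only the guard condition, so the return value used here (an indicator of the repeated value) is an assumption, as are the hypothetical name `point_mass_safe_kde` and the `kde` callback parameter.

```python
def point_mass_safe_kde(data, coords, kde):
    """Evaluate `kde(data, coords)` unless every value in `data` is identical.

    Hypothetical wrapper for illustration; `kde` is any callable taking
    (data, coords) and returning one density value per coordinate.
    """
    first = data[0]
    if all(x == first for x in data):
        # Point mass: 1.0 where a coordinate equals the repeated value,
        # 0.0 elsewhere, so the regular KDE is never called.
        return [1.0 if c == first else 0.0 for c in coords]
    return kde(data, coords)


# With constant data, min(X) == max(X), so every default coordinate equals
# the repeated value and the fallback returns the same height everywhere.
flat = point_mass_safe_kde([3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], kde=None)
```

Because the guard runs before the estimator is touched, data with nonzero variance still goes through the regular KDE unchanged.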

@WeatherGod
Member

Haven't read the code yet, but the explanation makes sense to me.

@jenshnielsen
Member

Any reason to not merge this?

WeatherGod added a commit that referenced this pull request Aug 12, 2015
FIX: violinplot crashed if input variance was zero
@WeatherGod WeatherGod merged commit d85a9a4 into matplotlib:master Aug 12, 2015
4 participants