FIX: violinplot crashed if input variance was zero #4816

brunobeltran · 2015-07-29T03:35:21Z

plt.violinplot would always throw an error if any of the data columns were composed of a single repeated element.

This bug is described in detail in the Stack Exchange post:
http://stats.stackexchange.com/questions/89754/statsmodels-error-in-kde-on-a-list-of-repeated-values

Although the answer there claims that this is not a bug, I disagree. A point mass is a perfectly reasonable and common input, and the user should not have to work around it themselves.

The current default method of getting "coords", [min(X):max(X):100], ensures that the point mass will be displayed correctly by default.

The simple test added to tests/test_axes.py can be used to reproduce the behavior.

tacaswell · 2015-07-29T03:56:22Z

lib/matplotlib/axes/_axes.py

@@ -7165,6 +7165,9 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5,
        """

        def _kde_method(X, coords):
+            # fallback gracefully if the vector contains only one value
+            if np.all(X[0] == X)


missing :

brunobeltran · 2015-07-29T04:06:56Z

Ahh, could have sworn I ran that. Long day.

Should I rebase or just leave the separate commits?

tacaswell · 2015-07-29T04:14:20Z

Up to you.

Tagged this as proposed next release as I would like to understand the stats issue here a bit better.

brunobeltran · 2015-07-29T04:41:39Z

Not sure of your background (I'm new to the Python community), but I can probably give a quick explanation. The violinplot requires a kernel density estimation--an estimation of the probability density function of the data. The relevant function is, of course, mlab.GaussianKDE (which, btw, is a reimplementation of the fully-featured scipy.kde....).

The algorithm needs to scale its results by some factor times the determinant of the sample covariance matrix by default (calculated on line 3764 of mlab.py). Since this is done once per "violin" in the plot, and each "violin" corresponds to a 1D dataset, we're really talking about simply calculating 1/the sample variance in this case.

When the user provides a set of data whose sample variance is zero---as is often the case in statistical simulations where the initial conditions are chosen to be a point mass---we get a simple divide-by-zero error. It's simply disguised as a linear algebra error since the calculation is np.linalg.det([number]).

Now, since the violin plot is designed to show the distribution of the data along the "violin"'s axis, the natural thing to do when all the data is at a point seems to be to simply represent this by a perfectly flat line, which is what my change does.
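
To make the explanation above concrete, here is a minimal pure-Python sketch of the failure and of a fallback in the spirit of this change. It is not the mlab.GaussianKDE code: `gaussian_kde_1d` and `kde_with_point_mass_fallback` are hypothetical names, the bandwidth uses Scott's rule as an assumed default, and the value returned in the fallback branch is an assumption, since the diff excerpt in this thread only shows the condition being tested.

```python
import math
import statistics


def gaussian_kde_1d(data, coords):
    """Simplified 1D Gaussian KDE (hypothetical stand-in, not mlab.GaussianKDE)."""
    n = len(data)
    variance = statistics.variance(data)  # sample variance, needs n >= 2
    factor = n ** (-1.0 / 5.0)  # Scott's rule factor in one dimension (assumed)
    bandwidth_sq = variance * factor ** 2
    # The normalisation divides by the scaled standard deviation, so a
    # point mass (variance == 0) raises ZeroDivisionError here.
    norm = 1.0 / (n * math.sqrt(2.0 * math.pi * bandwidth_sq))
    return [
        norm * sum(math.exp(-((c - x) ** 2) / (2.0 * bandwidth_sq)) for x in data)
        for c in coords
    ]


def kde_with_point_mass_fallback(data, coords):
    """Skip the KDE when every value in data is identical."""
    if all(x == data[0] for x in data):
        # Assumed fallback value: 1.0 where the coordinate equals the
        # repeated value, 0.0 elsewhere.
        return [1.0 if c == data[0] else 0.0 for c in coords]
    return gaussian_kde_1d(data, coords)


point_mass = [2.0, 2.0, 2.0]
# With min(X) == max(X), every evaluation coordinate equals the repeated value.
coords = [2.0] * 5
flat = kde_with_point_mass_fallback(point_mass, coords)  # every entry is 1.0
```

Because the coordinates for a point mass all coincide with the repeated value, the fallback yields a constant profile, which matches the flat line described above. Calling `gaussian_kde_1d(point_mass, coords)` directly raises the division error instead.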

WeatherGod · 2015-07-29T13:34:22Z

Haven't read the code yet, but the explanation makes sense to me.

jenshnielsen · 2015-08-11T19:07:43Z

Any reason to not merge this?

FIX: violinplot crashed if input variance was zero

tacaswell added the status: needs review label Jul 29, 2015

tacaswell reviewed Jul 29, 2015

tacaswell added this to the proposed next point release milestone Jul 29, 2015

FIX: violinplot crashed if input variance was zero

b30ff74

brunobeltran force-pushed the master branch from 41b5eaa to b30ff74 Compare July 29, 2015 04:20

WeatherGod added a commit that referenced this pull request Aug 12, 2015

Merge pull request #4816 from brunobeltran/master

d85a9a4

FIX: violinplot crashed if input variance was zero

WeatherGod merged commit d85a9a4 into matplotlib:master Aug 12, 2015

tacaswell removed the status: needs review label Aug 12, 2015