FIX: violinplot crashed if input variance was zero #4816
@@ -7165,6 +7165,9 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5,
        """

        def _kde_method(X, coords):
            # fallback gracefully if the vector contains only one value
            if np.all(X[0] == X)
missing :
Ahh, could have sworn I ran that. Long day. Should I rebase or just leave the separate commits?
Up to you. Tagged this as proposed next release as I would like to understand the stats issue here a bit better.
Not sure of your background (new to the Python community), but I can probably give a quick explanation.

The violinplot requires a kernel density estimation, that is, an estimate of the probability density function of the data. The relevant function is, of course, mlab.GaussianKDE (which, btw, is a reimplementation of the fully featured scipy.kde...). By default, the algorithm needs to scale its results by some factor times the determinant of the sample covariance matrix (calculated on line 3764 of mlab.py). Since this is done once per "violin" in the plot, and each "violin" corresponds to a 1D dataset, we're really talking about simply calculating a quantity that divides by the sample variance of that one dataset.

When the user provides a set of data whose sample variance is zero (as is often the case in statistical simulations, where the initial conditions are often chosen to be a point mass), we get a simple divide-by-zero error. It's simply disguised as a linear algebra error since the calculation is done on the covariance matrix.

Now, since the violin plot is designed to show the distribution of the data along the "violin"'s axis, the natural thing to do when all the data is at a point seems to be to simply represent this by a perfectly flat line, which is what my change does.
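To make the divide-by-zero concrete, here is a minimal pure-Python sketch of a 1D Gaussian KDE. It is not the mlab.GaussianKDE code; the function names `gaussian_kde_1d` and `kde_with_fallback` are hypothetical, the bandwidth uses Scott's rule as an assumption, and the fallback values (1.0 where a coordinate equals the repeated value, 0.0 elsewhere) are one plausible reading of "a perfectly flat line".

```python
import math
import statistics


def gaussian_kde_1d(data, coords):
    """Evaluate a 1D Gaussian KDE of `data` at each point in `coords`."""
    n = len(data)
    variance = statistics.variance(data)   # sample variance of the dataset
    factor = n ** (-1.0 / 5.0)             # Scott's rule in one dimension (assumed)
    bandwidth_sq = variance * factor ** 2
    inv_bandwidth_sq = 1.0 / bandwidth_sq  # ZeroDivisionError when variance == 0
    norm = math.sqrt(2.0 * math.pi * bandwidth_sq) * n
    return [
        sum(math.exp(-0.5 * (x - d) ** 2 * inv_bandwidth_sq) for d in data) / norm
        for x in coords
    ]


def kde_with_fallback(data, coords):
    """Hypothetical wrapper: skip the KDE when every value is identical."""
    if all(d == data[0] for d in data):
        # Point mass: mark the coordinates that sit exactly on the value.
        return [1.0 if x == data[0] else 0.0 for x in coords]
    return gaussian_kde_1d(data, coords)
```

Calling `gaussian_kde_1d([2.0, 2.0, 2.0], [2.0])` fails on the inverse of the zero variance, while `kde_with_fallback` returns without ever computing it.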
Haven't read the code yet, but the explanation makes sense to me.
Any reason to not merge this?
FIX: violinplot crashed if input variance was zero
plt.violinplot would always throw an error if any of the data columns were composed of a single repeated element.
This bug is described in detail in the stackexchange post:
http://stats.stackexchange.com/questions/89754/statsmodels-error-in-kde-on-a-list-of-repeated-values
Although the answer there claims that this is not a bug, I disagree. A point mass is a perfectly reasonable and common data point to have, and should not require the user to work around it themselves.
The current default method of getting "coords", [min(X):max(X):100], ensures that the point mass will be correctly displayed by default.
The simple test added to tests/test_axes.py can be used to reproduce the behavior.
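As a rough illustration of why the default "coords" work for a point mass, the sketch below assumes that [min(X):max(X):100] means 100 evenly spaced points between the minimum and maximum. The helper names `default_coords` and `point_mass_density` are hypothetical and are not the code in this commit.

```python
def default_coords(data, points=100):
    """Evenly spaced coordinates from min(data) to max(data) (assumed default)."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / (points - 1)
    return [lo + i * step for i in range(points)]


def point_mass_density(data, coords):
    """Hypothetical fallback: 1.0 where a coordinate equals the repeated value."""
    return [1.0 if x == data[0] else 0.0 for x in coords]


data = [3.0] * 10                      # a column made of one repeated element
coords = default_coords(data)          # min == max, so every coordinate is 3.0
density = point_mass_density(data, coords)
```

Because the minimum and maximum coincide, every coordinate lands on the point mass, so the fallback is nonzero everywhere it is evaluated and the violin collapses to a flat line at that value.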