Closed
Description
While the hist
method allows to plot empirical cumulative distribution functions (ECDFs) with the cumulative
and density
keywords, this is far from perfect:
- It uses a binning, which is unnecessary for ECDFs and either leads to inaccuracies (low binning) or a waste of time and spurious plot points for small datasets (high binning).
- The default plot style is bars which are not the typical way to visualise ECDFs.
- It is not easy to find. I bet that this very issue will soon be the first thing people wanting to plot the ECDF with Matplotlib will find.
Please implement a “proper” ECDF.
Half a pull request
The following should contain all the major components necessary to implement this. If I had the the time and patience to work through Matplotlib’s code structure, testing, replicating everything for the legacy interface and so on, I would make a pull request.
from warnings import warn
import numpy as np
from matplotlib import pyplot as plt
def ecdf_points(data,remove_redundant,absolute=False):
assert data.ndim==1
assert np.all(np.isfinite(data))
abscissae = np.append(data,np.inf)
abscissae.sort()
if absolute:
ordinates = np.arange(len(abscissae),dtype=int)
else:
ordinates = np.linspace(0,1,len(abscissae))
if remove_redundant:
needed = np.ediff1d(abscissae,to_begin=True).astype(bool)
abscissae = abscissae[needed]
ordinates = ordinates[needed]
return abscissae, ordinates
test_data = np.array([4,2,3,2,4,3,1,2,1,4])
np.testing.assert_allclose(
ecdf_points(test_data,remove_redundant=True),
[ [1,2,3,4,np.inf], [0,0.2,0.5,0.7,1.0] ],
)
np.testing.assert_equal(
ecdf_points(test_data,remove_redundant=True,absolute=True),
[ [1,2,3,4,np.inf], [0,2,5,7,10] ],
)
def ecdf(axes,x,*args,remove_redundant=False,absolute=False,**kwargs):
"""
Plot the empirical cumulative distribution function, i.e., the fraction of data points less than or equal than some value. Returns the supporting points of the underlying step plot.
Parameters
----------
x : one-dimensional array
Input values
remove_redundant : bool, optional
Whether spurious points arising from identical data points shall be removed. This is mainly useful for large datasets containing many identical data points, where you get an unnecessarily large or inefficient figure otherwise. If your data does not contain any duplicate points, this has no effect and just uses some time and memory.
absolute : bool, optional
Whether to plot the absolute number (instead of the fraction) of data points less than or equal than some value.
Returns
-------
abscissae : array
The positions of steps in the ECDF.
ordinates : array
The lower values of the steps.
lines : Lines object
The lines object produced by the underlying plot call.
"""
if "drawstyle" in kwargs and kwargs["drawstyle"]!="steps-pre":
warn("Overwriting 'drawstyle' argument with 'steps-pre' to achieve a correct ECDF.")
kwargs["drawstyle"] = "steps-pre"
points = ecdf_points(x,remove_redundant,absolute)
lines = axes.plot(*points,*args,**kwargs)
return (*points,lines)
fig,axes = plt.subplots()
ecdf(axes,test_data,remove_redundant=True,absolute=True)
axes.hist(test_data,bins=1000,cumulative=True);
fig,axes = plt.subplots()
ecdf(axes,test_data,remove_redundant=True)
axes.hist(test_data,bins=1000,cumulative=True,density=True);