Feature request: proper ECDF

While the `hist` method can plot empirical cumulative distribution functions (ECDFs) via the `cumulative` and `density` keywords, this is far from perfect:

* It uses binning, which is unnecessary for ECDFs and leads either to inaccuracies (too few bins) or to wasted time and spurious plot points for small datasets (too many bins).
* The default plot style is bars, which are not the typical way to visualise ECDFs.
* It is not easy to find. I bet that this very issue will soon be the first thing people wanting to plot the ECDF with Matplotlib will find.
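
To illustrate the first point, here is a minimal sketch (not Matplotlib's implementation) that compares a coarsely binned cumulative histogram from `np.histogram` with the exact ECDF obtained via `np.searchsorted`. The helper name `exact_ecdf` is hypothetical and only serves this illustration; the data are the test values used further down.

```python
import numpy as np

def exact_ecdf(sample, x):
    """Fraction of sample values <= x (hypothetical helper for illustration)."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side="right") / len(sample)

data = np.array([4, 2, 3, 2, 4, 3, 1, 2, 1, 4])

# Binned estimate with only two bins (edges at 1, 2.5 and 4); this is
# roughly what a cumulative, normalised histogram reports per bin.
counts, edges = np.histogram(data, bins=2)
binned_cdf = np.cumsum(counts) / len(data)

# Exact ECDF at the distinct data values, with no binning involved.
exact = exact_ecdf(data, [1, 2, 3, 4])

print(binned_cdf)  # one value per bin
print(exact)       # one value per distinct data point
```

With two bins, the whole range from 1 to 2.5 is reported as 0.5, although the exact ECDF is only 0.2 between 1 and 2. Raising the bin count reduces this error but adds plot points that carry no information.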

Please implement a “proper” ECDF.

### Half a pull request

The following should contain all the major components necessary to implement this. If I had the time and patience to work through Matplotlib’s code structure, testing, replicating everything for the legacy interface, and so on, I would make a pull request.

``` Python 3
from warnings import warn
import numpy as np
from matplotlib import pyplot as plt

def ecdf_points(data,remove_redundant,absolute=False):
    data = np.asarray(data)  # also accept lists and other sequences
    assert data.ndim==1
    assert np.all(np.isfinite(data))
    
    abscissae = np.append(data,np.inf)
    abscissae.sort()
    
    if absolute:
        ordinates = np.arange(len(abscissae),dtype=int)
    else:
        ordinates = np.linspace(0,1,len(abscissae)) 
    
    if remove_redundant:
        needed = np.ediff1d(abscissae,to_begin=True).astype(bool)
        abscissae = abscissae[needed]
        ordinates = ordinates[needed]
    
    return abscissae, ordinates

test_data = np.array([4,2,3,2,4,3,1,2,1,4])
np.testing.assert_allclose(
    ecdf_points(test_data,remove_redundant=True),
    [ [1,2,3,4,np.inf], [0,0.2,0.5,0.7,1.0] ],
)
np.testing.assert_equal(
    ecdf_points(test_data,remove_redundant=True,absolute=True),
    [ [1,2,3,4,np.inf], [0,2,5,7,10] ],
)


def ecdf(axes,x,*args,remove_redundant=False,absolute=False,**kwargs):
    """
        Plot the empirical cumulative distribution function, i.e., the fraction of data points less than or equal to some value. Returns the supporting points of the underlying step plot and the plotted lines.

        Parameters
        ----------
        x : one-dimensional array
            Input values
        
        remove_redundant : bool, optional
            Whether spurious points arising from identical data points shall be removed. This is mainly useful for large datasets containing many identical data points, where you get an unnecessarily large or inefficient figure otherwise. If your data does not contain any duplicate points, this has no effect and just uses some time and memory.
        
        absolute : bool, optional
            Whether to plot the absolute number (instead of the fraction) of data points less than or equal to some value.
            
        Returns
        -------
        abscissae : array
            The positions of steps in the ECDF.

        ordinates : array
            The lower values of the steps.
        
        lines : list of Line2D
            The lines produced by the underlying plot call.
    """
    if "drawstyle" in kwargs and kwargs["drawstyle"]!="steps-pre":
        warn("Overwriting 'drawstyle' argument with 'steps-pre' to achieve a correct ECDF.")
    kwargs["drawstyle"] = "steps-pre"
    
    points = ecdf_points(x,remove_redundant,absolute)
    lines = axes.plot(*points,*args,**kwargs)
    return (*points,lines)

fig,axes = plt.subplots()
ecdf(axes,test_data,remove_redundant=True,absolute=True)
axes.hist(test_data,bins=1000,cumulative=True);

fig,axes = plt.subplots()
ecdf(axes,test_data,remove_redundant=True)
axes.hist(test_data,bins=1000,cumulative=True,density=True);
```
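
The supporting points can also be used to evaluate the ECDF without plotting. The following is a minimal sketch under the assumption that the points are read as a right-continuous step function, using the abscissae and ordinates from the relative-frequency test above; `evaluate_ecdf` is a hypothetical helper and not part of the proposal.

```python
import numpy as np

def evaluate_ecdf(abscissae, ordinates, x):
    """Evaluate the step function at x (hypothetical helper).

    Read as a right-continuous step function, ordinates[i] is the ECDF
    value on [abscissae[i-1], abscissae[i]), so we look up the first
    abscissa strictly greater than x.
    """
    idx = np.searchsorted(abscissae, x, side="right")
    # x = inf would index past the end; clamp to the last ordinate.
    idx = np.minimum(idx, len(ordinates) - 1)
    return np.asarray(ordinates)[idx]

# Supporting points for the data [4,2,3,2,4,3,1,2,1,4] with duplicates removed.
abscissae = np.array([1, 2, 3, 4, np.inf])
ordinates = np.array([0, 0.2, 0.5, 0.7, 1.0])

values = evaluate_ecdf(abscissae, ordinates, [0.5, 1, 2.5, 4, 100])
print(values)
```

The trailing `np.inf` abscissa is what makes the lookup work for values beyond the largest data point.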

Feature request: proper ECDF #16561
