Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Feature request: proper ECDF #16561

Closed
@Wrzlprmft

Description

@Wrzlprmft

While the hist method allows to plot empirical cumulative distribution functions (ECDFs) with the cumulative and density keywords, this is far from perfect:

  • It uses a binning, which is unnecessary for ECDFs and either leads to inaccuracies (low binning) or a waste of time and spurious plot points for small datasets (high binning).
  • The default plot style is bars which are not the typical way to visualise ECDFs.
  • It is not easy to find. I bet that this very issue will soon be the first thing people wanting to plot the ECDF with Matplotlib will find.

Please implement a “proper” ECDF.

Half a pull request

The following should contain all the major components necessary to implement this. If I had the the time and patience to work through Matplotlib’s code structure, testing, replicating everything for the legacy interface and so on, I would make a pull request.

from warnings import warn
import numpy as np
from matplotlib import pyplot as plt

def ecdf_points(data,remove_redundant,absolute=False):
    assert data.ndim==1
    assert np.all(np.isfinite(data))
    
    abscissae = np.append(data,np.inf)
    abscissae.sort()
    
    if absolute:
        ordinates = np.arange(len(abscissae),dtype=int)
    else:
        ordinates = np.linspace(0,1,len(abscissae)) 
    
    if remove_redundant:
        needed = np.ediff1d(abscissae,to_begin=True).astype(bool)
        abscissae = abscissae[needed]
        ordinates = ordinates[needed]
    
    return abscissae, ordinates

test_data = np.array([4,2,3,2,4,3,1,2,1,4])
np.testing.assert_allclose(
    ecdf_points(test_data,remove_redundant=True),
    [ [1,2,3,4,np.inf], [0,0.2,0.5,0.7,1.0] ],
)
np.testing.assert_equal(
    ecdf_points(test_data,remove_redundant=True,absolute=True),
    [ [1,2,3,4,np.inf], [0,2,5,7,10] ],
)


def ecdf(axes,x,*args,remove_redundant=False,absolute=False,**kwargs):
    """
        Plot the empirical cumulative distribution function, i.e., the fraction of data points less than or equal than some value. Returns the supporting points of the underlying step plot.

        Parameters
        ----------
        x : one-dimensional array
            Input values
        
        remove_redundant : bool, optional
            Whether spurious points arising from identical data points shall be removed. This is mainly useful for large datasets containing many identical data points, where you get an unnecessarily large or inefficient figure otherwise. If your data does not contain any duplicate points, this has no effect and just uses some time and memory.
        
        absolute : bool, optional
            Whether to plot the absolute number (instead of the fraction) of data points less than or equal than some value.
            
        Returns
        -------
        abscissae : array
            The positions of steps in the ECDF.

        ordinates : array
            The lower values of the steps.
        
        lines : Lines object
            The lines object produced by the underlying plot call.
    """
    if "drawstyle" in kwargs and kwargs["drawstyle"]!="steps-pre":
        warn("Overwriting 'drawstyle' argument with 'steps-pre' to achieve a correct ECDF.")
    kwargs["drawstyle"] = "steps-pre"
    
    points = ecdf_points(x,remove_redundant,absolute)
    lines = axes.plot(*points,*args,**kwargs)
    return (*points,lines)

fig,axes = plt.subplots()
ecdf(axes,test_data,remove_redundant=True,absolute=True)
axes.hist(test_data,bins=1000,cumulative=True);

fig,axes = plt.subplots()
ecdf(axes,test_data,remove_redundant=True)
axes.hist(test_data,bins=1000,cumulative=True,density=True);

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions