Add Axes.ecdf() method. #24728

Merged
merged 1 commit into from
Mar 26, 2023

Conversation

anntzer
Contributor

@anntzer anntzer commented Dec 14, 2022

PR Summary

See discussion at #16561 (attn @Wrzlprmft). I chose to implement remove_redundant (under the name "compress") as I can see cases where it would be helpful for performance, but left "absolute" ecdfs unimplemented, as I have never seen them (and https://stats.stackexchange.com/questions/451601/what-are-absolute-ecdfs-called-if-anybody-uses-them didn't attract too much activity).

I updated the histogram_cumulative example to showcase this (as one should basically always use ecdf() instead of hist(..., cumulative=True, density=True), but let's not overdo it and immediately consider removing that, as it's probably extremely widely used). Note that I removed the reference to astropy's bin selection examples, as it's already mentioned elsewhere (e.g. the histogram_features example) and also doesn't really apply as is to cumulative histograms.
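
To illustrate why an ECDF is usually preferable to a binned cumulative histogram, here is a minimal NumPy-only sketch. The sample values, the bin count, and the helper name `ecdf_at` are made up for illustration; the sketch only approximates what `hist(..., cumulative=True, density=True)` displays and does not call Matplotlib.

```python
import numpy as np

# Hypothetical sample; values chosen only to make the effect visible.
data = np.array([0.1, 0.2, 0.9])

# Binned view: cumulative fraction of samples per bin,
# normalized so that the last bin equals 1.
counts, edges = np.histogram(data, bins=2, range=(0.0, 1.0))
cumulative_hist = np.cumsum(counts) / counts.sum()


def ecdf_at(sample, t):
    """Fraction of `sample` that is <= t (illustrative helper)."""
    sample = np.sort(sample)
    return np.searchsorted(sample, t, side="right") / len(sample)


# The first bin covers [0, 0.5) and reports one value for all of it,
# while the ECDF resolves how much of the data lies below 0.15.
print(cumulative_hist[0], ecdf_at(data, 0.15))
```

The binned result depends on the choice of bins, whereas the ECDF needs no such parameter.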


Note: I am not overly convinced by silently dropping nans; we could also error on them. (+/-inf should clearly be supported, as they have a non-ambiguous interpretation, which I have relied on before).

PR Checklist

Documentation and Tests

  • Has pytest style unit tests (and pytest passes)
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • New plotting related features are documented with examples.

Release Notes

  • New features are marked with a .. versionadded:: directive in the docstring and documented in doc/users/next_whats_new/
  • API changes are marked with a .. versionchanged:: directive in the docstring and documented in doc/api/next_api_changes/
  • Release notes conform with instructions in next_whats_new/README.rst or next_api_changes/README.rst

@oscargus
Member

.. added...?

@story645
Member

I'm ambivalent about adding a new computational plotting method, especially since seaborn provides it, but if it exists then there's a lovely open space for it in plot types: stats and it is the type of basic thing worth adding there.

@jklymak
Member

jklymak commented Dec 14, 2022

This is basically just ax.plot(np.sort(x), np.arange(len(x)))? Even with weights and normalizations, this seems pretty trivial for users to compute themselves, and I'm not a huge fan of providing folks with statistical black boxes in Matplotlib core.

@anntzer
Contributor Author

anntzer commented Dec 14, 2022

@oscargus @story645 I have addressed your comments.

@jklymak This was already argued at #16561 (comment) and #16561 (comment): the tricky part is not the calculation (which is indeed pretty trivial), but in selecting the correct drawstyle and adding the correct point at the correct end so that the ecdf is indeed a step plot (see https://en.wikipedia.org/wiki/Empirical_distribution_function, https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/ecdfplot.htm, https://fr.mathworks.com/help/stats/ecdf.html, ...) instead of connecting the points with diagonal lines (which is actually wrong).
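
The difference between a step plot and diagonal connecting lines can be sketched with NumPy alone. The sample values and the helper names `ecdf_step` and `ecdf_diagonal` are hypothetical; the vertex construction (an extra point at height 0, drawn with `drawstyle="steps-post"`) is one way to get a proper step function and is not necessarily how the PR does it.

```python
import numpy as np

# Hypothetical sample for illustration.
x = np.sort(np.array([1.0, 2.0, 4.0]))
cum = (1 + np.arange(len(x))) / len(x)  # 1/3, 2/3, 1

# Vertices for a step plot: start at height 0 at the smallest sample, so
# that e.g. ax.plot(xs, ys, drawstyle="steps-post") jumps at each sample
# and stays flat in between.
xs = np.concatenate([[x[0]], x])
ys = np.concatenate([[0.0], cum])


def ecdf_step(t):
    """True ECDF: fraction of samples <= t (piecewise constant)."""
    return np.searchsorted(x, t, side="right") / len(x)


def ecdf_diagonal(t):
    """Value read off when the points are joined by straight lines."""
    return np.interp(t, x, cum)


# Between the samples 2 and 4 the two readings disagree.
print(ecdf_step(3.0), ecdf_diagonal(3.0))
```

Here only two of the three samples are at most 3, so the step value is the correct one; the interpolated value overstates it.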

@anntzer
Contributor Author

anntzer commented Dec 15, 2022

attn @Phlya @jaroslawr @mwaskom who have also commented on the original thread.

@jaroslawr

I think this is great. Histograms, box plots, and violin plots are already in matplotlib, and ECDFs are a broadly useful type of plot in many different application areas. I would probably use pure matplotlib if only it had this feature. Libraries like seaborn come with the cost of a whole abstraction layer over matplotlib: you often have to understand both seaborn and matplotlib, how seaborn passes arguments to matplotlib, and so on. Filling in a few gaps, as this PR does, would make using pure matplotlib much more attractive.

@anntzer anntzer force-pushed the ecdf branch 2 times, most recently from 66f1b28 to bf9e787 December 15, 2022 10:51
@Wrzlprmft

Would it make sense to warn the users who are still using hist(...,cumulative=True)? At the very least, the documentation should point to the new function.

@anntzer
Contributor Author

anntzer commented Dec 15, 2022

@Wrzlprmft While I agree with you that there are rather few use cases of hist(..., cumulative=True) that are not better served by ecdf(), let's first get the method in and decide later whether to include your warning; I don't want to be derailed into a side-discussion.

@anntzer
Contributor Author

anntzer commented Feb 2, 2023

The general agreement during the call was to error on nans instead, and also to error on any input that has masked values. The docstring should also be modified to suggest various ways to handle nans (ignore them, or map them to +/-inf).
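
The two ways of handling nans mentioned here can be sketched as a preprocessing step before the data is passed on. This is a NumPy-only illustration with made-up values and a hypothetical helper `ecdf_at`; it is not the wording or code that ended up in the docstring.

```python
import numpy as np

# Hypothetical measurements with one missing value.
raw = np.array([3.0, np.nan, 1.0, 2.0])

# Option 1: ignore nans, so the ECDF is normalized over valid samples only.
dropped = raw[~np.isnan(raw)]

# Option 2: map nans to +inf, so they still count in the normalization
# but the ECDF only accounts for them "at infinity".
as_inf = np.where(np.isnan(raw), np.inf, raw)


def ecdf_at(sample, t):
    """Fraction of `sample` that is <= t (illustrative helper)."""
    return np.searchsorted(np.sort(sample), t, side="right") / len(sample)


# The two choices give different values at the largest finite sample.
print(ecdf_at(dropped, 3.0), ecdf_at(as_inf, 3.0))
```

Because the two options lead to different curves, raising an error and letting the caller choose avoids a silent, possibly unintended normalization.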

@jklymak jklymak marked this pull request as draft February 6, 2023 18:17
@jklymak
Member

jklymak commented Feb 6, 2023

Draft until above is implemented

@anntzer
Contributor Author

anntzer commented Feb 8, 2023

Done.

@anntzer anntzer marked this pull request as ready for review February 8, 2023 20:40
@jklymak
Member

jklymak commented Feb 11, 2023

@oscargus did you have a chance to take a second look at this?

Member

@dstansby dstansby left a comment

Looks good to me. I have some minor suggestions, and the version change in the docs is blocking, but feel free to self-merge when that's done.

"""
Compute and plot the empirical cumulative distribution function of *x*.

.. versionadded:: 3.7

This needs updating to 3.8


Returns
-------
Line2D

Link this?

x = x[argsort]
if weights is None:
    # Ensure that we end at exactly 1, avoiding floating point errors.
    cweights = (1 + np.arange(len(x))) / len(x)

I'm a bit confused as to what cweights is here. Perhaps a comment explaining what it is would be good?
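
For context, `cweights` appears to hold the cumulative, normalized weights, that is, the ECDF value reached at each sorted sample. The following NumPy sketch mirrors the two branches quoted in this review, using made-up data and weights; the variable names with suffixes are hypothetical.

```python
import numpy as np

x = np.array([2.0, 1.0, 3.0, 1.5])        # hypothetical data
weights = np.array([1.0, 3.0, 2.0, 2.0])  # hypothetical weights

argsort = np.argsort(x)
x_sorted = x[argsort]

# Unweighted: each sample contributes 1/n. Writing the running total as
# (1..n)/n makes the last entry exactly n/n == 1.
cweights_unweighted = (1 + np.arange(len(x))) / len(x)

# Weighted: reorder the weights like x, normalize, then accumulate.
w_sorted = np.take(weights, argsort)
cweights_weighted = np.cumsum(w_sorted / np.sum(w_sorted))

print(x_sorted, cweights_unweighted, cweights_weighted)
```

In both cases the i-th entry is the fraction of total weight carried by the samples up to and including the i-th smallest one.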

    # Ensure that we end at exactly 1, avoiding floating point errors.
    cweights = (1 + np.arange(len(x))) / len(x)
else:
    weights = np.take(weights, argsort)

Suggested change
- weights = np.take(weights, argsort)
+ # Sort weights
+ weights = np.take(weights, argsort)

    weights = np.take(weights, argsort)
    cweights = np.cumsum(weights / np.sum(weights))
if compress:
    compress_idxs = [0, *(x[:-1] != x[1:]).nonzero()[0] + 1]

Suggested change
- compress_idxs = [0, *(x[:-1] != x[1:]).nonzero()[0] + 1]
+ # Get indices of unique values in x
+ compress_idxs = [0, *(x[:-1] != x[1:]).nonzero()[0] + 1]
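
The idea behind compressing repeated values can be sketched as follows. This is not the PR's code: the excerpt above builds the first index of each run of equal values, and the lines that consume those indices are not shown. The sketch instead keeps the last index of each run (the name `last_idxs` is hypothetical), so that the cumulative weight retained for each distinct value is the full ECDF value there.

```python
import numpy as np

x = np.sort(np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0]))  # hypothetical data
cweights = (1 + np.arange(len(x))) / len(x)

# Index of the last sample in each run of equal values: positions where
# the next value differs, plus the final index.
last_idxs = np.append((x[:-1] != x[1:]).nonzero()[0], len(x) - 1)

x_unique = x[last_idxs]
cweights_unique = cweights[last_idxs]

print(x_unique, cweights_unique)
```

With many repeated values this reduces the number of points that need to be drawn without changing the shape of the step function.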

@anntzer
Contributor Author

anntzer commented Feb 23, 2023

Thanks, I've handled all the comments.

Development

Successfully merging this pull request may close these issues.

Feature request: proper ECDF
10 participants